Apache CloudStack High Availability Issues

Hi Everyone!

We are building a hyperconverged cluster in our environment using CloudStack. All necessary components are installed and configured, and we can already start virtual machines in different domains. Our setup is as follows:

  • CloudStack management server on a separate host
  • 4 hosts as CloudStack hosts in a single cluster
  • All 4 hosts run multiple OSDs that form a single Ceph pool, which is used as the CloudStack primary storage
  • A NAS connected via NFS as secondary storage

When virtual guests are running distributed across the 4 hosts and we pull the power cable out of one of the hosts, the following happens:

  • First of all, the out-of-band management reports “State: Unknown”, as expected
  • The resource state is still “Enabled”, and the host state stays “Up” for several minutes
  • Then the host state switches to “Alert”, but the virtual guests on that host are still shown as “up and running”, which of course is not true
  • The virtual guests cannot be shut down gracefully because they are not reachable, so their state only changes from running to anything else when they are stopped with the “forced” option
  • After that they are shown as “stopped” and can be started from scratch on the next available host
  • That is everything that happens regarding “High Availability” in this scenario
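
To make the inconsistency concrete, here is a minimal Python sketch of the check we would expect CloudStack to perform on its own. It is only an illustration under assumptions: the host and VM records are assumed to have been fetched beforehand as lists of dicts shaped like the listHosts and listVirtualMachines API responses (fields id, state, hostid, haenable), the sample data is made up, and find_stale_vms is a hypothetical helper name, not part of CloudStack.

```python
from typing import Dict, List


def find_stale_vms(hosts: List[Dict], vms: List[Dict]) -> List[Dict]:
    """Return VMs reported as Running whose host is not reported as Up.

    Assumption: records look like listHosts / listVirtualMachines
    responses (keys "id", "state", "hostid", "haenable").
    """
    # Map host id -> reported host state, e.g. "Up" or "Alert".
    host_state = {h["id"]: h.get("state") for h in hosts}

    stale = []
    for vm in vms:
        if vm.get("state") != "Running":
            continue  # only VMs that still claim to be running matter here
        state = host_state.get(vm.get("hostid"))
        if state != "Up":
            stale.append({
                "name": vm.get("name"),
                "host_state": state,
                "haenable": vm.get("haenable", False),
            })
    return stale


if __name__ == "__main__":
    # Made-up sample data mirroring the scenario described above.
    hosts = [
        {"id": "h1", "state": "Alert", "resourcestate": "Enabled"},
        {"id": "h2", "state": "Up", "resourcestate": "Enabled"},
    ]
    vms = [
        {"name": "vm-a", "state": "Running", "hostid": "h1", "haenable": True},
        {"name": "vm-b", "state": "Running", "hostid": "h2", "haenable": True},
    ]
    for entry in find_stale_vms(hosts, vms):
        print(entry)
```

In our case every HA-enabled VM on the powered-off host would show up in this list indefinitely, because the host only reaches “Alert” and the VMs never leave the running state.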

All VMs are set to “HA enabled”, and the cluster itself is set to “HA enabled”. We also tried enabling HA on the hosts themselves, but they just switch back and forth between the “suspect” and “degraded” states.

We just want the VMs to be started automatically on another host, like the system VMs, which is what the “VM HA” mode should provide by itself. It would already help if CloudStack recognized that the VMs are down at all. We let CloudStack run overnight just to be sure, and the next morning the state was still the same: the host in “alert”, the resource state “enabled”, and the dead VMs still “running”.

Does anyone have an idea whether this is a problem with our combination of CloudStack, KVM, and Ceph, or whether we are just missing something here?

Thanks in advance and best regards