VMware vSphere offers two solutions for ensuring high availability: High Availability (HA) and Fault Tolerance (FT). They complement each other and together keep the virtual machines in a cluster running without interruption.
Using HA requires deploying a cluster. A cluster, as you know, is a group of ESXi hosts combined in order to simplify management. One of the hosts becomes the Master; the first criterion in choosing the Master is which host has access to the largest number of datastores. The Master uses heartbeats over the Management network to check whether the other hosts are available. Heartbeats are also used to check whether a host still has access to the datastore: each ESXi host periodically writes and modifies some small files on the datastore, and the Master checks the timestamps of those files. Fresh timestamps show that the host is still able to make changes to the datastore, which means that the host is active. It may happen that an ESXi host loses connectivity on the Management network but still has access to the Storage network. We call this Isolation. To make sure that it really is isolated, the ESXi host pings the Isolation Address, which by default is nothing more than its default gateway.
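The election and the two heartbeat checks described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how vSphere implements it: the Host structure, the timeout values and the tie-break by host name are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class HostState(Enum):
    LIVE = "live"
    ISOLATED_OR_PARTITIONED = "isolated or partitioned"
    FAILED = "failed"


@dataclass
class Host:
    """Hypothetical record of what is known about one host."""
    name: str
    datastore_count: int
    seconds_since_network_heartbeat: float
    seconds_since_datastore_heartbeat: float


def elect_master(hosts: List[Host]) -> Host:
    # First criterion from the text: access to the most datastores.
    # The tie-break by name is an assumption, used only to make the
    # sketch deterministic.
    return max(hosts, key=lambda h: (h.datastore_count, h.name))


def classify(host: Host,
             network_timeout: float = 15.0,
             datastore_timeout: float = 30.0) -> HostState:
    # Both timeouts are illustrative values, not vSphere defaults.
    network_ok = host.seconds_since_network_heartbeat <= network_timeout
    datastore_ok = host.seconds_since_datastore_heartbeat <= datastore_timeout
    if network_ok:
        return HostState.LIVE
    if datastore_ok:
        # Silent on the Management network, but still updating its
        # files on the datastore: the host is running, only cut off.
        return HostState.ISOLATED_OR_PARTITIONED
    return HostState.FAILED


hosts = [
    Host("esx-a", 5, 1.0, 2.0),
    Host("esx-b", 3, 60.0, 4.0),
    Host("esx-c", 2, 60.0, 120.0),
]
master = elect_master(hosts)
states = {h.name: classify(h) for h in hosts}
```

The point of the second check is visible in the middle host: without the datastore heartbeat the Master could not tell a cut-off host from a dead one.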
When we talk about HA we have to mention VMCP – VM Component Protection, a service that looks for problems between the ESXi host and the storage. Within VMCP we have:
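Permanent Device Loss (PDL) – a permanent loss of connectivity to a storage device. The storage array itself reports that the device is gone (for example, the LUN has been removed or has failed), so the host does not expect it to come back.
All Paths Down (APD) – a loss of all paths to the storage that the host treats as transient. The host gets no explanation from the array, so the problem may reside in a host component such as a NIC or HBA, in the network, or in the storage, and the device may still return.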
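The difference between the two conditions can be sketched as a small decision function. This is a simplified illustration under stated assumptions: the function and parameter names are hypothetical, the host is assumed to receive an explicit signal when a loss is permanent, and the 140-second value is the commonly cited default APD timeout, used here only as a parameter default.

```python
from enum import Enum


class StorageState(Enum):
    ACCESSIBLE = "accessible"
    PDL = "permanent device loss"
    APD = "all paths down, still waiting"
    APD_TIMEOUT = "all paths down, timeout expired"


def classify_storage(paths_up: int,
                     permanent_loss_signalled: bool,
                     seconds_all_paths_down: float,
                     apd_timeout: float = 140.0) -> StorageState:
    # An explicit "permanent" signal wins: there is nothing to wait for.
    if permanent_loss_signalled:
        return StorageState.PDL
    if paths_up > 0:
        return StorageState.ACCESSIBLE
    # No paths and no explanation: treat the loss as transient
    # until the timeout (an assumed default) runs out.
    if seconds_all_paths_down < apd_timeout:
        return StorageState.APD
    return StorageState.APD_TIMEOUT


state_now = classify_storage(paths_up=0,
                             permanent_loss_signalled=False,
                             seconds_all_paths_down=20.0)
```

Separating APD from APD_TIMEOUT reflects why the distinction matters: with PDL a reaction can start immediately, while with APD it makes sense to wait and see whether the paths return.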
What kind of high availability does HA offer us?
Monitoring ESXi hosts – when a host fails, its virtual machines are restarted on a different ESXi host
Monitoring virtual machines – a virtual machine that fails is restarted on the same ESXi host
Monitoring running applications – when an application crashes, the virtual machine it runs on is restarted
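The three levels of monitoring boil down to a simple mapping from what failed to what HA does about it. The sketch below only restates that mapping; the enum names are made up for the example, and restarting on the same host after an application failure is an assumption.

```python
from enum import Enum


class Failure(Enum):
    HOST = "host"
    VIRTUAL_MACHINE = "virtual machine"
    APPLICATION = "application"


class Response(Enum):
    RESTART_VMS_ON_OTHER_HOST = "restart VMs on a different host"
    RESTART_VM_ON_SAME_HOST = "restart the VM on the same host"


def ha_response(failure: Failure) -> Response:
    # A dead host cannot restart anything itself, so its VMs move.
    if failure is Failure.HOST:
        return Response.RESTART_VMS_ON_OTHER_HOST
    # VM and application failures leave the host healthy, so the VM
    # is restarted where it already is (assumed for APPLICATION).
    return Response.RESTART_VM_ON_SAME_HOST


responses = {f: ha_response(f) for f in Failure}
```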
Fault Tolerance, in turn, mirrors a virtual machine as two identical copies on two different ESXi hosts: one copy is the primary and the other is the secondary. In case of failure, switching between the two hosts takes milliseconds. In order to use Fault Tolerance we have to run:
– High Availability
– Distributed Resource Scheduler (DRS), but we use DRS only for host resources, not for storage! The VMX configuration files have to be kept on shared storage; the VMDK files with the system disks do not have to be.
– Enhanced vMotion Compatibility (EVC) – if the processors in the Fault Tolerance hosts are not fully compatible, EVC masks and hides the features that are not present in both processors.
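The masking idea behind EVC can be pictured as a set intersection: only the CPU features that every host has stay visible. The sketch below is a simplification under that assumption. Real EVC modes are predefined baselines per CPU generation rather than an arbitrary intersection, and the host names and feature lists here are invented for the example.

```python
from typing import Dict, Set


def common_baseline(host_features: Dict[str, Set[str]]) -> Set[str]:
    """Features present on every host (empty if there are no hosts)."""
    feature_sets = list(host_features.values())
    if not feature_sets:
        return set()
    return set.intersection(*feature_sets)


def hidden_features(host_features: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each host, the features that would have to be masked."""
    baseline = common_baseline(host_features)
    return {name: feats - baseline for name, feats in host_features.items()}


# Hypothetical hosts with made-up feature lists.
hosts = {
    "esx-a": {"sse4.2", "avx", "avx2"},
    "esx-b": {"sse4.2", "avx"},
}
baseline = common_baseline(hosts)
hidden = hidden_features(hosts)
```

With these two hosts the newer processor loses the one feature the older processor lacks, so both copies of the virtual machine see the same CPU.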