vCenter Server 6.5 High Availability (vCHA)

VMware vCenter Server sits at the heart of vSphere and provides the services used to centrally manage virtual infrastructure components such as ESXi hosts, virtual machines, storage, and networking resources. vCenter Server is therefore an important element in ensuring the business continuity of a virtual infrastructure: it must be protected from hardware and software failures in the environment and must recover transparently from such failures. With vSphere 6.5, VMware introduced a high availability solution for vCenter Server, known as vCenter Server High Availability (vCHA). vCenter High Availability is exclusively available for the vCenter Server Appliance (VCSA) and not for Windows-based vCenter Server deployments.

From an architecture perspective, vCenter HA supports both embedded and external Platform Services Controllers. 

  • An embedded Platform Services Controller instance can be used when there are no other vCenter Server or Platform Services Controller instances within the single sign-on domain. 
  • An external Platform Services Controller instance is required when there are multiple vCenter Server instances in an Enhanced Linked Mode configuration. 
When using vCenter HA with an external Platform Services Controller deployment, an external load balancer is required to provide high availability to the Platform Services Controller instances. Supported load balancers for Platform Services Controller instances in vSphere 6.5 include VMware NSX, F5 BIG-IP LTM, and Citrix NetScaler.
 
The vCenter High Availability architecture uses a three-node cluster to provide availability against multiple types of hardware and software failures. 

 

Image: VMware
  • A vCenter HA cluster consists of one Active node that runs the active vCenter Server instance and serves client requests.
  • One Passive node that takes over the Active role in the event of an Active node failure. 
  • One quorum node, called the Witness node, that resolves the classic split-brain problem caused by network failures in distributed systems that maintain replicated data. 

Traditional architectures use some form of shared storage to solve the split-brain problem. However, in order to support a vCenter HA cluster spanning multiple datacenters, the vCHA design does not assume a shared storage–based deployment. As a result, one node in the vCenter HA cluster is permanently designated as a quorum node, or Witness node. The other two nodes in the cluster dynamically assume the roles of Active and Passive nodes. 

vCenter Server availability is assured as long as at least two nodes in the cluster are running and can communicate. However, a cluster with only two functioning nodes is considered to be running in a degraded state: a subsequent failure in a degraded cluster means vCenter services are no longer available. 
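The Witness node's tie-breaking role can be pictured as a simple 2-of-3 majority rule: a node keeps (or takes) the Active role only while it can still reach at least one peer, and an isolated node stops its services. The sketch below is purely illustrative; the node names and the decide_role helper are hypothetical and not part of the vCHA implementation.

# Illustrative 2-of-3 quorum rule; not the actual vCHA code.
NODES = {"active", "passive", "witness"}

def decide_role(node, reachable):
    """Decide what a node should do given the set of peers it can reach."""
    if not reachable:
        return "isolate"      # minority partition: stop all services
    if node == "active":
        return "stay-active"  # still part of a majority, keep serving
    if node == "passive" and "active" not in reachable:
        return "promote"      # Active is unreachable and the Witness agrees
    return "standby"          # Active is healthy, keep replicating

# Example: a network partition cuts the Active node off from both peers.
print(decide_role("active", set()))                   # isolate
print(decide_role("passive", {"witness"}))            # promote
print(decide_role("passive", {"active", "witness"}))  # standby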

A vCenter Server appliance is stateful and requires a strong, consistent state for it to work correctly. The appliance state (configuration state or runtime state) is mainly composed of:
  • Database data (stored in the embedded PostgreSQL database) 
  • Flat files (for example, configuration files). 
The appliance state must be replicated for vCHA failover to work properly. For the state stored inside the PostgreSQL database, vCHA uses the PostgreSQL native replication mechanism to keep the database data of the primary and secondary in sync. For flat files, a native Linux tool, rsync, is used for replication. Because the vCenter Server appliance requires strong consistency, a synchronous form of replication is required to replicate the appliance state from the Active node to the Passive node.
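vCHA configures and manages this replication on its own, so the snippet below is not a documented vCHA procedure; it is only a minimal sketch of how synchronous replication status can be observed on a PostgreSQL primary through the pg_stat_replication view, assuming monitoring credentials exist (all connection parameters are placeholders).

# Illustration only: inspect replication state on a PostgreSQL primary.
import psycopg2

conn = psycopg2.connect(host="primary.example.com", dbname="postgres",
                        user="monitor", password="placeholder")
with conn, conn.cursor() as cur:
    # pg_stat_replication lists attached standbys; sync_state shows
    # whether each standby replicates synchronously ('sync') or not.
    cur.execute("SELECT application_name, state, sync_state "
                "FROM pg_stat_replication")
    for app_name, state, sync_state in cur.fetchall():
        print(f"{app_name}: state={state}, sync_state={sync_state}")
conn.close()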

 

Image: VMware


The design assumes low-latency, high-bandwidth network connectivity between the Active and Passive nodes in order to guarantee a recovery point objective (RPO) of zero.

A vCHA cluster requires a vCHA network that is separate from the management network of the vCenter Server appliance. Clients access the Active vCenter Server appliance via the management network interface, which is public.

The roles of each node type in a vCenter HA cluster, as shown in the figure above, are:

Active Node: 

  • Runs the active instance of vCenter Server.
  • Enables and uses the public IP address of the cluster. 

Passive Node: 

  • Runs as the passive instance of vCenter Server. 
  • Constantly receives state updates from the Active node in synchronous mode. 
  • Equivalent to the Active node in terms of resources. 
  • Takes over the role of Active Node in the event of failover. 

Witness Node: 

  • Serves as a quorum node.
  • Used to break a tie in the event of a network partition causing a situation where the Active and Passive nodes cannot communicate with each other. 
  • A lightweight VM utilizing minimal hardware resources. 
  • Does not take over the role of the Active or Passive node. 

In the event of the Active vCenter Server appliance failing due to a hardware, software, or network failure, the Passive node takes over the role of the Active node, assumes the public IP address of the cluster, and starts serving client requests. Clients are expected to log in to the appliance again for continued access. Because the HA solution utilizes synchronous database replication, there is no data loss during failover (RPO = 0).

Availability of the vCenter Server appliance under various failure conditions:

Active node failure:

  • As long as the Passive node and the Witness node can communicate with each other, the Passive node will promote itself to Active and start serving client requests. 

Passive node failure:

  • As long as the Active node and the Witness node can communicate with each other, the Active node will continue to operate as Active and continue to serve client requests. 

Witness node failure:

  • As long as the Active node and the Passive node can communicate with each other, the Active node will continue to operate as Active and continue to serve client requests. 
  • The Passive node will continue to watch the Active node for failover. 

More than one node fails or is isolated:

  • This means all three nodes (Active, Passive, and Witness) cannot communicate with each other.
  • This is more than a single failure; when it happens, the cluster is assumed to be non-functional and availability is impacted, because vCHA is not designed to handle multiple simultaneous failures. 

Isolated node behaviour:

  • When a single node gets isolated from the cluster, it is automatically taken out of the cluster and all services are stopped. For example, if an Active node is isolated, all services are stopped to ensure that the Passive node can take over as long as it is connected to the Witness node.
  • Isolated node detection takes into consideration intermittent network glitches and resolves to an isolated state only after all retry attempts have been exhausted. 

A client connecting to the vCenter Server appliance uses the public IP address. In the event of failover, the Passive node takes over the exact personality of the failed Active node, including the public IP address. The target recovery time objective (RTO) is measured in minutes (about 5 minutes), during which clients should be prepared for requests to fail.
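Because clients must re-authenticate after a failover and the RTO is a few minutes, a client-side retry loop is a natural pattern. The sketch below is a hypothetical example that re-creates a session against the vSphere Automation REST session endpoint; the host name, credentials, timeout, and retry interval are placeholders.

# Hypothetical client-side re-login loop around a vCHA failover.
import time
import requests

VCENTER = "https://vcenter.example.com"   # public IP / FQDN of the cluster

def login(user, password):
    """Create a new API session and return its token."""
    resp = requests.post(VCENTER + "/rest/com/vmware/cis/session",
                         auth=(user, password), verify=False)
    resp.raise_for_status()
    return resp.json()["value"]

def login_with_retry(user, password, timeout=600, interval=30):
    """Keep retrying until the (possibly failed-over) node answers again."""
    deadline = time.time() + timeout
    while True:
        try:
            return login(user, password)
        except requests.RequestException:
            if time.time() >= deadline:
                raise
            time.sleep(interval)   # RTO is measured in minutes, so back off

session_id = login_with_retry("administrator@vsphere.local", "placeholder")
print("Re-authenticated, session id:", session_id)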

There are two modes in which vCHA can be deployed.

  • Basic mode.
  • Advanced mode.

Basic Mode:

  • The basic workflow can be used in most scenarios in which all vCHA nodes run within the same cluster. 
  • This workflow automatically creates the passive and witness nodes. 
  • It also creates vSphere DRS anti-affinity rules if vSphere DRS is enabled on the destination cluster and uses VMware vSphere Storage DRS for initial placement if enabled (an illustrative sketch of such a rule appears after this list). 
  • Some flexibility is provided in this workflow, so you can choose specific destination hosts, datastores, and networks for each node. This is a simple way to get a vCHA cluster up and running.
  • A step-by-step walkthrough of the Basic workflow is available on the VMware Walkthrough portal.
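The Basic workflow creates this DRS anti-affinity rule automatically; purely as an illustration of what such a rule looks like, the following pyVmomi sketch adds an anti-affinity rule for three already-deployed node VMs. The cluster and VM objects are assumed to have been looked up beforehand, and the rule name is a placeholder.

# Illustration only: the Basic workflow creates this rule automatically.
from pyVmomi import vim

def add_vcha_anti_affinity(cluster, active_vm, passive_vm, witness_vm):
    """Keep the three vCHA node VMs on separate hosts via a DRS rule."""
    rule = vim.cluster.AntiAffinityRuleSpec(
        name="vCHA-node-anti-affinity",   # placeholder rule name
        enabled=True,
        vm=[active_vm, passive_vm, witness_vm],
    )
    spec = vim.cluster.ConfigSpecEx(
        rulesSpec=[vim.cluster.RuleSpec(operation="add", info=rule)]
    )
    # Reconfigure the cluster; modify=True merges with the existing config.
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)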

Advanced Mode:

  • The Advanced workflow is an alternative that can be used when the Active, Passive, and Witness nodes are to be deployed to different clusters, vCenter Server instances, or even different datacenters. 
  • This process requires the customer to manually clone the source vCenter Server instance for the passive and witness nodes and to then place those nodes in the chosen locations with the appropriate IP address settings. 
  • This process is more involved, but it provides greater flexibility for customers who need it.
  • A step-by-step walkthrough of the Advanced workflow is available on the VMware Walkthrough portal.
