A failover cluster is a set of computer servers that work together to provide either high availability (HA) or continuous availability (CA). If one of the servers goes down, another node in the cluster can assume its workload with minimal or no downtime through a process referred to as failover.
Some failover clusters use physical servers only, whereas others involve virtual machines (VMs).
The main purpose of a failover cluster is to provide either CA or HA for applications and services. CA clusters, also referred to as fault tolerant (FT) clusters, allow end users to keep using applications and services without experiencing any timeouts if a server fails. HA clusters might cause a brief interruption in service for customers, but the system recovers automatically with no data loss and minimal downtime.
A cluster is made up of two or more nodes (servers) that transfer data, along with the software that processes it, over physical cables or a dedicated secure network. Other kinds of clustering technology can be used for load balancing, storage, and concurrent or parallel processing. Some implementations combine failover clusters with additional clustering technology.
To protect data, the failover cluster nodes are connected by a dedicated network, which provides the backup path that CA and HA operation depends on.
How Failover Clusters Work
While CA failover clusters are designed for 100 percent availability, HA clusters aim for 99.999 percent availability, also known as “five nines.” That allowance amounts to no more than 5.26 minutes of downtime per year. CA clusters offer greater availability, but they require more hardware to run, which increases their overall cost.
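The “five nines” figure follows directly from the number of minutes in a year; the short Python snippet below reproduces the arithmetic (the three- and four-nines rows are included only for comparison).

```python
# Back-of-the-envelope check of the downtime figures quoted above.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: about {downtime:.2f} minutes of downtime per year")
# five nines works out to about 5.26 minutes per year
```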
High Availability Failover Clusters
In a high availability cluster, groups of independent servers share resources and data throughout the system. All nodes in a failover cluster have access to shared storage. High availability clusters also include a monitoring connection that servers use to check the “heartbeat” or health of the other servers. At any time, at least one of the nodes in a cluster is active, while at least one is passive.
In a simple two-node configuration, for example, if Node 1 fails, Node 2 uses the heartbeat connection to recognize the failure and then configures itself as the active node. Clustering software installed on every node in the cluster makes sure that clients connect to an active node.
Larger configurations may use dedicated servers to perform cluster management. A cluster management server constantly sends out heartbeat signals to determine if any of the nodes are failing, and if so, to direct another node to assume the load.
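As a rough illustration of this heartbeat-and-promotion logic, here is a minimal Python sketch. The node names, the five-second timeout, and the monitor() pass are invented for the example and do not correspond to any vendor's clustering software.

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before a node is presumed failed

class Node:
    def __init__(self, name, active=False):
        self.name = name
        self.active = active
        self.last_heartbeat = time.monotonic()

    def beat(self):
        """Record a heartbeat received from this node."""
        self.last_heartbeat = time.monotonic()

    def is_alive(self):
        return time.monotonic() - self.last_heartbeat < HEARTBEAT_TIMEOUT

def monitor(nodes):
    """One pass of the cluster manager: promote a passive node if the active one is silent."""
    active = next((n for n in nodes if n.active), None)
    if active is not None and not active.is_alive():
        active.active = False
        standby = next((n for n in nodes if n is not active and n.is_alive()), None)
        if standby is not None:
            standby.active = True  # failover: the passive node becomes active
            print(f"{active.name} failed; {standby.name} is now the active node")

# Two-node example: node1 starts active, node2 passive.
cluster = [Node("node1", active=True), Node("node2")]
cluster[0].last_heartbeat -= HEARTBEAT_TIMEOUT + 1  # simulate node1 going silent
monitor(cluster)  # prints: node1 failed; node2 is now the active node
```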
Some cluster management software provides HA for virtual machines (VMs) by pooling the machines and the physical servers they reside on into a cluster. If failure occurs, the VMs on the failed host are restarted on alternate hosts.
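The sketch below illustrates that restart-on-surviving-hosts idea under simple assumptions: the host and VM names are made up, and “least loaded” is measured by VM count rather than by real CPU or memory reservations.

```python
# Hypothetical inventory of which VMs run on which host.
placement = {
    "host-a": ["vm-1", "vm-2"],
    "host-b": ["vm-3"],
    "host-c": ["vm-4", "vm-5"],
}

def restart_vms_elsewhere(failed_host, placement):
    """Restart the failed host's VMs on the surviving hosts with the fewest VMs."""
    orphans = placement.pop(failed_host, [])
    for vm in orphans:
        target = min(placement, key=lambda h: len(placement[h]))  # least-loaded survivor
        placement[target].append(vm)
        print(f"restarting {vm} on {target}")

restart_vms_elsewhere("host-a", placement)  # vm-1 and vm-2 restart on host-b and host-c
```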
Shared storage does pose a risk as a potential single point of failure. However, redundant disk layouts such as RAID 6, which tolerates the loss of any two drives, or RAID 10 mirroring can help ensure that service continues even when hard drives fail.
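As a back-of-the-envelope comparison (not storage-vendor guidance), the snippet below contrasts the usable capacity and failure tolerance of the two layouts.

```python
# Usable capacity, counted in whole drives, for the two RAID layouts mentioned above.
def raid6_usable(drives):
    return drives - 2   # two drives' worth of space hold the double parity

def raid10_usable(drives):
    return drives // 2  # every drive is mirrored

for n in (4, 6, 8):
    print(f"{n} drives -> RAID 6 usable: {raid6_usable(n)}, RAID 10 usable: {raid10_usable(n)}")

# RAID 6 survives the loss of any two drives; RAID 10 survives two failures only
# when they fall in different mirrored pairs.
```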
If all servers are plugged into the same power grid, electrical power can represent another single point of failure. The nodes can be safeguarded by equipping each with a separate uninterruptible power supply (UPS).
Continuous Availability Failover Clusters
In contrast to the HA model, a fault-tolerant cluster consists of multiple systems that share a single copy of a computer’s OS. Software commands issued by one system are also executed on the other systems.
CA requires the organization to use fault tolerant computer equipment and a secondary UPS. In a CA failover cluster, the operating system (OS) exposes an interface that lets a software programmer check critical data at predetermined points in a transaction. CA can only be achieved by using a continuously available and nearly exact copy of the physical or virtual machine running the service. This redundancy model is called 2N.
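A minimal sketch of the 2N idea, assuming a toy key-value service: every state-changing command is applied to both the primary copy and its standby replica, and a checkpoint verifies that the two copies still match. The class and method names are illustrative, not part of any CA product.

```python
class ReplicatedCounterService:
    """Toy service kept in lockstep on a primary and a 2N standby copy."""

    def __init__(self):
        self.primary = {}
        self.replica = {}  # continuously maintained, nearly exact copy

    def apply(self, key, delta):
        # Every command is executed on both copies so the replica can take over at any time.
        for state in (self.primary, self.replica):
            state[key] = state.get(key, 0) + delta

    def checkpoint(self):
        # Predetermined point in the transaction where critical data is verified.
        assert self.primary == self.replica, "replica has diverged from primary"
        return dict(self.primary)

svc = ReplicatedCounterService()
svc.apply("orders", 3)
svc.apply("orders", 1)
print(svc.checkpoint())  # {'orders': 4} on both copies
```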
CA systems can compensate for many different sorts of failures. A fault tolerant system can automatically detect the failure of:
- A hard drive
- A central processing unit (CPU)
- An I/O subsystem
- A power supply
- A network component
The failure point can be identified immediately, and a backup component or procedure can take its place instantly with no interruption in service, as sketched below.
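The sketch shows that detect-and-substitute loop in miniature; the component names, health flags, and spare labels are all invented for illustration, since a real fault tolerant system reads hardware status rather than a Python dictionary.

```python
# Illustrative component health map covering the failure types listed above.
components = {
    "disk":    {"healthy": True,  "spare": "disk-spare"},
    "cpu":     {"healthy": True,  "spare": "cpu-spare"},
    "io":      {"healthy": False, "spare": "io-spare"},   # simulated I/O subsystem failure
    "power":   {"healthy": True,  "spare": "psu-spare"},
    "network": {"healthy": True,  "spare": "nic-spare"},
}

def failover_components(components):
    """Identify failed components and switch service to their backups."""
    for name, status in components.items():
        if not status["healthy"]:
            print(f"{name} failed; switching to {status['spare']} with no service interruption")

failover_components(components)  # prints the substitution for the failed I/O subsystem
```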
Clustering software can be used to group two or more servers so they act as a single virtual server, and many other CA failover setups are possible. For example, a cluster might be configured so that if one of the virtual servers fails, the others respond by temporarily removing the failed server from the cluster and automatically redistributing its workload among the remaining servers until the downed server is ready to come back online.
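A small sketch of that redistribution step, assuming a toy cluster where each virtual server holds a list of jobs; the server and job names are hypothetical.

```python
def redistribute(workloads, failed_server):
    """Remove the failed server and spread its jobs across the remaining members."""
    orphaned = workloads.pop(failed_server, [])
    survivors = sorted(workloads)  # deterministic round-robin order
    for i, job in enumerate(orphaned):
        workloads[survivors[i % len(survivors)]].append(job)
    return workloads

# Three virtual servers; vs2 drops out of the cluster.
workloads = {"vs1": ["jobA"], "vs2": ["jobB", "jobC"], "vs3": []}
print(redistribute(workloads, "vs2"))
# {'vs1': ['jobA', 'jobB'], 'vs3': ['jobC']}
```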
An alternative to CA failover clusters is the use of a “double” hardware server in which all physical components are duplicated. These servers perform calculations independently and simultaneously on the separate hardware systems, and a dedicated node monitors and synchronizes the results coming from both physical servers. While this approach provides strong protection, it can be even more expensive than other options. Stratus, a maker of these specialized fault tolerant hardware servers, promises that system downtime won’t amount to more than 32 seconds each year. However, the cost of one Stratus server with dual CPUs for each synchronized module is estimated at approximately $160,000 per synchronized module.
Practical Applications of Failover Clusters
Ongoing Availability of Mission Critical Applications
Fault tolerant systems are a necessity for computers used in online transaction processing (OLTP) systems. OLTP, which demands 100 percent availability, is used in airline reservations systems, electronic stock trading, and ATM banking, for example.
Many other types of organizations also use either CA clusters or fault tolerant computers for mission critical applications, such as businesses in the fields of manufacturing, logistics, and retail. Applications include e-commerce, order management, and employee time clock systems.
For clustering applications and services requiring only “five nines” uptime, high availability clusters are generally regarded as adequate.
Disaster Recovery
Disaster recovery is another practical application for failover clusters. Of course, it’s highly advisable for failover servers to be housed at remote sites in the event that a disaster such as a fire or flood takes out all physical hardware and software in the primary data center.
In Windows Server 2016 and 2019, for example, Microsoft provides Storage Replica, a technology that replicates volumes between servers for disaster recovery. The technology includes a stretch failover feature for failover clusters spanning two geographic sites.
By stretching failover clusters, organizations can replicate data among multiple data centers. If disaster strikes at one location, all data continues to exist on failover servers at the other sites.
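Conceptually, site-to-site replication can be sketched as below; the site logs, record format, and synchronous/asynchronous split are assumptions for illustration and are not Storage Replica's actual interface.

```python
# Keep a copy of every write at a remote site so data survives the loss of the primary site.
primary_site_log = []
remote_site_log = []
pending = []  # records waiting to be shipped in asynchronous mode

def write(record, synchronous=True):
    """Append a record at the primary site and replicate it to the remote site."""
    primary_site_log.append(record)
    if synchronous:
        remote_site_log.append(record)  # acknowledged only once both sites have the record
    else:
        pending.append(record)          # shipped later by a background task

def drain_pending():
    """Background replication pass for asynchronous mode."""
    while pending:
        remote_site_log.append(pending.pop(0))

write({"order": 1})                     # synchronous write
write({"order": 2}, synchronous=False)  # asynchronous write
drain_pending()
assert remote_site_log == primary_site_log  # remote site can take over after a disaster
```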
Database Replication
According to Microsoft, Windows Server Failover Clustering (WSFC) was introduced to protect “mission-critical” applications such as its SQL Server database and Microsoft Exchange communications server.
Other database providers offer failover cluster technology for database replication. MySQL Cluster, for example, includes a heartbeat mechanism that detects failures almost instantly, typically within one second, and fails over to other nodes in the cluster with no service interruption to clients. A geographic replication feature enables databases to be mirrored to remote locations.
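From the client's point of view, that kind of failover can be sketched as trying the next data node when the current one stops responding. The sketch below is a generic illustration, not MySQL Cluster's actual client API; the node names and the NodeDown signal are invented.

```python
class NodeDown(Exception):
    """Raised when a data node does not respond."""

def query_with_failover(nodes, run_query):
    """Try the query on each node in turn until one answers."""
    last_error = None
    for node in nodes:
        try:
            return run_query(node)
        except NodeDown as exc:
            last_error = exc  # node presumed failed; move on to the next one
    raise RuntimeError("no cluster node available") from last_error

def fake_run_query(node):
    if node == "data-node-1":
        raise NodeDown(node)  # simulate the failed node
    return f"result from {node}"

print(query_with_failover(["data-node-1", "data-node-2"], fake_run_query))
# result from data-node-2
```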
Failover Cluster Types
VMware Failover Clusters
Among its virtualization products, VMware offers several tools for VM clusters. vSphere Fault Tolerance provides a CA architecture that maintains an exact, continuously synchronized replica of a VMware virtual machine on a separate physical host, allowing the replica to take over immediately if the original VM’s host fails.
A second product, VMware vSphere HA, provides HA for VMs by pooling them and their hosts into a cluster for automatic failover. The tool also does not rely on external components like DNS, which reduces potential points of failure.
Windows Server Failover Cluster (WSFC)
You can create Hyper-V failover clusters with WSFC, a feature in Windows Server 2016 and 2019 that monitors clustered physical servers and provides failover if needed. WSFC also monitors clustered roles, formerly referred to as clustered applications and services. If a clustered role isn’t working correctly, it is either restarted or moved to another node.
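The restart-or-move decision can be sketched as a simple policy; the restart threshold, role name, and node names below are assumptions for illustration, not WSFC’s actual configuration surface.

```python
MAX_LOCAL_RESTARTS = 2  # illustrative threshold before giving up on the current node

def handle_unhealthy_role(role, restart_counts, current_node, other_nodes):
    """Restart a failing role in place a few times, then fail it over to another node."""
    restart_counts[role] = restart_counts.get(role, 0) + 1
    if restart_counts[role] <= MAX_LOCAL_RESTARTS:
        return f"restart {role} on {current_node}"
    restart_counts[role] = 0
    return f"move {role} to {other_nodes[0]}"

counts = {}
for _ in range(3):
    print(handle_unhealthy_role("file-server-role", counts, "node1", ["node2"]))
# restart twice on node1, then move file-server-role to node2
```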
WSFC builds on Microsoft’s earlier Cluster Shared Volumes (CSV) technology to provide a consistent, distributed namespace for accessing shared storage from all nodes. In addition, WSFC supports CA file share storage for SQL Server and Microsoft Hyper-V cluster VMs, as well as HA roles running on physical servers and Hyper-V cluster VMs.
SQL Server Failover Clusters
For SQL Server, Microsoft offers Always On, an HA solution that uses WSFC as a platform technology, registering SQL Server components as WSFC cluster resources. According to Microsoft, related resources are combined into a role that depends on other WSFC resources. WSFC can then identify and communicate the need to either restart a SQL Server instance or automatically fail it over to a different node.
Red Hat Linux Failover Clusters
OS makers other than Microsoft also provide their own failover cluster technologies. For example, Red Hat Enterprise Linux (RHEL) users can create HA failover clusters with the High Availability Add-On and Red Hat Global File System (GFS/GFS2). Support is provided for single-cluster stretch configurations spanning multiple sites as well as multi-site or “disaster-tolerant” clusters. The multi-site clusters generally use storage area network (SAN)-enabled data storage replication.