The first article in this new series provided a brief overview of the two main high-availability technologies incorporated into the Windows Server 2003 operating system platform: server clustering and network load balancing. We focused on the former, describing some of the basic terminology and elementary design principles.
We return to server clustering, this time looking at inter-node communication, specifically, Quorum. Quorum is at the heart of the three main server clustering models: Single Shared Quorum, Single Local Quorum, and Majority Node Set Quorum.
As explained, among the most important of these design principles is maintaining a single instance of each clustered resource (ensuring its fault tolerance while preventing “split-brain” scenarios). This is accomplished through two basic mechanisms: resource virtualization and inter-node communication.
Resource virtualization requires that each clustered service or application be represented by a number of related software and hardware components, such as disks, IP addresses, network names, and file shares, which can be assigned to any server participating in the cluster and easily transferred between servers if necessary. This is made possible by setting up these servers in a very specific manner: they must be able to access the same set of shared storage devices, reside on the same subnet, and belong to the same domain. For example, to create a highly available network file share, you would identify a shared disk drive hosting the share, an IP address (with a corresponding network name) from which the share can be accessed remotely, and the target file share, with its name and access permissions.
Although this sounds complicated and time-consuming, all necessary resources are pre-defined, making the procedure fairly straightforward. Once resources are identified and configured (by specifying disk drive letters, assigning unique IP addresses, network names, or file share characteristics), they can be assigned to any server participating in the cluster (as long as that server is capable of supporting them). Resources can then be easily moved between nodes if the one currently hosting them fails.
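The file share example above can be modeled as a small data structure. This is a minimal sketch rather than the cluster service's actual object model: the class names, the node names, the sample values, and the dependency arrangement (the share depending on the disk and network name, the network name depending on the IP address) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """One virtualized cluster resource (names and kinds are illustrative)."""
    name: str
    kind: str  # e.g. "Physical Disk", "IP Address", "Network Name", "File Share"
    depends_on: list = field(default_factory=list)


@dataclass
class ResourceGroup:
    """Resources that are owned by one node and move between nodes as a unit."""
    name: str
    owner: str
    resources: list = field(default_factory=list)

    def move_to(self, node: str) -> None:
        # The whole group changes owner together, so the share never ends up
        # on a different node than its disk, address, or network name.
        self.owner = node


def bring_online_order(group: ResourceGroup) -> list:
    """Return resource names ordered so dependencies precede dependents."""
    ordered, seen = [], set()

    def visit(res: Resource) -> None:
        if res.name in seen:
            return
        seen.add(res.name)
        for dep in res.depends_on:
            visit(dep)
        ordered.append(res.name)

    for res in group.resources:
        visit(res)
    return ordered


# Hypothetical values for the highly available file share described above.
disk = Resource("Disk S:", "Physical Disk")
ip = Resource("10.0.0.50", "IP Address")
net_name = Resource("FILESRV1", "Network Name", depends_on=[ip])
share = Resource("Reports", "File Share", depends_on=[disk, net_name])

group = ResourceGroup("File Share Group", owner="NODE1",
                      resources=[share, net_name, ip, disk])
print(bring_online_order(group))
group.move_to("NODE2")  # simulate moving the group after NODE1 fails
print(group.owner)
```

Treating the group, not the individual resource, as the unit of ownership is what keeps related components together when they are transferred to another node.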
The Importance of Quorum
Inter-node communication is facilitated through heartbeat signals carried over redundant network connections between cluster members and through Quorum’s presence, which determines how resource ownership should be handled. As we pointed out, Quorum has the additional important function of storing the most up-to-date cluster configuration, copied subsequently to a dedicated registry hive on each node. Local copies are referenced when nodes join the cluster during startup. Because of its significance in clustering architecture, Quorum also serves as the basis for three main server clustering models:
- Single Shared Quorum: Quorum is implemented as the Physical Disk clustered resource.
- Single Local Quorum: Quorum is implemented as the Local Quorum clustered resource.
- Majority Node Set Quorum: Quorum is implemented as the Majority Node Set clustered resource.
Single Shared Quorum clusters are by far the most popular among server cluster implementations. They most closely match the traditional clustering design (reflected by continuing support for this model since the introduction of Microsoft Cluster Server in Windows NT 4.0 Server Enterprise Edition), offering high availability of resources representing a wide variety of services and applications, as well as simplicity of installation and configuration.
As their name indicates, Single Shared Quorum clusters use a storage design that enables every cluster member to access the same set of disks. While the underlying hardware varies widely (and might involve such technologies as SCSI, SANs, NAS, or iSCSI, which we will review more closely in our next article), the basic premise remains the same.
Only one instance of any specific resource is permitted at any given time within the cluster. The same applies to Quorum, located on a highly available disk volume that is physically connected via a SCSI bus, Fibre Channel links, or network infrastructure to all servers participating in the cluster. Ownership of the shared volume is arbitrated to ensure it is granted only to a single node, thus preventing other nodes from accessing it at the same time (such a situation would likely result in data corruption).
This arbitration is typically handled using internal SCSI commands (such as SCSI reserve and SCSI release) as well as bus, Target, or Logical Unit Number (LUN) resets. The specifics depend on the type of storage technology implemented. Note that support for a clustering installation is contingent on strict compliance with the Hardware Compatibility List (which is part of the Windows Server Catalog, containing all clustering solutions certified by Microsoft). It is therefore critical to verify that the system you intend to purchase and deploy is listed there. Quorum, in this case, is implemented as the Physical Disk resource, which requires a separate volume accessible to all cluster nodes (clustering setup determines automatically whether the volume you selected satisfies the necessary criteria).
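The reserve/release idea can be illustrated with a toy model. This is only a sketch of the exclusivity rule described above, not a real SCSI interface and not the cluster service's actual arbitration algorithm: the `SharedDisk` class, its method names, and the node names are hypothetical.

```python
class SharedDisk:
    """Toy stand-in for a shared Quorum volume (no real SCSI I/O involved)."""

    def __init__(self) -> None:
        self.reserved_by = None  # name of the node holding the reservation

    def reserve(self, node: str) -> bool:
        # Succeeds only if the disk is free or already held by this node,
        # so at most one node can own the volume at any time.
        if self.reserved_by in (None, node):
            self.reserved_by = node
            return True
        return False

    def release(self, node: str) -> bool:
        # Only the current owner can give up its reservation.
        if self.reserved_by == node:
            self.reserved_by = None
            return True
        return False

    def reset(self) -> None:
        # Models a bus, Target, or LUN reset: any reservation is cleared,
        # which lets a surviving node claim the disk if the owner has failed.
        self.reserved_by = None


quorum_disk = SharedDisk()
print(quorum_disk.reserve("NODE1"))  # NODE1 becomes the owner
print(quorum_disk.reserve("NODE2"))  # refused while NODE1 holds the disk
quorum_disk.reset()                  # NODE1 is assumed to have failed
print(quorum_disk.reserve("NODE2"))  # NODE2 takes ownership
```

In this simplified model a reset always clears the reservation; how a healthy owner defends its reservation after a reset depends on the storage technology and is not modeled here.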
Unfortunately, most of the hardware required to set up clustered servers is relatively expensive (although prices of such systems are considerably lower than they were a few years ago), especially if the intention is to ensure redundancy for every infrastructure component, including Fibre Channel and network devices, such as adapters and switches, or disk arrays and their controllers. The cost might be prohibitive, especially for programmers whose sole goal is developing cluster-aware software or exploring the possibility of migrating existing applications into a clustered environment.
To remedy this issue, Microsoft made such functionality available without a specialized hardware setup by allowing the installation of a cluster on a single server with local storage only (also known as a single-node cluster). Obviously, such a configuration lacks any degree of high availability, but it has all the features necessary for application development and testing. Since local disks are not represented as Physical Disk resources, this clustering model requires selecting a distinct resource type called Local Quorum when running the New Server Cluster Wizard during initial setup, which we will review in detail later.
Despite the benefits mentioned earlier (such as a significant level of high availability and compatibility with a variety of hardware platforms, applications, and services), Single Shared Quorum has limitations. The first one is inherent to the technologies used to implement it. For example, configurations relying on SCSI-based shared storage are restricted by the maximum length of the SCSI bus connecting all cluster nodes to the same disk array (which typically forces you to place them in the same or adjacent data center cabinets). This distance can be increased considerably by switching to a Fibre Channel infrastructure, but not without a significant impact on hardware cost. Introducing iSCSI and NAS into the arsenal of available shared storage choices provides the same capability at lower prices, but there are still some caveats that restrict their widespread use (e.g., NAS devices are not supported as the Quorum resource). The second limitation is that, despite redundancy at the disk level (which can be accomplished through RAID sets or duplexing, with fault-tolerant disks and controllers), Single Shared Quorum still constitutes a single point of failure.
There are third-party solutions designed to address both of these limitations, and with the release of Windows Server 2003-based clustering, Microsoft introduced its own remedy in the form of Majority Node Set (MNS) Quorum. Like Local Quorum, MNS is defined as a separate resource that must be selected during cluster setup with the New Server Cluster Wizard. Also as in the Local Quorum model, dependency on shared storage hosting the Quorum resource is eliminated, in this case without a negative impact on high availability.
The level of redundancy is increased by introducing additional copies of Quorum stored locally on each node (in the %SystemRoot%\Cluster\MNS.%ResourceGUID%$\%ResourceGUID%$\MSCS folder, where %ResourceGUID% designates a 128-bit unique identifier assigned to the cluster at its creation). As you might expect, having more than one Quorum instance requires a different approach to preventing a “split-brain” scenario. This is handled by defining a different rule that determines when the cluster is considered operational (which, in turn, is necessary to make its resources available for client access). For this to happen, more than half of the cluster nodes must be functioning properly and be able to communicate with each other. The formula used to calculate this number is:
[(total number of nodes in MNS cluster)/2] + 1
where the square brackets denote the floor function, returning the largest integer equal to or smaller than the result of dividing the total number of nodes by two. For example, for a five-node cluster, three nodes would need to be running and communicating for its resources to be available (the same would apply to a four-node cluster). Clearly, setting up a two-node MNS cluster, although technically possible, does not make much sense from an availability perspective (since one node’s failure would force the other one to shut down all of its resources). For an MNS cluster to function, at least two servers (in a three-node cluster) must be operational (note that with a Single Shared Quorum, a cluster might be capable of supporting its resources even with one remaining node).
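The majority rule is easy to check with a few lines of code. The sketch below uses integer (floor) division, which reproduces the examples given in the text: three of five nodes, three of four, and two of three. The function names are our own and are not part of any clustering API.

```python
def mns_majority(total_nodes: int) -> int:
    """Minimum number of communicating nodes an MNS cluster needs to run."""
    if total_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    # Floor division: half the nodes, rounded down, plus one.
    return total_nodes // 2 + 1


def tolerated_failures(total_nodes: int) -> int:
    """Number of nodes that can fail while the cluster stays operational."""
    return total_nodes - mns_majority(total_nodes)


for nodes in (2, 3, 4, 5):
    print(nodes, mns_majority(nodes), tolerated_failures(nodes))
```

The loop shows why a two-node MNS cluster offers no availability benefit (it tolerates zero failures) and why a four-node cluster tolerates no more failures than a three-node one.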
Effectively, the rule guarantees that at any given point there will be no more than a single instance of every cluster resource. The clustering service on each node is configured to launch at boot time and to try to establish communication with a majority of the other nodes. This process is repeated every minute if the initial attempt fails.
This solution introduces additional requirements, since its architecture implies the existence of multiple copies of the clustered data (unlike the Single Shared Quorum model), which must be kept consistent. Although the clustering software itself is responsible for replicating the Quorum configuration across all nodes, this does not apply to service- and application-specific data. In general, there are two ways of handling this task. The first relies on mechanisms built into the application (e.g., log shipping in SQL Server 2000/2005 deployments). The second involves setting up replication at the file system or disk block level. This can be handled through software or hardware, a topic we plan to elaborate on in the next article.
In addition, since clustered resources are virtualized, some of the restrictions placed on the Single Shared Quorum model still apply. In particular, for resource failover to take place, nodes must be able to detect the failure of others through the absence of heartbeat signals. This requires that round-trip latency between nodes be no longer than 500 ms, which in turn limits the maximum allowed distance between them. The nodes also must be members of the same domain, and their public and private network interfaces have to reside (respectively) on the same subnets (which can be accomplished by setting up two VLANs spanning the multiple physical locations hosting cluster nodes).
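These placement rules lend themselves to a simple pre-deployment check. The following is a minimal sketch, assuming you have already collected each node's domain, its interface addresses, and a measured round-trip time; the `Node` class, the function name, and all sample addresses are hypothetical.

```python
import ipaddress
from dataclasses import dataclass

MAX_ROUND_TRIP_MS = 500  # heartbeat latency limit stated in the text


@dataclass
class Node:
    name: str
    domain: str
    public_ip: str
    private_ip: str


def placement_problems(a: Node, b: Node, public_subnet: str,
                       private_subnet: str, round_trip_ms: float) -> list:
    """Return a list of rule violations for a pair of nodes (empty if none)."""
    problems = []
    if a.domain.lower() != b.domain.lower():
        problems.append("nodes are in different domains")
    public = ipaddress.ip_network(public_subnet)
    private = ipaddress.ip_network(private_subnet)
    for node in (a, b):
        if ipaddress.ip_address(node.public_ip) not in public:
            problems.append(f"{node.name}: public interface outside {public}")
        if ipaddress.ip_address(node.private_ip) not in private:
            problems.append(f"{node.name}: private interface outside {private}")
    if round_trip_ms > MAX_ROUND_TRIP_MS:
        problems.append(
            f"round trip of {round_trip_ms} ms exceeds {MAX_ROUND_TRIP_MS} ms")
    return problems


# Hypothetical two-site deployment: NODE2's private interface is misplaced.
n1 = Node("NODE1", "corp.example.com", "10.1.0.11", "192.168.50.11")
n2 = Node("NODE2", "corp.example.com", "10.1.0.12", "192.168.60.12")
print(placement_problems(n1, n2, "10.1.0.0/24", "192.168.50.0/24", 120))
```

A check like this only validates the addressing plan and a single latency sample; it does not replace testing heartbeat behavior on the actual inter-site links.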
Furthermore, since Quorum updates are handled via network file shares called %ResourceGUID%$ (associated with the Quorum location listed earlier), both Server and Workstation services (LanManServer and LanManWorkstation, respectively) must be running on all cluster nodes and File and Printer Sharing for Microsoft Networks must be enabled for both private and public network connections.
Thus, when designing an MNS cluster, it is important to keep in mind the impact that site layout will have on its availability. For example, setting up two sites separated by a network link, with an equal number of nodes in each, will cause both to fail if communication between them is severed (since neither one contains a majority of nodes). In such a situation it might be beneficial to set up a third site with a single cluster node in it (and dedicated network links to the other two sites), used exclusively to establish the majority node count when needed. Alternatively, you can force some of the cluster nodes to host resources, although this requires manual intervention.
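The effect of site layout on the majority rule can be explored with a short simulation. This is a sketch under simplifying assumptions: every node within a partition can reach every other node in that partition, and the majority is computed as half the total node count, rounded down, plus one. The function name and site labels are hypothetical.

```python
def partitions_with_quorum(sites: dict, partitions: list) -> list:
    """Return the partitions that still hold a majority of all cluster nodes.

    sites      maps a site name to the number of cluster nodes it hosts
    partitions lists groups of site names that can still reach each other
    """
    total = sum(sites.values())
    needed = total // 2 + 1
    return [p for p in partitions if sum(sites[s] for s in p) >= needed]


# Two sites with two nodes each: a severed link leaves no majority anywhere.
two_sites = {"A": 2, "B": 2}
print(partitions_with_quorum(two_sites, [["A"], ["B"]]))

# Adding a third site with one node: if site A is cut off, B and C together
# still hold three of five nodes and keep the cluster running.
three_sites = {"A": 2, "B": 2, "C": 1}
print(partitions_with_quorum(three_sites, [["A"], ["B", "C"]]))
```

Because at most one partition can ever contain more than half of the nodes, the returned list has either zero or one entry, which is exactly the property that prevents a “split-brain” scenario.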