In the commercial Unix marketplace, High Availability (HA) is today a key selling point for server solutions. Virtually every Unix supplier offers its own HA software to provide customers with near-fault-tolerant server systems at moderate prices. As a rule of thumb, redundancy is used to keep the overall IT system free of single points of failure (SPOF), a method that has been common in space flight and general aviation for decades. The common objective is to mask unplanned outages from users so that they can resume work quickly.
As an alternative, administrators could set up action plans describing who does what in case of an outage. But if no appropriately skilled person is on duty, or if the operator makes an error, your business is in danger. HA software replaces this error-prone manual process with a set of scripts and tools that perform the same failover activities reproducibly and reliably, and in far less time than any human operator.
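To make the idea concrete, such a failover script might look roughly like the following sketch. All interface, disk and service names here are invented for illustration; a real HA package wraps actions like these in event handling, logging and sanity checks. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the kind of takeover script a HA package might run on the
# backup node.  All device, interface and service names below are
# made-up examples, not taken from any real HA product.
SERVICE_IP=192.168.1.10     # address the clients connect to
IFACE=eth0                  # interface that takes over the address
SHARED_DEV=/dev/sdb1        # disk shared between both nodes
MOUNTPOINT=/export          # where the shared data lives

RUN=${RUN:-echo}            # dry run by default; set RUN= to execute

takeover() {
    $RUN fsck -p "$SHARED_DEV"                 # check the shared filesystem
    $RUN mount "$SHARED_DEV" "$MOUNTPOINT"     # acquire the shared disk
    $RUN ifconfig "$IFACE:1" "$SERVICE_IP" up  # take over the service address
    $RUN /etc/init.d/httpd start               # restart the service locally
}

takeover
```

The point is not the individual commands but that the whole sequence is scripted once and then executed identically on every failover, which is exactly what an operator under stress cannot guarantee.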
Frankly, everyone who runs a business 24 hours a day, 7 days a week and must not suffer outages longer than a couple of minutes or maybe half an hour needs HA. Unplanned outages can severely hamper your operations. Two 1995 studies by Oracle Corp. and Datamation showed that average businesses lost between 80,000 and 350,000 USD per hour of unplanned outage. After the 1993 World Trade Center bombing, 145 of the 350 businesses located in the building had to close down within a year because they had no redundant IT structure.
Keeping these numbers in mind, it is obvious that setting up a redundant IT structure is cheaper than the risk of even a short outage. This is especially true considering the relatively low prices of Intel-based PCs running a freely available POSIX-compatible Unix-like operating system. Moreover, administrators know exactly how expensive the additional machinery, software and operator education is, whereas the cost of unplanned outages can never be known in advance.
Since server downtimes can be unplanned ("failures") as well as planned (service downtimes), I'll talk about "outages" throughout this document, except where I explicitly mean an unplanned downtime.
With the 2.0 kernel release, Linux has many of the features needed by a HA software solution. Some additional features have to be added, though, and that's what this document is about. The most critical one is a robust, transaction-oriented filesystem which allows for fast filesystem checks after a node failure and takeover. A log-structured filesystem is currently being worked on -- please see the Filesystems subsection for more information.
The basic rule for making a server system HA is to use redundant parts where needed and affordable. Just as RAID (Redundant Array of Independent Disks) removes the single disk as a point of failure, every other part of a server is a potential single point of failure: the CPU, the power supply, the mainboard, the main memory, adapter cards et cetera -- that is, every part whose failure renders the overall system unusable.
For example, a single network adapter card in a server working in a client/server environment is a SPOF for that server. Likewise, a single SCSI adapter connecting to an external storage system is a SPOF. If one server in a group of several fails and cannot be easily and quickly replaced by another, that server is a SPOF for the whole server group or cluster.
The solution is quite straightforward: Adapter cards can be made redundant by simply doubling them within a server and making sure a backup adapter becomes active if the primary one fails.
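Such an adapter failover could be sketched as follows. Interface names and addresses are made up for illustration, and by default the script only prints what it would do:

```shell
#!/bin/sh
# Sketch: move the service address from a failed network adapter to a
# standby one.  Interface names and addresses are examples only.
PRIMARY=eth0                # adapter that just failed
STANDBY=eth1                # idle backup adapter
ADDR=192.168.1.10           # service address to move over
NETMASK=255.255.255.0

RUN=${RUN:-echo}            # dry run by default; set RUN= to execute

failover_nic() {
    $RUN ifconfig "$PRIMARY" down                            # retire the dead card
    $RUN ifconfig "$STANDBY" "$ADDR" netmask "$NETMASK" up   # activate the backup
    $RUN arping -U -c 3 -I "$STANDBY" "$ADDR"   # gratuitous ARP, if arping exists
}

failover_nic
```

The gratuitous ARP at the end is there so that clients and routers on the segment update their ARP caches and find the address on the new card without waiting for their cache entries to time out.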
CPU, power supply and other parts can be made redundant within a server too, but this requires special parts that are not very common in the PC environment and thus quite expensive. However, in cooperation with a software agent (the HA software), two or more servers in a HA cluster can be set up to replace each other in case of a node outage.
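The node-outage detection itself is usually done with a heartbeat: each node periodically probes its peer over a dedicated link and starts the takeover only after several consecutive misses. A minimal sketch follows; the peer address is invented, and for demonstration the probe command defaults to "false", simulating a peer that never answers:

```shell
#!/bin/sh
# Sketch of the heartbeat loop a backup node might run.  The peer
# address is an example; PROBE defaults to `false` here to simulate a
# dead peer -- in real use it would be something like a ping.
PEER=10.0.0.1
PROBE=${PROBE:-false}     # real use: PROBE="ping -c 1 -W 1 $PEER"
INTERVAL=1                # seconds between probes
MAXMISS=3                 # misses before declaring the peer dead

monitor() {
    misses=0
    while [ "$misses" -lt "$MAXMISS" ]; do
        if $PROBE >/dev/null 2>&1; then
            misses=0                  # peer answered: reset the counter
        else
            misses=$((misses + 1))    # another missed heartbeat
        fi
        sleep "$INTERVAL"
    done
    echo "peer $PEER presumed dead - starting takeover"
    # ...here the takeover actions (acquire shared disk, take over the
    # service address, restart the service) would be invoked...
}

monitor
```

Requiring several consecutive misses before declaring the peer dead is deliberate: a single lost probe on a busy network must not trigger a takeover, since a false takeover against a still-running peer is worse than a few seconds of extra detection delay.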
Please note that the Linux-HA software alone is not the universal answer to customers' availability requirements. Other parts like routers, bridges etc. must be set up in a redundant manner as well.
At this point, many customers stop thinking about redundancy. However, if the cluster nodes are placed side by side, a power outage or a fire can affect both. Consequently, an entire building, a site or even a town (think of earthquakes or other natural disasters) can be a SPOF! Such major disasters can be handled by "simply" locating the backup node(s) at a certain distance from the main site. This can be very costly, so IT users have to carefully evaluate their situation and decide which SPOFs to cover.
I happen to work as a second-level technical support specialist for IBM, working with IBM's High Availability & Clustering software solutions, namely HACMP for AIX (High Availability Cluster Multi-Processing). Numerous similar solutions are available from the other major Unix suppliers like HP, Sun and DEC, but since I know HACMP best, I will use its terminology and concepts throughout this document. As time goes by, I expect folks who are familiar with other commercial HA solutions to join. My intent is not to keep the IBM HACMP terminology at any price, but please let's not start lengthy meta-discussions about this!
If you like, you can look up the "HACMP V4.2 Concepts & Facilities Guide" (IBM form number SC23-1938-00) to learn more about the concepts involved.
Generally, I don't think Linux-HA will be in real competition with commercial Unix HA solutions any time soon. Those packages have all been available for a long time, and most of them are quite mature and valued for their stability. But who knows. Initially, the target is NT clustering...
For general information on Linux see the Linux Home Page and the Linux International page.