Mainboards used for Linux-HA enabled servers should have enough I/O slots to hold the network and disk adapters needed for redundancy. In detail, you will probably need at least 2 disk adapters and 2 network adapters. If all cards are PCI, this adds up to at least 4 PCI slots, and more complex setups will probably need more. For slow networks (10 MBit/s Ethernet, 16 MBit/s Token Ring), ISA cards may be usable, which frees PCI slots for the disk adapters. There are also multifunction cards around which can potentially save slots, e.g.
The overall card is a SPOF, though.
SMP machines are well suited for high performance computing -- at the cost of having to reboot if a CPU fails. Statistically, a 4 CPU machine will fail about 4 times more often than a single CPU one. Statistically...
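The "4 times" figure can be checked with a little arithmetic. The sketch below assumes that CPUs fail independently and that any single CPU failure takes the whole machine down; the per-CPU failure probability used in the example is made up for illustration.

```python
def any_cpu_failure_probability(p_single: float, cpus: int) -> float:
    """Probability that at least one of `cpus` independent CPUs fails.

    p_single is the (assumed) probability that one CPU fails during
    the period of interest.
    """
    return 1.0 - (1.0 - p_single) ** cpus


if __name__ == "__main__":
    p = 0.01  # illustrative value only, not a measured failure rate
    single = any_cpu_failure_probability(p, 1)
    quad = any_cpu_failure_probability(p, 4)
    print(f"single CPU: {single:.4f}, 4 CPUs: {quad:.4f}, ratio: {quad / single:.2f}")
```

For small failure probabilities the ratio is only slightly below 4, so the rule of thumb holds well enough for planning purposes.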
The use of ECC capable mainboards and parity memory
modules is recommended. Gabriel Paubert (
firstname.lastname@example.org) will start working on an
ECC handler for the Linux kernel.
Since we are talking about server systems, no PCI video cards are needed -- cheap ISA cards with simple 14 inch B&W monitors will do the job. Serial terminals may also fit but need a serial port.
Since we are talking about business critical systems, customers will probably not be able to pick the Linux distribution they like best. Distributions are too different in stability and handling to allow for that. Major Linux distribution makers are invited to talk to us about what to do.
An interesting new development is the FreeLinux Project which might suit the needs of Linux-HA. Another interesting idea is the LINNET proposal.
VARs offering the system to end customers should only support a limited range of kernel releases. This also accommodates Pathlight's concern of not releasing source code (see section Pathlight Technology Inc.).
A different approach may be the following: major Linux distribution makers contribute changes to their distribution to make sure Linux-HA integrates smoothly. This requires some more coordination but makes life easier for users and system integrators. Or they integrate Linux-HA into their distributions and offer configuration and installation service.
Remark: In the following sections, a 2-node cluster will mostly be used to illustrate the concepts. Please keep in mind that we want to be open for more nodes, which means failover concepts must provide some more logic than in these simplified examples.
The main objective is to minimize application downtimes and keep the overall cluster in a consistent state. Therefore, the Linux-HA software will consist of several modules which will run on all machines:
syslogd or by other means. The daemon will communicate with its counterparts on other cluster nodes and send/receive SNMP traps. The cluster manager daemon will be written in a compiled language (e.g. C) for performance and integrity reasons.
Documentation/watchdog.txt for more information), either with or without a hardware watchdog card. For more flexibility, I asked Alan Cox to change the watchdog function to allow the user to write the number of seconds and what to do into /dev/watchdog:
echo "60 H" > /dev/watchdog
will halt (not reboot) the machine after 60 seconds, or
echo "10 R" > /dev/watchdog
will reboot it after 10 seconds. The reason for the change request is that I want Linux-HA to be able to either reboot or halt the machine, depending on the HA configuration. In some situations it makes sense to halt a machine (e.g. if something goes wrong but the network card is still active, which prevents a clean IP address takeover to another card). Halting a machine will make sure all resources are released. On the other hand, when we have rotating resources, we may want to reboot the machine and start HA automatically, e.g. in a web server or firewall configuration. The timeout will need to be set individually as well. This mechanism may seem a bit rude but the objective is clear: the overall cluster must remain in a consistent state, and if one of the bits & pieces on a node fails it is unclear whether or not this node will behave normally in case of other events. So it is better to disable it as quickly as possible to ensure another node takes over.
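The request format proposed here (a number of seconds followed by H or R) is simple enough to validate before it is ever written to the device. The following is only a sketch of such a check: the function and type names are made up, and the format itself is the proposal from this section, not an interface the stock watchdog driver is known to accept.

```python
from typing import NamedTuple


class WatchdogRequest(NamedTuple):
    timeout_seconds: int
    action: str  # "halt" or "reboot"


# Action letters as used in the two echo examples of this section.
_ACTIONS = {"H": "halt", "R": "reboot"}


def parse_watchdog_request(text: str) -> WatchdogRequest:
    """Parse a request such as '60 H' or '10 R' (proposed format only)."""
    fields = text.split()
    if len(fields) != 2:
        raise ValueError(f"expected '<seconds> <H|R>', got {text!r}")
    seconds_text, action_code = fields
    try:
        seconds = int(seconds_text)
    except ValueError:
        raise ValueError(f"timeout is not a number: {seconds_text!r}") from None
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    if action_code not in _ACTIONS:
        raise ValueError(f"unknown action {action_code!r}, expected H or R")
    return WatchdogRequest(seconds, _ACTIONS[action_code])


if __name__ == "__main__":
    print(parse_watchdog_request("60 H"))
    print(parse_watchdog_request("10 R"))
```

A check like this would let the Config & Admin tool reject a broken timeout or action before the cluster manager depends on it.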
.rhosts). Instead, a secure protocol will be used, e.g. DCE RPC or the Secure Shell. The Config & Admin tool might care for proper setup of either of these on all nodes.
Commercial solutions allow for either starting the HA daemons on
reboot (e.g. via an
rc script or from
manually. In a production environment, you do not want to start HA on
reboot. Consider the primary server crashing for certain reasons. The
backup machine will take over, and the primary may reboot. If HA comes
up automatically, the primary node re-claims the resource group and
potentially crashes again and so on. Plus, you want to investigate the
crash before putting the crashed node in production again. So,
starting HA automatically on reboot is nice for customer
demonstrations but certainly not recommended for the production
environment.
The X/curses and command line interfaces will allow setting the start
mode for the next system start, i.e. insert an appropriate
Linux-HA will use cluster ID numbers which are common to all nodes belonging to a cluster. Nodes belonging to different clusters must have different cluster IDs. This is handled via SNMP. When a node starts, it will query the network for any other living node with the same cluster ID number. If a living node or a cluster with the same cluster ID exists, the new node will attempt to enter the cluster but will only be admitted if all living nodes agree. If no living node answers, the new node will assume it is the first one and will acquire the resource groups for which it is the primary.
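The start-up decision described above boils down to three outcomes. The sketch below leaves out the SNMP transport entirely and only shows the decision logic; the Reply type, the JoinDecision names and the idea that each peer answers with a simple yes/no are assumptions made for illustration.

```python
from enum import Enum
from typing import Iterable, NamedTuple


class JoinDecision(Enum):
    FIRST_NODE = "acquire the resource groups this node is primary for"
    JOIN = "enter the existing cluster"
    REJECTED = "stay out of the cluster"


class Reply(NamedTuple):
    node: str         # name of the answering node
    cluster_id: int   # cluster ID the answering node belongs to
    agrees: bool      # whether it agrees to admit the new node


def decide_join(my_cluster_id: int, replies: Iterable[Reply]) -> JoinDecision:
    """Decide what a starting node does, given the answers to its query."""
    # Answers from nodes of other clusters on the same network are ignored.
    peers = [r for r in replies if r.cluster_id == my_cluster_id]
    if not peers:
        return JoinDecision.FIRST_NODE
    if all(r.agrees for r in peers):
        return JoinDecision.JOIN
    return JoinDecision.REJECTED


if __name__ == "__main__":
    print(decide_join(1, []))
    print(decide_join(1, [Reply("linha", 1, True)]))
```

With more than two nodes, a single dissenting peer is enough to keep the new node out, which matches the "all living nodes agree" rule.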
There will be three ways of stopping Linux-HA:
Linux-HA is forcibly halted, resources are not released
Linux-HA is gracefully stopped, resources are released but not taken over by another node
Linux-HA is gracefully stopped, resources are released and taken over by another node
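The three stop modes differ in only two respects: whether resources are released, and whether another node may take them over. A minimal way to model this is sketched below; the mode names and the step descriptions are placeholders chosen for this example, not names Linux-HA defines.

```python
from enum import Enum
from typing import List


class StopMode(Enum):
    # value = (release resources?, let another node take over?)
    FORCED = (False, False)
    GRACEFUL = (True, False)
    GRACEFUL_WITH_TAKEOVER = (True, True)

    def __init__(self, releases_resources: bool, allows_takeover: bool) -> None:
        self.releases_resources = releases_resources
        self.allows_takeover = allows_takeover


def stop_plan(mode: StopMode) -> List[str]:
    """Return the (illustrative) steps performed for a given stop mode."""
    steps: List[str] = []
    if mode.releases_resources:
        steps += [
            "stop applications",
            "unmount filesystems and release disks",
            "release service address",
        ]
    steps.append("stop cluster manager daemon")
    if mode.allows_takeover:
        steps.append("let a standby node acquire the resource groups")
    return steps


if __name__ == "__main__":
    for mode in StopMode:
        print(mode.name, "->", stop_plan(mode))
```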
Linux-HA will initially be completely generic, that is no application
will be specifically supported. Experience shows that specific
adaptations make such an HA solution inflexible. The only interface
between Linux-HA and applications will be the names of a start and
stop script or executable which will be executed by the
stop_server events, respectively.
Since you never know which application needs which permissions, everything has to run as root (keep security aspects in mind!) which means the start and stop scripts can do whatever we like them to. In principle, we can start any application that does not need user interaction during start. Everything that can be safely pushed in the background can thus be made highly available.
Stopping the application can be achieved by calling an application specific command, sending appropriate signals to processes etc.
This way, applications will not be supervised by the Linux-HA software itself, keeping it simple and flexible. The start script, however, can start a process which supervises the application and does what is needed to either attempt to restart it or initiate a controlled and graceful failover to a standby node which in turn restarts the application.
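Such a supervising process could look roughly like the sketch below. It runs the application in the foreground, restarts it a limited number of times after a failure and then hands over to a failover hook. The function name, the restart limit and the failover callback are all assumptions; how a failover is actually requested from the cluster manager is not defined yet.

```python
import subprocess
import sys
from typing import Callable, Sequence


def supervise(command: Sequence[str], max_restarts: int,
              failover: Callable[[], None]) -> int:
    """Run `command` and restart it after failures.

    Returns the number of times the command was started. A zero exit
    code is treated as a deliberate stop. After `max_restarts` failed
    restarts the `failover` hook is called instead of trying again.
    """
    starts = 0
    while True:
        starts += 1
        exit_code = subprocess.run(command).returncode
        if exit_code == 0:
            return starts          # application was stopped cleanly
        if starts > max_restarts:
            failover()             # give up locally, let a standby node take over
            return starts


if __name__ == "__main__":
    # Demo with a command that always fails: 1 start + 2 restarts, then failover.
    always_fails = [sys.executable, "-c", "raise SystemExit(1)"]
    count = supervise(always_fails, max_restarts=2,
                      failover=lambda: print("requesting failover"))
    print("started", count, "times")
```

A real supervisor would probably also want a delay between restarts and a way to tell a crash from an operator-initiated stop; both are left out to keep the sketch short.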
This means, integrating an application boils down to writing start and
stop scripts which can run in the background. The bottom line is, if
you can run these scripts from
cron, they will also work inside
At a later time, a clustering API might become useful which can be used by applications to control the cluster in certain ways or communicate with the cluster manager daemon.
syslogd will be used for reporting and detecting errors. In a
real-life setup, make sure
syslog is configured on each host to
log locally and to a remote loghost. Otherwise you may not be able to
find out what went wrong if a failed machine won't reboot.
It would be better to have a generic error logging interface like the AIX error logger. This way, one could handle errors from the kernel more easily. Currently, every change of a device driver or kernel error message would require a Linux-HA update.
It is generally a bad idea for users to log directly onto a cluster node.
If you log in via TCP/IP (e.g.
telnet), the connection will be
lost in case of a failover. If you log in via serial ports, connections
will be lost as well because there is no way to take over serial ports
except if you use network attached terminal adapters for which the
first rule applies.
SNMP will be used to enable remote agents to monitor and/or control a cluster. The package used will be the CMU SNMP toolkit.
NTP will be used to ensure consistent time on all nodes in a cluster. Otherwise, debugging events and errors will be really hard. Ideally, multiple radio receivers will be attached to some cluster nodes, and the NTP configuration file will be set up to synchronize the time from multiple sources. That way, timekeeping will have no SPOF. Public NTP servers on the Intranet or the Internet can also be used.
Here is the skeleton of the config file format I made. Some Linux specific parts are possibly missing but this is about where I'll start off. What we need now is a library to read/write these objects. Maybe the KConfig C++ class of KDE (Kool Desktop Environment) is suitable. Samba uses the same format but isn't written in C++ (I don't know very much about C++).
The configuration utility will write the ASCII file during configuration. Configuration will be done on one node only and then distributed (synchronized) to the other nodes. Each update will require another synchronization. This way, there will be no inconsistencies. During synchronization, the config tool will also convert the ASCII file (which may contain comments for readability) into GDBM, and all the HA tools will only use the GDBM version for speed.
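The conversion step could be as simple as flattening every object into key/value pairs. The sketch below assumes the ASCII file has already been parsed into (class, fields) pairs and uses a made-up key scheme of the form class.index.field. It writes to Python's portable dbm.dumb module as a stand-in, because the GDBM binding (dbm.gnu) is not available on every installation.

```python
import dbm.dumb
import os
import tempfile
from typing import Dict, Iterable, Tuple

Record = Tuple[str, Dict[str, object]]


def compile_config(records: Iterable[Record], db_path: str) -> int:
    """Write parsed config objects into a dbm database.

    Keys look like 'adapter.1.ip_label' (an assumed naming scheme);
    the index counts objects of the same class in file order.
    Returns the number of key/value pairs written.
    """
    counters: Dict[str, int] = {}
    written = 0
    with dbm.dumb.open(db_path, "n") as db:   # "n": always start from an empty database
        for section, fields in records:
            index = counters.get(section, 0)
            counters[section] = index + 1
            for key, value in fields.items():
                db[f"{section}.{index}.{key}"] = str(value)
                written += 1
    return written


if __name__ == "__main__":
    sample = [
        ("adapter", {"ip_label": "seneca", "function": "service"}),
        ("adapter", {"ip_label": "seneca_stby", "function": "standby"}),
        ("cluster", {"id": 1, "name": "linuxtest"}),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ha_config")
        print(compile_config(sample, path), "entries written")
        with dbm.dumb.open(path, "r") as db:
            print(db["adapter.1.ip_label"])
```

Since the database is rebuilt from scratch on every synchronization, comments in the ASCII file cost nothing at run time.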
Please note that Linux-HA will only parse the file during start. This way, the file will only hold static information which is needed when starting the cluster manager daemon. Runtime information will be kept in memory dynamically (e.g. node, adapter and network status), represented by SNMP variables.
Comments start with a hash.
# the first class is "adapter". The class stores information about all
# sorts of network adapters (ethernet, rs232, fddi, tokenring etc.)
[adapter]
# the type -- determines which network heartbeat module to use
type = "ether"
# the network it's attached to
network = "ether1"
# the node it belongs to
nodename = "seneca"
# the IP address/name it has (i.a.w. /etc/hosts)
ip_label = "seneca"
# its function (service, boot, standby)
function = "service"
# a MAC address if appropriate
haddr = "0xdeadbeef0123"

[adapter]
type = "ether"
network = "ether1"
nodename = "seneca"
ip_label = "seneca_stby"
function = "standby"
haddr = ""

[adapter]
type = "ether"
network = "ether1"
nodename = "seneca"
ip_label = "seneca_boot"
function = "boot"
haddr = ""

# class cluster stores the information of this cluster
[cluster]
# the cluster ID - must be unique within a logical network or subnet. Part
# of the information in each heartbeat packet.
id = 1
# cluster name - just a string
name = "linuxtest"
# this node
nodename = "seneca"
# number of nodes - starts with "1"
highest_node_id = 2
# number of networks - starts with "0"
highest_network_id = 1

# this is a leftover from a real HACMP cluster.
# I am not quite sure if we can use this. HACMP starts some daemons
# automatically, clinfo, the cluster info daemon (a SNMP client),
# the cluster smux peer daemon clsmuxpd, etc.
[daemons]
nodename = "hawww1"
daemon = "clinfo"
type = "start"
object = "time"
value = "true"

# class events hold all the events (see the event section in the HOWTO)
[event]
# the event name
name = "node_up_local"
# a description
desc = "Script run when it is the local node which is joining the clust"
# some real HACMP data, dunno whether we need it.
setno = 0
msgno = 0
# the message number from the NLS message catalog
catalog = ""
# the executable
cmd = "/usr/sbin/cluster/events/node_up_local"
# a notify script if appropriate
notify = ""
# a pre event script if appropriate
pre = ""
# a post event script if appropriate
post = ""

# the class group holds information about resource groups
[group]
# its name
group = "linuxtest"
# type cascading, rotating, (concurrent)
type = "cascading"
# participating nodes and their priority
nodes = "seneca linha"

# class networks describes the networks
[network]
# its name
name = "rs232"
# type (serial, public, private) ("network attribute")
attr = "serial"
# the network number as known to the cluster software
network_id = 0

[network]
name = "ether"
attr = "public"
network_id = 1

# class nim describes network modules. These handle the heartbeat in a
# network specific manner (RS232 is different from IP, ATM is different
# from Ethernet because it cannot do Multicast etc.)
[nim]
name = "ether"
desc = "Ethernet Protocol"
addrtype = 0
path = "/usr/sbin/cluster/nims/nim_ether"
para = ""
grace = 30
# heartbeat rate in microseconds, well ...
hbrate = 500000
# if 12 are missing, an alert is created.
cycle = 12

# class node describes the individual nodes
[node]
# node name
name = "seneca"
# do logging in a verbose manner (e.g. "set -x" in the event scripts or not)
verbose_logging = "high"
# node number
node_id = 1

[node]
name = "linha"
verbose_logging = "high"
node_id = 2

# class resource describes what belongs to a resource group.
[resource]
# name
group = "linuxtest"
# which service IP label(s) are in the RG
service_label = "seneca"
# all FS in a line can make updates complicated...
# all FS to mount locally
filesystem = "/usr/local/ftp /usr/local/etc/httpd"
# all FS to export explicitly
export_filesystem = "/usr/local/ftp /usr/local/etc/httpd"
# which applications (separate class) to start/stop
applications = "linuxtest"
# acquire the resource group on a standby node if the primary isn't there
# or not. not = false, yes = true.
inactive_takeover = "false"
# only for concurrent access. disk fencing makes sure only active nodes
# can access the shared disks
ssa_disk_fencing = "false"

# class server describes application servers
[server]
# the name (referenced e.g. in class resource)
name = "linuxtest"
# the name of a start and stop script. Will run as root, in the background.
start = "/usr/local/cluster/start_linuxtest"
stop = "/usr/local/cluster/stop_linuxtest"
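Until a suitable library turns up, reading this format takes very little code. One thing to watch out for is that a class such as [adapter] appears several times, so stock INI readers that expect unique section names will not do. The sketch below is a hand-rolled reader, not the planned library: it handles only what the skeleton uses (whole-line comments, quoted strings, bare integers), and all names in it are made up.

```python
from typing import Dict, List, Tuple, Union

Value = Union[str, int]
Record = Tuple[str, Dict[str, Value]]

SAMPLE = """
# two adapters of the same class
[adapter]
type = "ether"
ip_label = "seneca"
function = "service"

[adapter]
type = "ether"
ip_label = "seneca_stby"
function = "standby"

[nim]
name = "ether"
# heartbeat rate in microseconds
hbrate = 500000
cycle = 12
"""


def parse_config(text: str) -> List[Record]:
    """Parse the config skeleton into a list of (class, fields) pairs."""
    records: List[Record] = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            records.append((line[1:-1].strip(), {}))   # new object of a class
            continue
        key, sep, value = line.partition("=")
        if not sep or not records:
            raise ValueError(f"line {line_no}: expected 'key = value' inside a class")
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] == '"':
            parsed: Value = value[1:-1]   # quoted string
        else:
            parsed = int(value)           # bare values are integers in the skeleton
        records[-1][1][key.strip()] = parsed
    return records


if __name__ == "__main__":
    config = parse_config(SAMPLE)
    labels = [f["ip_label"] for cls, f in config if cls == "adapter"]
    nim = next(f for cls, f in config if cls == "nim")
    print(labels)
    # If hbrate is read as the interval between heartbeats, an alert
    # is raised after hbrate * cycle microseconds of silence.
    print(nim["hbrate"] * nim["cycle"] / 1_000_000, "seconds until alert")
```

Keeping the objects in file order as a plain list makes it easy to write them back out or to hand them to the synchronization step.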
I propose the following event structure:
This script is called when the local node joins the cluster or a remote node leaves the cluster. Does a boot -> service swap. Called by the node_up_local, node_down_remote scripts.
This script is called when a remote node leaves the cluster. Does a standby_address -> takeover_address swap if a standby_address is configured and up. Called by the node_down_remote, node_up_local scripts.
This script is periodically called as a timeout when the current event takes too long. Is primarily used to notify an operator. Called by the cluster manager daemon.
This script is called when a running event script returns an error code != 0. Called by the cluster manager daemon.
This event script is called when a standby adapter goes down. Called by the cluster manager daemon.
This script activates the disks and mounts filesystems. Called by the node_up_local, node_down_remote scripts.
This event script is called when a standby adapter goes up. Called by the cluster manager daemon.
This event script is called when a network goes down (all of the network adapters on a physical network are down or unreachable). Called by the cluster manager daemon. Has an associated complete script.
This event script is called when a network goes up. Called by the cluster manager daemon. Has an associated complete script.
This script is called when a node leaves the cluster. Called by the cluster manager daemon. Calls the respective sub-event script for local or remote. Has an associated complete script.
This script is called when the local node leaves the cluster. Called by node_down. Has an associated complete script.
This script is called when a remote node leaves the cluster. Called by node_down. Has an associated complete script.
This script is called when a node joins the cluster. Called by the cluster manager daemon. Calls the respective sub-event script for local or remote. Has an associated complete script.
This script is called when the local node joins the cluster. Called by node_up. Has an associated complete script.
This script is called when a remote node joins the cluster. Called by node_up. Has an associated complete script.
This script is called when the local node leaves the cluster. Called by node_down_local.
This script is called if the local node has the remote node's service address on its standby adapter, and either the remote node re-joins the cluster or the local node leaves the cluster gracefully. Called by node_down_local, node_up_remote.
This script unmounts filesystems and releases the disks.
This script starts the server application. Called when the local node is completely up or a remote node has finished leaving the cluster. Called by node_up_local_complete, node_down_remote_complete. Args: server_name.
This script is called to stop the application server(s) when a local node leaves the cluster or a remote node joins the cluster. Called by node_down_local, node_up_remote.
This event script is called when the service address of a node goes down. The cluster manager then swaps the service adapter with the standby adapter. Called by the cluster manager daemon. Has an associated complete script.
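Two mechanisms recur in the event descriptions above: picking the local or remote sub-event, and watching a running event script for timeouts and errors. The sketch below shows both under some assumptions. The sub-event names follow the _local/_remote pattern used in this section, while the two callbacks stand in for the timeout and error events and are hypothetical hooks, not the real scripts.

```python
import subprocess
import sys
from typing import Callable, Sequence


def sub_event_name(event: str, node: str, local_node: str) -> str:
    """Map e.g. node_down to node_down_local or node_down_remote."""
    scope = "local" if node == local_node else "remote"
    return f"{event}_{scope}"


def run_event(command: Sequence[str], warn_after: float,
              on_too_long: Callable[[], None],
              on_error: Callable[[int], None]) -> int:
    """Run one event script and report timeouts and failures.

    on_too_long is called every `warn_after` seconds while the script
    is still running; on_error is called once if the exit code is not 0.
    """
    process = subprocess.Popen(command)
    while True:
        try:
            exit_code = process.wait(timeout=warn_after)
            break
        except subprocess.TimeoutExpired:
            on_too_long()          # script keeps running, operator is only notified
    if exit_code != 0:
        on_error(exit_code)
    return exit_code


if __name__ == "__main__":
    print(sub_event_name("node_down", "linha", local_node="seneca"))
    slow_and_failing = [sys.executable, "-c",
                        "import time; time.sleep(0.5); raise SystemExit(3)"]
    run_event(slow_and_failing, warn_after=0.2,
              on_too_long=lambda: print("event takes too long"),
              on_error=lambda code: print("event failed with", code))
```

Note that the timeout hook only notifies; it deliberately does not kill the event script, since an aborted half-done takeover would leave the node in an undefined state.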
Plus, there are several sub-event scripts. Some of the most important are:
This script is used during adapter swap and IP address takeover. Swaps either a single adapter's address (first form) or two adapters at a time (second form).
This script is used to swap the MAC address of an adapter.
This subsection will be filled later.