Rethinking Failsafes for Critical Linux Systems
Mar 19, 2010 5:00 AM PT
The Linux operating system is highly compatible with two hot computing trends: virtualization and cloud computing. Just as the 2001-2002 recession helped usher in Linux as a mainstream solution, virtualization may accelerate Linux usage during and after the current recession. Linux already has a powerful presence in the database and ERP realms. Currently, for every US$3 spent worldwide on Windows-based servers, $1 is spent on Linux-based servers. Most organizations either already have Linux servers holding critical information or could soon have them.
Today's demanding information environment dictates the careful assessment of how to protect critical Linux servers and their information. It is important that
- the latest updates and configuration settings are automatically saved;
- the Linux server can fail over with little interruption;
- recovery point and recovery time objectives are low, and data is current;
- a target server can re-synchronize after hours or days have passed; and
- database transactions are replicated over a WAN with write order preservation.
Traditionally, there are two main ways to protect information on Linux servers: backup software and the rsync utility. However, neither offers good answers to several important questions.
Can a Linux System State and Associated Applications Be Successfully Restored?
Yes! The configurations of Linux and server applications are often customized during installation, ongoing maintenance and general troubleshooting. Even servers with very similar functions are often configured differently. A primary goal of protecting a critical Linux server is being able to repair or replace the system and get it back into production quickly.
Even well-documented changes quickly become outdated, and the gaps often cause errors that are not found until the damage has been done. A process that automatically protects each server's unique configuration information allows those changes to be applied to a standby or replacement server for rapid recovery.
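One way to picture automatic protection of configuration information is a manifest that records a checksum for every file under a configuration directory, so that a later run can report exactly what changed. The following is only a minimal sketch under stated assumptions: the function names are hypothetical, the directory to scan is whatever the administrator chooses, and a real product would also track ownership, permissions and package state.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file under root (by relative path) to the SHA-256 of its contents."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def diff_manifests(old, new):
    """Return (added, removed, changed) lists of relative paths."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, changed
```

Comparing the manifest from the previous run with a fresh one yields the files that would need to be copied to a standby server, without relying on anyone having written the change down.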
Can a Linux Server Automatically Fail Over to a Standby Target for High Availability?
The key objective should be to get a server back to a functional condition where users can be productive. Many think that downtime is simply how long it takes to get data back. However, restoring lost data from backups can take several hours, if not days. Recovering an entire workload -- its operating system, applications and their data -- can be extremely complicated and take much longer if recovering from tape. In most cases, that is unacceptable; a different approach should be considered.
Restoring a system state is only the first step of the recovery. Particular processes must still be restarted in the right order, and changes must be applied to local network and global DNS-type services. A process that can recover an entire Linux workload in a single step can greatly reduce both the recovery time objective (RTO) and the recovery point objective (RPO) and get the server back into production quickly.
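The "right order" requirement can be treated as a dependency graph: each service lists what must already be running before it starts. The sketch below uses the standard library's graphlib module (Python 3.9 or later) to derive a valid start sequence. The service names and their dependencies are hypothetical, chosen only to illustrate the idea.

```python
from graphlib import TopologicalSorter

# Hypothetical services; each maps to the set of services it depends on.
DEPENDENCIES = {
    "network": set(),
    "storage": set(),
    "database": {"network", "storage"},
    "app-server": {"database"},
    "dns-update": {"app-server"},
}


def start_order(dependencies):
    """Return a start sequence in which every service follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())
```

Services with no dependency on each other, such as "network" and "storage" here, may come out in either order. A circular dependency raises graphlib.CycleError, which is a useful early warning that the recovery plan itself is inconsistent.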
How Old Is the Last Backup?
Synchronizing data to a target in real time avoids out-of-date information and greatly improves recovery point objectives. Backups and snapshots are typically performed once to a few times per day, so even an hour of lost updates can take days to reconstruct or, worse, be gone forever.
Rsync is a utility built into most Linux distributions, but it cannot make changes to the operating system of a running target. Rsync also has no automatic method to fail over for availability or, more importantly, to fail back to recover the original Linux server.
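The exposure created by periodic backups is simple arithmetic. Assuming backups are evenly spaced through the day (an assumption, since real schedules are often clustered overnight), the worst-case loss is the full interval between two backups, and the average loss is half of it. The function name below is hypothetical.

```python
def backup_exposure_hours(backups_per_day):
    """Return (worst_case, average) hours of updates at risk between backups."""
    if backups_per_day <= 0:
        raise ValueError("backups_per_day must be positive")
    interval = 24 / backups_per_day
    return interval, interval / 2
```

A single nightly backup leaves up to 24 hours of changes at risk, and even four backups a day leaves up to 6 hours, which is why real-time replication changes the picture so much.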
When Should Servers Resynchronize After They Have Been Disconnected?
There are two main times when the ability to resynchronize between a source and target is most important. The first is during the initial protection phase. If a network disconnection occurs, the process to resynchronize must be efficient, and until that has been completed, the source has no effective protection.
The second is after a failover to a target has occurred. During that time, the target, now acting as the source, receives the ongoing production changes. When the original source becomes available again, its data will be out of sync with the target that has been running production. It is very important that the data be resynchronized as soon as possible so that a failback can return the system to production.
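Efficient resynchronization usually means sending only what differs rather than copying everything again. The sketch below compares two byte strings in fixed-size blocks and copies only the blocks whose checksums do not match. It is a simplified illustration, not the rolling-checksum algorithm that rsync uses and not any vendor's method. The block size and function names are assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size, not a recommendation


def block_digests(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(source, target, block_size=BLOCK_SIZE):
    """Return indices of source blocks that differ from, or are missing on, the target."""
    src = block_digests(source, block_size)
    tgt = block_digests(target, block_size)
    return [i for i, d in enumerate(src) if i >= len(tgt) or d != tgt[i]]


def resync(source, target, block_size=BLOCK_SIZE):
    """Return a copy of target brought in line with source, copying only changed blocks."""
    result = bytearray(target[:len(source)])  # drop any data past the source's end
    for i in changed_blocks(source, target, block_size):
        start = i * block_size
        result[start:start + block_size] = source[start:start + block_size]
    return bytes(result)
```

After hours or days apart, the number of changed blocks is typically a small fraction of the total, so the time without effective protection shrinks accordingly.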
How Are Transactional Database Changes Captured to Preserve Write Order Integrity?
A target at a different location can protect against a primary site outage or disaster. Routing along multiple network paths ensures that the WAN itself is not a single point of failure. However, the multiple data transmissions that make up a single transaction can travel along different network paths and may arrive at the target in a very different order than they were sent.
For highly transactional applications like databases and email, the result can be that a target fails data integrity checks and therefore cannot function as the source. This is similar to the need for these applications to be quiesced during backups to prevent "fuzzy backups" from occurring. There must be a mechanism to ensure the write order integrity of the transaction components and therefore properly protect a system.
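One common way to preserve write order is to number each write at the source and have the target hold back anything that arrives early until the gap is filled. The class below is a minimal sketch of that idea. The name OrderedApplier and the use of a simple integer sequence starting at zero are assumptions for illustration, not a description of any particular replication product.

```python
class OrderedApplier:
    """Apply replicated writes strictly in sequence order, buffering early arrivals."""

    def __init__(self):
        self.next_seq = 0   # sequence number of the next write to apply
        self.pending = {}   # early arrivals, keyed by sequence number
        self.applied = []   # writes in the order they were actually applied

    def receive(self, seq, write):
        """Accept a write from the network; apply it and any writes it unblocks."""
        if seq < self.next_seq or seq in self.pending:
            return  # duplicate delivery, already applied or already buffered
        self.pending[seq] = write
        while self.next_seq in self.pending:
            self.applied.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
```

If writes numbered 2, 0 and 1 arrive in that order, nothing is applied until write 0 shows up, and then all three are applied as 0, 1, 2. The target therefore never exposes a state the source did not pass through.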
Adopting these steps for protecting critical Linux systems will not only keep the systems highly available but also reduce recovery time and recovery point objectives compared with traditional backup methods. Being able to recover an entire system state also makes it possible to recover onto a server that is not identical to the original, and it reduces the need to reinstall the operating system, applications and associated data, so the system returns to production more rapidly.
Brace Rennels is a CBCP (Certified Business Continuity Professional) at Double-Take Software.