RAID 1D (Mirroring with delay)

Summary

RAID 1D is a reliable, cost-effective method of protecting data in non-mission-critical environments that face a higher risk of damage from user error (accidental file deletions or overwrites), a class of failure against which a real-time RAID setup offers no protection.

Configuration

Two hard drives of identical capacity (ideally of different vintages or manufacturers, to reduce the chance of simultaneous failure) are required for RAID 1D:

  • Hard drive 1 is the primary drive, to which all read/write activity occurs in real time. This drive is configured to be always on and is mounted both locally and via NFS, SMB, etc.
  • Hard drive 2 is the replica drive, which is powered on but kept spun down in a sleep mode (see the spin-down sketch after this list). This drive is woken up to receive an rsync job on a periodic basis (typically daily), but otherwise remains offline to create a “clean room” environment. Outside of the scheduled replication window, HDD 2 should be mounted only for data recovery, to rebuild data on HDD 1.
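
On a Linux host, the replica's spin-down behavior can be managed with hdparm. A minimal sketch, assuming the replica is /dev/sdb (a hypothetical placeholder):

# Hypothetical replica device; adjust to your system.
REPLICA_DEV=/dev/sdb

# Spin the replica down after ~1 minute of inactivity (-S 12 = 12 x 5 seconds).
hdparm -S 12 "$REPLICA_DEV"

# Or force it into standby immediately, e.g. after a replication job finishes.
hdparm -y "$REPLICA_DEV"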

As with RAID 1, the data is duplicated across two drives, helping to safeguard it from hardware malfunction on either drive. However, since RAID 1 writes all filesystem changes immediately to both drives, it does not protect against human-introduced filesystem errors such as accidental file deletion, accidental overwriting of existing files, or even extreme cases of mutilation like drive formatting or file table manipulation. The replication time delay introduced with RAID 1D gives the user a previous state from which to retrieve unaffected data, stored safely in an immutable form where it is immune to many common accidents and even to a wide range of malicious scripts.

Naming

RAID 1D is so named because it mirrors data as in RAID 1, but with a delay (D).

Improving replica drive isolation

To improve the security and isolation of the data on the replica drive, the volume should remain unmounted until the replication task begins. With the drive unmounted, the filesystem has greater protection from careless or malicious commands, although the physical device may still be damaged by commands targeting the device itself. When mounting the replica, mount it only to a single, local mount point on the server. Never expose the replica drive via NFS, SMB, or any other protocol; this limits the processes and machines that have access to the device.
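
One way to keep the volume unmounted by default is a noauto entry in /etc/fstab, so the replica is attached only by an explicit mount from the replication script. A sketch, assuming an ext4 filesystem and a placeholder UUID:

# /etc/fstab entry for the replica (UUID is a hypothetical placeholder).
# noauto: never mounted at boot; attached only by an explicit "mount /hdd2".
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /hdd2  ext4  noauto,rw  0  0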

Rationale

By introducing a time delay in replication, the security and consistency of the data stored on the mirror are assured until the next replication job. Longer delays between replication jobs offer a greater period in which to recover from accidents on the primary drive. However, the longer the delay, the greater the risk of losing recently created data that has not yet been replicated to the mirror.

Choosing a delay depends on the data usage scenario. For a single user doing non-critical work, RAID 1D with a 24- or 48-hour replication period balances keeping the backup data safe against limiting the loss of fresh data to a relatively short window.

Additionally, the user can trigger a manual replication event after producing valuable content between scheduled replications. Manual replication events should be used sparingly, as accelerating the replication schedule erodes the delayed protection that RAID 1D provides.
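
Assuming the scheduled job wraps the rsync command in a script (as in the Backup procedure below), a manual replication is simply an out-of-band run of that script:

# Trigger an immediate replication by hand after producing valuable work.
sudo /home/user/scripts/replicate.sh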

Limitations

The typical 24- to 48-hour replication schedule of RAID 1D may not be suitable for mission-critical environments such as database servers. However, a RAID 1D mechanism may be useful as a component of a larger backup and archival scheme. It is possible to reduce the replication delay to a shorter interval, for example 4-6 hours, but that leaves less time to catch accidental mistakes before they are replicated to the mirror. Assuming the system is actively used or monitored, 4-6 hours may still be sufficient time to notice corruption and begin a recovery job.
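
Shortening the delay is only a change to the cron schedule. A sketch of a 6-hour interval, using the script path from the Backup procedure below:

# Replicate every 6 hours (at 00:00, 06:00, 12:00, and 18:00).
0 */6 * * * /home/user/scripts/replicate.sh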

Backup procedure

Implementing RAID 1D is very straightforward. Schedule a cron job to run daily at the time the system sees the least use. Use an rsync job with checksum comparison enabled (--checksum) to ensure the highest-quality replication.

Example script:

rsync -axvh --checksum --delete --exclude-from="x" /hdd1/ /hdd2/

Example crontab entry:

0 3 * * * /home/user/scripts/replicate.sh
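
Putting the pieces together, replicate.sh can mount the replica, sync, and return it to its unmounted, spun-down state. A minimal sketch, not a hardened script; the device names below are hypothetical placeholders:

#!/bin/sh
# replicate.sh -- wake, mount, sync, and re-isolate the replica drive.
set -e                          # stop on the first error

REPLICA_PART=/dev/sdb1          # hypothetical replica partition
REPLICA_DISK=/dev/sdb           # whole replica device, for spin-down
REPLICA_MNT=/hdd2               # local-only mount point

mount "$REPLICA_PART" "$REPLICA_MNT"

# Mirror the primary; --checksum enables checksum comparison, --delete
# keeps the replica an exact copy, -x stays on one filesystem.
rsync -axh --checksum --delete /hdd1/ "$REPLICA_MNT"/

umount "$REPLICA_MNT"
hdparm -y "$REPLICA_DISK"       # put the replica back into standby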

Recovery procedure

Immediately following a data loss event on the primary drive, deactivate the replication job to protect the data on the replica drive. This is especially important if the replication interval is more frequent than 24 hours, or if the accident occurs close to the next replication time (which should be publicized to all system users). Proceed to mount and recover the data from the replica drive. Once the recovery is complete, return the replica drive to its unmounted, powered-down state.
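
A sketch of the recovery steps, assuming the same hypothetical device names as above; mounting the replica read-only ensures the clean copy cannot be altered during recovery:

# 1. Disable the scheduled replication (comment out the replicate.sh entry).
crontab -e

# 2. Mount the replica read-only so nothing can modify the clean copy.
mount -o ro /dev/sdb1 /hdd2

# 3. Copy the lost data back to the primary drive.
rsync -avh /hdd2/path/to/lost/data /hdd1/path/to/

# 4. Return the replica to its unmounted, powered-down state.
umount /hdd2
hdparm -y /dev/sdb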


This article and the concept described herein, referred to as RAID 1D, were developed and written by Gaelan Lloyd on 20 November 2014. You’re free to build upon this work or use it in your system with attribution to the author. No support is provided or warranty implied — use at your own risk.