HPC Disaster Recovery: RTO/RPO, Checkpointing, and Failover Planning
HPC disaster recovery strategy: RTO and RPO for HPC environments, 3-2-1 backup rule, DMTCP checkpoint/restart, SLURM checkpoint directives, active-passive failover, storage replication, and DR testing schedule.
An HPC cluster is not a transaction processing system — the failure of a node during a 72-hour simulation does not corrupt a database. But it does destroy hours or days of computation. Effective HPC disaster recovery addresses two distinct challenges: recovering the infrastructure quickly after failures, and preserving the computational work in progress so jobs can resume rather than restart from scratch.
RTO and RPO in HPC Context
Traditional IT disaster recovery metrics apply to HPC with nuances:
Recovery Time Objective (RTO): How quickly must the cluster be restored to operational state? For research clusters, 4–24 hours is typically acceptable (researchers work around outages). For production HPC supporting business-critical simulations, RTO may be 1–4 hours.
Recovery Point Objective (RPO): How much data can be lost? For user data on persistent storage, RPO should be 24 hours or less (daily backup). For active job state, RPO depends on checkpoint frequency — a job that checkpoints every 2 hours can lose at most 2 hours of computation.
Job restart vs. resume: This is HPC-specific. Without checkpointing, a failed job must restart from scratch. With checkpointing, it resumes from the last checkpoint. For a 5-day simulation that loses a node after day 4, that is the difference between rerunning four days of computation and resuming from the most recent checkpoint.
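To make the RPO arithmetic concrete, the sketch below computes the worst-case computation lost when a job fails just before its next checkpoint. The helper name `worst_case_loss` and the figures used (64 cores, 96 hours elapsed) are illustrative assumptions, not values from a specific cluster.

```shell
#!/bin/bash
# Worst-case lost work, in core-hours, if a job fails just before
# its next checkpoint would have been written.
# Arguments: hours since the last saved state, number of cores.
# (Hypothetical helper for illustration.)
worst_case_loss() {
    local hours_since_save=$1 cores=$2
    echo $(( hours_since_save * cores ))
}

# Without checkpointing, the last saved state is the job start, so the
# loss is the full elapsed runtime (here: 96 hours into a 5-day run).
echo "no checkpoint:  $(worst_case_loss 96 64) core-hours"
# With a 2-hour checkpoint interval, at most 2 hours are lost.
echo "2 h checkpoint: $(worst_case_loss 2 64) core-hours"
```

On these assumed numbers the gap is 6144 core-hours versus 128, which is the kind of figure stakeholders can weigh against the I/O cost of checkpointing.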
The 3-2-1 Rule Applied to HPC
The 3-2-1 backup rule (3 copies, 2 different media, 1 offsite) applies to HPC data as follows:
| Copy | Location | Media | Purpose |
|---|---|---|---|
| Primary | Production parallel filesystem (BeeGFS) | NVMe + SAS | Active job data |
| Secondary | Backup NAS | HDD RAID | Daily snapshots of user data |
| Offsite | Object storage or tape | Cloud/LTO | Long-term archive, DR copy |
Active scratch (job I/O) is typically not backed up — it is by definition transient. User home directories and project storage should be snapshotted daily.
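The rule can be checked mechanically against an inventory of copies. The following is a minimal sketch, assuming a hypothetical inventory format of one copy per line (`<name> <media> <onsite|offsite>`); the function name `check_321` and the media labels are illustrative.

```shell
#!/bin/bash
# Check an inventory of data copies against the 3-2-1 rule.
# Input on stdin: one copy per line as "<name> <media> <onsite|offsite>".
# Returns 0 only if there are at least 3 copies, at least 2 distinct
# media types, and at least 1 offsite copy.
check_321() {
    awk '
        NF >= 3 {
            copies++
            media[$2] = 1
            if ($3 == "offsite") offsite++
        }
        END {
            distinct = 0
            for (m in media) distinct++
            exit !(copies >= 3 && distinct >= 2 && offsite >= 1)
        }'
}

# Example inventory mirroring the table above (labels are simplified).
printf '%s\n' \
    "beegfs nvme onsite" \
    "backup-nas hdd onsite" \
    "archive tape offsite" | check_321 && echo "3-2-1 satisfied"
```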
Checkpoint/Restart with DMTCP
DMTCP (Distributed MultiThreaded CheckPointing) can checkpoint running MPI applications without any source code modification. This makes it applicable to legacy HPC codes where source is unavailable.
# Install DMTCP
git clone https://github.com/dmtcp/dmtcp
cd dmtcp && ./configure && make && sudo make install
# Launch MPI application under DMTCP control,
# checkpointing every 3600 seconds (1 hour)
dmtcp_launch --interval 3600 \
  mpirun -np 64 ./my_simulation arg1 arg2
# Checkpoint images (ckpt_*.dmtcp) and dmtcp_restart_script.sh are written
# to the current directory unless --ckptdir is given
# To restart from the latest checkpoint:
./dmtcp_restart_script.sh
For codes that support native checkpointing (NAMD, OpenFOAM, GROMACS, LS-DYNA), prefer the application’s built-in mechanism — it is aware of the simulation state and produces more compact, more reliable checkpoint files.
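For DMTCP-managed jobs, a job script usually has to decide between a fresh launch and a resume. The sketch below shows one way to make that decision, assuming the job always runs from the same working directory so that DMTCP's generated `dmtcp_restart_script.sh` lands there. The function name `start_command` is hypothetical; it only prints the command so the logic can be inspected without DMTCP installed.

```shell
#!/bin/bash
# Print the command a job should run: resume from the latest DMTCP
# checkpoint if a restart script exists in run_dir, otherwise launch
# the application fresh under DMTCP with hourly checkpoints.
# Arguments: run_dir, then the application command line.
start_command() {
    local run_dir=$1
    shift
    if [ -e "$run_dir/dmtcp_restart_script.sh" ]; then
        echo "$run_dir/dmtcp_restart_script.sh"
    else
        echo "dmtcp_launch --interval 3600 $*"
    fi
}

# Example: show what would run from the current directory.
start_command . mpirun -np 64 ./my_simulation arg1 arg2
```

Because the same script handles both cases, a requeued job picks up from its last checkpoint without manual intervention.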
SLURM Checkpoint Configuration
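SLURM releases before 20.02 included a built-in checkpoint interface. It was removed in 20.02, so the commands below apply only to older installations; on current releases, checkpoint from inside the job script using the application's own mechanism or DMTCP.
# Configure checkpoint for a SLURM job at submission
# (checkpoint every 1 hour)
sbatch --checkpoint=01:00:00 \
  --checkpoint-dir=/scratch/ckpts \
  my_job.sh
# Manual checkpoint of running job
scontrol checkpoint create <jobid>
# Check whether a job can be checkpointed
scontrol checkpoint able <jobid>
# Restart job from checkpoint
scontrol checkpoint restart <jobid>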
For automatic checkpoint-on-preemption:
# slurm.conf
CheckpointType=checkpoint/blcr # requires BLCR library
JobCheckpointDir=/shared/checkpoints
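BLCR (Berkeley Lab Checkpoint/Restart) provides kernel-level checkpoint support on Linux and integrated directly with SLURM through the checkpoint/blcr plugin in those older releases. BLCR development has been inactive for years, so verify that it builds against your kernel before relying on it.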
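Where the checkpoint plugin is not available, a different and widely used technique is to have SLURM signal the batch shell shortly before the time limit and trap that signal in the job script. This is a minimal sketch of that alternative, not the BLCR mechanism described above. The handler name, the flag file, and the 300-second lead time are assumptions; a real job would call its application's checkpoint routine in the handler.

```shell
#!/bin/bash
#SBATCH --signal=B:USR1@300   # send SIGUSR1 to the batch shell 300 s before the time limit
#SBATCH --requeue             # allow the job to be requeued afterwards

# Hypothetical flag file recording that a checkpoint was requested.
CKPT_FLAG=${CKPT_FLAG:-./checkpoint.requested}

# Handler: a real job would trigger the application's checkpoint here.
on_checkpoint_signal() {
    date +%s > "$CKPT_FLAG"
}
trap on_checkpoint_signal USR1

# Run the workload in the background and wait on it, because bash
# defers trap handlers while a foreground command is running:
#   ./my_simulation &
#   wait
```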
Active-Passive Failover for SLURM Controller
The slurmctld controller is the single point of failure for job scheduling. Configure active-passive HA:
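# slurm.conf: primary and backup controllers
# (the first SlurmctldHost is the primary; ControlMachine and
# BackupController are the deprecated pre-18.08 spellings)
SlurmctldHost=mgmt01
SlurmctldHost=mgmt02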
The backup controller (mgmt02) takes over automatically when mgmt01 has been unresponsive for longer than SlurmctldTimeout (120 seconds by default). Both controllers share:
- SLURM state directory (StateSaveLocation, on shared NFS or other shared storage)
- MUNGE authentication keys
- slurmdbd database connection
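# Verify HA status (reports primary and backup as UP or DOWN)
scontrol ping
# Force failover for testing: tell the backup to take control
scontrol takeover
# Or stop the daemon on the primary and wait for the timeout
sudo systemctl stop slurmctld # on mgmt01
# mgmt02 should take over once SlurmctldTimeout expires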
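A failover test is easier to repeat if the pass criterion is scripted. The sketch below counts controllers reported as UP in `scontrol ping` output. It assumes the output lines look like `Slurmctld(primary) at mgmt01 is UP`, which may differ between SLURM versions, and the function name `controllers_up` is hypothetical.

```shell
#!/bin/bash
# Succeed if at least `expected` controllers are reported UP.
# Reads `scontrol ping` output on stdin. Assumed line format:
#   Slurmctld(primary) at mgmt01 is UP
controllers_up() {
    local expected=$1 up
    up=$(grep -c 'Slurmctld(.*) at .* is UP$' || true)
    [ "$up" -ge "$expected" ]
}

# Usage on a cluster (not run here):
#   scontrol ping | controllers_up 2   # before the test: both UP
#   scontrol ping | controllers_up 1   # during failover: one still UP
```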
Storage Replication
For parallel filesystem (BeeGFS) disaster recovery:
BeeGFS Buddy Mirroring (recommended for critical data):
# Enable storage mirroring
beegfs-ctl --addmirrorgroup --automatic --nodetype=storage
# Mirror a specific directory
beegfs-ctl --setpattern --pattern=buddymirror /mnt/beegfs/critical_projects
Rsync replication to offsite NAS:
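#!/bin/bash
# /usr/local/sbin/backup-home-dirs.sh
set -euo pipefail
# Each run writes a new dated snapshot on backup-nas
snap="/backup/snap-$(date +%F)"
# Unchanged files are hard-linked against the previous snapshot
# (--link-dest is resolved on the receiving host)
rsync -az --delete \
  --link-dest=/backup/current \
  /home/ \
  "backup-nas:${snap}/"
# Point "current" at the snapshot that just completed
ssh backup-nas "ln -snf ${snap} /backup/current"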
Object storage replication (for archive tier):
# rclone sync to S3-compatible object storage
rclone sync /mnt/project/archive s3:hpc-archive-bucket \
--transfers 16 \
--checksum \
--log-file /var/log/rclone-backup.log
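Replication jobs can fail silently, so it helps to check backup age against the RPO. The sketch below assumes each backup run touches a marker file on success (the marker path is hypothetical, and none of the scripts above create it); the function name `backup_is_fresh` is also illustrative.

```shell
#!/bin/bash
# Succeed if the marker file exists and was modified within the last
# max_age_hours. A missing marker counts as stale.
# Arguments: marker file path, maximum age in hours (the RPO).
backup_is_fresh() {
    local marker=$1 max_age_hours=$2
    [ -n "$(find "$marker" -maxdepth 0 -mmin "-$(( max_age_hours * 60 ))" 2>/dev/null)" ]
}

# Usage (hypothetical marker path), e.g. from a monitoring cron job:
#   backup_is_fresh /backup/.last_success 24 || echo "home backup exceeds 24 h RPO"
```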
DR Testing Schedule
A DR plan that has never been tested is not a DR plan. Schedule regular tests:
| Test Type | Frequency | What to Test |
|---|---|---|
| SLURM HA failover | Monthly | Promote backup controller, verify job scheduling resumes |
| Backup restore | Quarterly | Restore random files from backup, verify integrity |
| Storage failure | Semi-annual | Pull a BeeGFS storage node, verify Buddy Mirror serves data |
| Full site failover | Annual | Simulate complete primary site failure, restore to DR site |
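The quarterly restore test in the table can be sketched as a byte-for-byte comparison of a random sample of files. This assumes the backup has already been restored into a scratch directory; the function name `verify_restore` and the default sample size of 20 are illustrative. Files modified since the backup ran will also be reported, so mismatches need manual review.

```shell
#!/bin/bash
# Compare a random sample of files between a source tree and a
# restored copy. Prints each mismatching or missing file and returns
# non-zero if any differ.
# Arguments: source dir, restored dir, optional sample size (default 20).
verify_restore() {
    local src=$1 restored=$2 sample=${3:-20} bad
    bad=$(find "$src" -type f | shuf -n "$sample" | while IFS= read -r f; do
        rel=${f#"$src"/}
        cmp -s "$f" "$restored/$rel" || echo "MISMATCH: $rel"
    done)
    [ -z "$bad" ] || { echo "$bad"; return 1; }
}

# Usage (hypothetical paths):
#   verify_restore /home/alice /restore-test/home/alice 50
```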
DR Checklist
Before declaring an HPC cluster production-ready from a DR perspective:
- SLURM active-passive HA configured and tested
- slurmdbd replicated to backup host
- Home directory backup running daily with verified restores
- Project storage snapshotted and backed up offsite
- Scratch/NVMe storage documented (not backed up, acceptable)
- Application checkpoint enabled for jobs > 4 hours
- MUNGE keys backed up securely
- Cluster configuration (Ansible/Salt) stored in version control
- DR runbook documented and accessible offline
- DR contact list and escalation path defined
- RTO/RPO targets documented and accepted by stakeholders
HPC disaster recovery is less about preventing failures (failures are inevitable) and more about bounding their impact on research productivity. A cluster that recovers in 4 hours and resumes a 24-hour job from its hour-23 checkpoint is a very different outcome from one that takes 48 hours to rebuild and restarts jobs from the beginning.