HPC Disaster Recovery Planning: RTO, RPO, Checkpoint, Failover

An HPC cluster is not a transaction processing system — the failure of a node during a 72-hour simulation does not corrupt a database. But it does destroy hours or days of computation. Effective HPC disaster recovery addresses two distinct challenges: recovering the infrastructure quickly after failures, and preserving the computational work in progress so jobs can resume rather than restart from scratch.

RTO and RPO in HPC Context

Traditional IT disaster recovery metrics apply to HPC with nuances:

Recovery Time Objective (RTO): How quickly must the cluster be restored to operational state? For research clusters, 4–24 hours is typically acceptable (researchers work around outages). For production HPC supporting business-critical simulations, RTO may be 1–4 hours.

Recovery Point Objective (RPO): How much data can be lost? For user data on persistent storage, RPO should be 24 hours or less (daily backup). For active job state, RPO depends on checkpoint frequency — a job that checkpoints every 2 hours can lose at most 2 hours of computation.

Job restart vs. resume: This is HPC-specific. Without checkpointing, a failed job must restart from scratch. With checkpointing, it resumes from the last checkpoint. For a 5-day simulation that loses a node after day 4, that is the difference between repeating four days of computation and repeating only the work done since the last checkpoint.
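The arithmetic behind restart vs. resume can be sketched in a few lines of shell. The function name `lost_hours` and the failure at hour 97 are illustrative assumptions, and the sketch assumes checkpoints complete instantly at exact multiples of the interval.

```shell
#!/bin/bash
# Hypothetical helper: hours of computation lost when a job fails.
# $1 = hour of the run at which the failure occurs
# $2 = checkpoint interval in hours (0 means the job never checkpoints)
lost_hours() {
  local failed_at=$1 interval=$2
  if (( interval == 0 )); then
    echo "$failed_at"                  # no checkpoint: all work so far is lost
  else
    echo $(( failed_at % interval ))   # only the work since the last checkpoint
  fi
}

# 5-day (120-hour) simulation that fails at hour 97, i.e. just after day 4
echo "no checkpointing:   $(lost_hours 97 0) h lost"
echo "2-hour checkpoints: $(lost_hours 97 2) h lost"
```

In practice the loss is slightly larger than the modulo result, because a checkpoint that was still being written at the moment of failure cannot be used.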

The 3-2-1 Rule Applied to HPC

The 3-2-1 backup rule (3 copies, 2 different media, 1 offsite) applies to HPC data as follows:

Copy	Location	Media	Purpose
Primary	Production parallel filesystem (BeeGFS)	NVMe + SAS	Active job data
Secondary	Backup NAS	HDD RAID	Daily snapshots of user data
Offsite	Object storage or tape	Cloud/LTO	Long-term archive, DR copy

Active scratch (job I/O) is typically not backed up — it is by definition transient. User home directories and project storage should be snapshotted daily.
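A daily snapshot policy is only useful if something notices when snapshots stop arriving. The following is a minimal sketch of such a check, assuming GNU find and one entry per snapshot directly under the snapshot directory; the function name `snapshot_within_rpo` and the path in the usage note are hypothetical.

```shell
#!/bin/bash
# Succeed only if at least one entry directly under the snapshot directory
# was modified within the RPO window.
# $1 = snapshot directory, $2 = RPO in hours
snapshot_within_rpo() {
  local snap_dir=$1 rpo_hours=$2
  local recent
  # -mmin -N matches entries modified less than N minutes ago;
  # -print -quit stops at the first match (GNU find).
  recent=$(find "$snap_dir" -mindepth 1 -maxdepth 1 \
             -mmin "-$(( rpo_hours * 60 ))" -print -quit)
  [[ -n "$recent" ]]
}
```

A cron job could call `snapshot_within_rpo /backup/home 24` and raise an alert whenever it returns a non-zero status.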

Checkpoint/Restart with DMTCP

DMTCP (Distributed MultiThreaded CheckPointing) can checkpoint running MPI applications without any source code modification. This makes it applicable to legacy HPC codes where source is unavailable.

# Install DMTCP
git clone https://github.com/dmtcp/dmtcp
cd dmtcp && ./configure && make && sudo make install

# Launch MPI application under DMTCP control,
# checkpointing every 3600 seconds (1 hour)
dmtcp_launch --interval 3600 \
  mpirun -np 64 ./my_simulation arg1 arg2

# Checkpoint images (ckpt_*.dmtcp) and a generated dmtcp_restart_script.sh
# are written to the working directory unless --ckptdir is given
# To restart from the most recent checkpoint:
./dmtcp_restart_script.sh

For codes that support native checkpointing (NAMD, OpenFOAM, GROMACS, LS-DYNA), prefer the application’s built-in mechanism — it is aware of the simulation state and produces more compact, more reliable checkpoint files.
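As an illustration of a built-in mechanism, GROMACS writes a checkpoint every `-cpt` minutes and continues from one when given `-cpi`. The sketch below only assembles the command line; the function name `build_mdrun_cmd` and the run name `md` are hypothetical, and it assumes `-deffnm md` makes the checkpoint file `md.cpt`.

```shell
#!/bin/bash
# Print a gmx mdrun command line that checkpoints periodically and
# continues from an existing checkpoint file when one is present.
# $1 = run name passed to -deffnm, $2 = checkpoint interval in minutes
build_mdrun_cmd() {
  local name=$1 interval_min=$2
  local cmd=(gmx mdrun -deffnm "$name" -cpt "$interval_min")
  if [[ -f "$name.cpt" ]]; then
    cmd+=(-cpi "$name.cpt")   # resume instead of starting over
  fi
  printf '%s\n' "${cmd[*]}"
}

build_mdrun_cmd md 60
```

In a real job script the array would be executed with `"${cmd[@]}"` rather than printed, so the same script serves both the first submission and every resubmission.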

SLURM Checkpoint Configuration

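Older SLURM releases exposed checkpointing through sbatch and scontrol. The checkpoint plugin interface was removed in SLURM 20.02, so the commands below apply only to legacy installations: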

# Configure checkpoint for a SLURM job at submission:
# checkpoint every 1 hour, writing to /scratch/ckpts
sbatch --checkpoint=01:00:00 \
       --checkpoint-dir=/scratch/ckpts \
       my_job.sh

# Manual checkpoint of running job
scontrol checkpoint create <jobid>

# Check whether the job can currently be checkpointed
scontrol checkpoint able <jobid>

# Restart job from checkpoint
scontrol checkpoint restart <jobid>

For automatic checkpoint-on-preemption:

# slurm.conf
CheckpointType=checkpoint/blcr    # requires BLCR library
JobCheckpointDir=/shared/checkpoints

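BLCR (Berkeley Lab Checkpoint/Restart) provides kernel-level checkpoint support on Linux and integrated directly with older SLURM releases through the checkpoint/blcr plugin. BLCR is no longer actively developed and the plugin was removed in SLURM 20.02, so on current releases checkpointing has to come from the application itself or from a tool such as DMTCP, combined with job requeue.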
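Where the BLCR plugin is not available, a comparable effect at the time limit can be sketched with a batch script that asks SLURM for an early warning signal and then requeues itself. This is an illustration under stated assumptions, not a drop-in replacement: it assumes the application writes a checkpoint and exits when it receives SIGUSR1 and resumes from that checkpoint on its next start. `./my_simulation` is the placeholder binary used earlier, and `on_usr1` and `CKPT_REQUESTED` are hypothetical names.

```shell
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --requeue                 # allow this job to be requeued
#SBATCH --signal=B:USR1@300       # SIGUSR1 to the batch shell 300 s before the time limit

CKPT_REQUESTED=0

# Assumption: the application checkpoints and exits on SIGUSR1.
on_usr1() {
  CKPT_REQUESTED=1
  if [[ -n "${APP_PID:-}" ]]; then
    kill -USR1 "$APP_PID" 2>/dev/null   # forward the signal to the application
    wait "$APP_PID"                     # let it finish writing the checkpoint
  fi
  # Put the job back in the queue (only meaningful inside a SLURM job).
  if [[ -n "${SLURM_JOB_ID:-}" ]]; then
    scontrol requeue "$SLURM_JOB_ID"
  fi
}
trap on_usr1 USR1

# Run the application in the background so the shell can react to the signal.
if [[ -x ./my_simulation ]]; then
  ./my_simulation arg1 arg2 &
  APP_PID=$!
  wait "$APP_PID"
fi
```

Running the application in the background matters: bash only runs a trap handler once the current foreground command returns, whereas `wait` is interrupted as soon as the signal arrives.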

Active-Passive Failover for SLURM Controller

The slurmctld controller is the single point of failure for job scheduling. Configure active-passive HA:

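# slurm.conf: primary and backup controllers
# (SlurmctldHost replaces ControlMachine/BackupController, deprecated since 18.08;
#  the first SlurmctldHost entry is the primary)
SlurmctldHost=mgmt01
SlurmctldHost=mgmt02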

The backup controller (mgmt02) automatically promotes to primary if mgmt01 becomes unreachable. Both controllers share:

SLURM state directory (StateSaveLocation, on NFS or other shared storage)
MUNGE authentication keys
slurmdbd database connection
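A mismatched MUNGE key is an easy way to end up with a backup controller that cannot authenticate after failover. One way to check the shared key is to compare digests rather than copying the key around. The helper names `digest_of` and `keys_match` are hypothetical, and the sketch assumes `sha256sum` from GNU coreutils is available.

```shell
#!/bin/bash
# Print the SHA-256 digest of a file (digest only, without the file name).
digest_of() {
  sha256sum "$1" | awk '{print $1}'
}

# Succeed if the local file's digest equals a digest obtained elsewhere.
# $1 = local file, $2 = expected digest string
keys_match() {
  [[ "$(digest_of "$1")" == "$2" ]]
}
```

On mgmt01 the expected digest could be fetched with something like `ssh mgmt02 sudo sha256sum /etc/munge/munge.key | awk '{print $1}'` and passed to `keys_match /etc/munge/munge.key` as the second argument; only the digest crosses the network, never the key itself.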

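# Verify HA status (reports whether each controller is UP or DOWN)
scontrol ping
# Force failover for testing: stop slurmctld on the primary (mgmt01) only
sudo systemctl stop slurmctld
# mgmt02 should take over once SlurmctldTimeout expires (default 120 seconds);
# alternatively, run "scontrol takeover" on mgmt02 to promote it immediately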

Storage Replication

For parallel filesystem (BeeGFS) disaster recovery:

BeeGFS Buddy Mirroring (recommended for critical data):

# Enable storage mirroring
beegfs-ctl --addmirrorgroup --automatic --nodetype=storage

# Mirror a specific directory
beegfs-ctl --setpattern --pattern=buddymirror /mnt/beegfs/critical_projects

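Rsync replication to the backup NAS:

#!/bin/bash
# /usr/local/sbin/backup-home-dirs.sh
set -euo pipefail
SNAPSHOT="$(date +%F)"   # one snapshot directory per day (YYYY-MM-DD)

# --link-dest is resolved on the receiving host: files unchanged since the
# previous snapshot are hard-linked instead of being copied again
rsync -az --delete \
  --link-dest=/backup/current \
  /home/ \
  "backup-nas:/backup/${SNAPSHOT}/"

# Point "current" at the snapshot that was just written
ssh backup-nas "ln -snf /backup/${SNAPSHOT} /backup/current"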

Object storage replication (for archive tier):

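# rclone sync to S3-compatible object storage ("s3" is the remote name from rclone config)
# Note: sync also deletes destination objects that no longer exist in the source;
# use "rclone copy" if the archive must keep them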
rclone sync /mnt/project/archive s3:hpc-archive-bucket \
  --transfers 16 \
  --checksum \
  --log-file /var/log/rclone-backup.log

DR Testing Schedule

A DR plan that has never been tested is not a DR plan. Schedule regular tests:

Test Type	Frequency	What to Test
SLURM HA failover	Monthly	Promote backup controller, verify job scheduling resumes
Backup restore	Quarterly	Restore random files from backup, verify integrity
Storage failure	Semi-annual	Pull a BeeGFS storage node, verify Buddy Mirror serves data
Full site failover	Annual	Simulate complete primary site failure, restore to DR site
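The quarterly restore test can be partly automated. The sketch below compares a random sample of files in a source tree against a restored copy. It assumes GNU `shuf` and `cmp`, file names without embedded newlines, and data that has not changed since the backup was taken (archive or project data rather than active home directories). The function name `verify_sample` is hypothetical.

```shell
#!/bin/bash
# Compare a random sample of files between a source tree and a restored copy.
# $1 = source directory, $2 = restored directory, $3 = sample size (default 10)
# Returns 0 if every sampled file is byte-identical, 1 otherwise.
verify_sample() {
  local src=$1 bak=$2 count=${3:-10}
  local rel failures=0
  while IFS= read -r rel; do
    if ! cmp -s "$src/$rel" "$bak/$rel"; then
      echo "MISMATCH: $rel" >&2
      failures=$(( failures + 1 ))
    fi
  done < <(cd "$src" && find . -type f | shuf -n "$count")
  return $(( failures > 0 ))
}
```

A missing file in the restored tree counts as a mismatch, because `cmp -s` fails when either file cannot be read.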

DR Checklist

Before declaring an HPC cluster production-ready from a DR perspective:

SLURM active-passive HA configured and tested
slurmdbd replicated to backup host
Home directory backup running daily with verified restores
Project storage snapshotted and backed up offsite
Scratch/NVMe storage documented (not backed up, acceptable)
Application checkpoint enabled for jobs > 4 hours
MUNGE keys backed up securely
Cluster configuration (Ansible/Salt) stored in version control
DR runbook documented and accessible offline
DR contact list and escalation path defined
RTO/RPO targets documented and accepted by stakeholders

HPC disaster recovery is less about preventing failures (failures are inevitable) and more about bounding their impact on research productivity. A cluster that recovers in 4 hours and resumes a 24-hour job from its hour-23 checkpoint is a very different outcome from one that takes 48 hours to rebuild and restarts jobs from the beginning.

