HPC Backup Guide: Rsync, BeeGFS Mirroring, Tape, Rclone Cloud

HPC backup is fundamentally different from standard enterprise backup. The data volumes are orders of magnitude larger (petabytes vs. terabytes), the data has heterogeneous value (scratch job files vs. irreplaceable research data), and the backup window is constrained by active job I/O. Getting HPC backup right requires a different architecture than enterprise backup tools are designed for.

Why HPC Backup Differs

Aspect	Enterprise IT Backup	HPC Backup
Data volume	10s of TB	Hundreds of TB to PBs
Data velocity	Moderate	Very high during jobs
Data value	Uniformly important	Highly heterogeneous
Backup window	Nightly (hours)	Often no dedicated window
Restore pattern	Individual files/VMs	Entire project directories
Tool scalability	Designed for TB	Must handle PBs

Data Classification

The first step is classifying data by value and change rate:

Tier	Location	Backup Priority	Retention	Method
Scratch	/scratch	None (transient)	None	No backup
Home directories	/home	High	90 days daily	Rsync + snapshots
Project/group data	/project	High	1 year + archive	Rsync + tape
Results/publications	/archive	Critical	10+ years	Tape + cloud
System config	/etc, /opt	Medium	Indefinite	Git + image backup

Scratch filesystems typically hold 70–80% of total storage capacity but almost none of its long-term value: they contain active job temporaries that are regenerated on every run. Never back up scratch.
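One way to make the classification enforceable is to encode the tier table as a lookup that backup scripts consult before touching a path. The sketch below is illustrative only: the function name backup_method_for is hypothetical, and the path prefixes are simply the ones listed in the table above.

```bash
#!/bin/bash
# Map a path to the backup method from the classification table.
# backup_method_for is a hypothetical helper name, not part of any tool.
backup_method_for() {
    case "$1" in
        /scratch|/scratch/*)     echo "none" ;;
        /home|/home/*)           echo "rsync+snapshots" ;;
        /project|/project/*)     echo "rsync+tape" ;;
        /archive|/archive/*)     echo "tape+cloud" ;;
        /etc|/etc/*|/opt|/opt/*) echo "git+image" ;;
        *)                       echo "unclassified" ;;
    esac
}

# Example: a backup wrapper can skip any path whose method is "none"
# backup_method_for /scratch/job123/tmp.dat   # prints: none
```

Returning "unclassified" for unknown paths, rather than silently skipping them, makes gaps in the classification visible.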

Incremental Backup with Rsync

For home directories and project storage, rsync-based incremental backup with hard links provides space-efficient daily snapshots:

#!/bin/bash
# /usr/local/sbin/hpc-backup.sh — daily incremental backup

BACKUP_SOURCE="/project"
BACKUP_DEST="/backup-nas/hpc-backups"
DATE=$(date +%Y-%m-%d)
LATEST="${BACKUP_DEST}/latest"
TARGET="${BACKUP_DEST}/${DATE}"

# Hard-link unchanged files against the previous snapshot with --link-dest
# (unchanged files share inodes, consuming minimal space)
LINK_DEST=()
if [ -d "$LATEST" ]; then
    LINK_DEST=(--link-dest="$(readlink -f "$LATEST")")
fi

# rsync: transfer only changed files
# (-z dropped: compression costs CPU and gains nothing on a locally mounted NAS)
rsync -a \
  "${LINK_DEST[@]}" \
  --delete \
  --exclude='*.tmp' \
  --exclude='.Trash*' \
  --log-file="/var/log/hpc-backup-${DATE}.log" \
  "${BACKUP_SOURCE}/" \
  "${TARGET}/"

RSYNC_EXIT=$?

if [ $RSYNC_EXIT -eq 0 ]; then
    # Update the "latest" symlink
    ln -snf "$TARGET" "$LATEST"
    echo "Backup completed successfully: ${TARGET}"
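else
    echo "Backup FAILED with exit code: ${RSYNC_EXIT}" >&2
    # Alert monitoring system (Alertmanager API v2; the v1 API was removed in Alertmanager 0.27)
    curl -s -X POST "https://alertmanager:9093/api/v2/alerts" \
      -H "Content-Type: application/json" \
      -d "[{\"labels\":{\"alertname\":\"BackupFailed\",\"severity\":\"critical\"}}]"
    exit "$RSYNC_EXIT"
fi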

# Cron job: run backup nightly at 2 AM
0 2 * * * /usr/local/sbin/hpc-backup.sh >> /var/log/hpc-backup-cron.log 2>&1

This approach creates daily snapshot directories where unchanged files share hard links — a 10 TB project directory with 5% daily change rate creates only ~500 GB of new data per snapshot rather than 10 TB.
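The arithmetic generalizes to a quick capacity estimate: one full copy plus one delta per additional snapshot. The sketch below assumes decimal units (1 TB = 1000 GB), integer gigabytes, and a constant daily change rate; the function name estimate_snapshot_gb is hypothetical.

```bash
#!/bin/bash
# Estimate total space for N hard-linked snapshots:
# one full copy plus (N - 1) daily deltas. Integer GB throughout.
# estimate_snapshot_gb is a hypothetical helper name.
estimate_snapshot_gb() {
    local size_gb=$1 change_pct=$2 snapshots=$3
    local delta_gb=$(( size_gb * change_pct / 100 ))
    echo $(( size_gb + (snapshots - 1) * delta_gb ))
}

# 10 TB project, 5% daily change, 7 daily snapshots
estimate_snapshot_gb 10000 5 7    # prints 13000
```

Seven daily snapshots of the 10 TB example therefore need roughly 13 TB rather than 70 TB. Treat the result as an upper-bound planning figure, since real change rates vary from day to day.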

BeeGFS Buddy Mirroring

For the parallel filesystem layer, BeeGFS Buddy Mirroring provides real-time storage redundancy. It protects against storage node failure but is not a backup: deletions and corruption are mirrored immediately, so point-in-time copies still require a separate backup.

# Enable automatic mirror group creation
beegfs-ctl --addmirrorgroup --automatic --nodetype=storage

# Enable mirroring for critical project directories
beegfs-ctl --setpattern --pattern=buddymirror --numtargets=4 \
  /mnt/beegfs/project/critical_data

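# Check mirror group membership and target state
beegfs-ctl --listmirrorgroups --nodetype=storage
beegfs-ctl --listtargets --nodetype=storage --state

# Trigger resync after a node failure and recovery
# (resync normally starts automatically; option names vary by release, see beegfs-ctl --help)
beegfs-ctl --startresync --nodetype=storage --targetid=201
beegfs-ctl --resyncstats --nodetype=storage --targetid=201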

Tape Archiving with Bacula/Bareos

For long-term archive (publications, raw experimental data, final simulation results), tape (LTO-9: 18 TB native / 45 TB compressed per cartridge) provides the lowest cost per TB for cold data:

Bareos configuration (open-source Bacula fork):

# /etc/bareos/bareos-dir.conf — archive job definition

Job {
  Name = "HPC-Archive"
  Type = Backup
  Level = Incremental
  Client = hpc-archive-client
  FileSet = "HPC-Archive-FileSet"
  Schedule = "Monthly-Archive"
  Storage = LTO-Tape-Library
  Pool = Archive-Pool
  Priority = 5
  Write Bootstrap = "/var/lib/bareos/bootstrap/%c.bsr"
}

FileSet {
  Name = "HPC-Archive-FileSet"
  Include {
    Options {
      signature = MD5
      compression = GZIP
    }
    File = /project/archive
    File = /project/publications
  }
  Exclude {
    File = /project/archive/tmp
  }
}

Pool {
  Name = "Archive-Pool"
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 10 years
  Label Format = "Archive-"
  Storage = LTO-Tape-Library
}

LTFS (Linear Tape File System) is an alternative that makes LTO tapes mountable as POSIX filesystems, eliminating the need for a tape management server for simple archive use cases:

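# Format a tape with LTFS (erases the cartridge; mkltfs formats, ltfsck only checks)
# The device node depends on the LTFS driver backend (some builds expect /dev/sgN)
mkltfs --device=/dev/st0

# Mount the tape
mkdir -p /mnt/tape
ltfs /mnt/tape -o devname=/dev/st0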

# Copy data to tape
rsync -av /project/archive/2024/ /mnt/tape/archive_2024/

# Unmount
umount /mnt/tape

Cloud Object Storage with Rclone

For off-site backup and geographically distributed data protection, rclone supports S3-compatible object storage (AWS S3, MinIO, Ceph RGW, Wasabi):

# Configure rclone remote (placeholder credentials; in production prefer env_auth true or a credentials file)
rclone config create hpc-archive s3 \
  provider AWS \
  access_key_id AKIAIOSFODNN7EXAMPLE \
  secret_access_key wJalrXUtnFEMI/K7MDENG \
  region eu-west-1 \
  acl private

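# Sync archive tier to cloud (parallel transfers)
# Note: sync also deletes remote objects that were removed locally;
# enable bucket versioning or use "rclone copy" if that is not wanted
rclone sync /project/archive hpc-archive:hpc-backup-archive \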
  --transfers 16 \
  --s3-upload-concurrency 8 \
  --checksum \
  --log-level INFO \
  --log-file /var/log/rclone-archive.log

# Restore from cloud
rclone copy hpc-archive:hpc-backup-archive/2024/simulation-xyz /restore/simulation-xyz

For very large data volumes, use lifecycle policies to tier S3 objects to S3 Glacier or Glacier Deep Archive for further cost reduction after 90 days.
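A lifecycle rule of this kind can be sketched as follows. The 90-day threshold comes from the text above; the helper name write_lifecycle_policy, the rule ID, and the bucket name in the usage comment are assumptions for illustration. Swapping GLACIER for DEEP_ARCHIVE selects the colder tier.

```bash
#!/bin/bash
# Write an S3 lifecycle policy that moves objects to Glacier after 90 days.
# write_lifecycle_policy and the rule ID are hypothetical names.
write_lifecycle_policy() {
    cat > "$1" <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-to-glacier",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
EOF
}

# Usage (requires the AWS CLI and permission to modify the bucket):
# write_lifecycle_policy lifecycle.json
# aws s3api put-bucket-lifecycle-configuration \
#   --bucket hpc-backup-archive \
#   --lifecycle-configuration file://lifecycle.json
```

Objects that have transitioned to a Glacier class must be restored before they can be read, so factor the retrieval delay and cost into restore planning.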

Backup Verification

Backups that are never tested are not backups — they are hopes. Establish a regular verification schedule:

# Verify rsync backup integrity for a random project
# (files modified since the last backup will legitimately show up as differences)
PROJECT=$(ls /backup-nas/hpc-backups/latest/ | shuf -n 1)
rsync -a --dry-run --checksum --itemize-changes \
  "/project/${PROJECT}/" \
  "/backup-nas/hpc-backups/latest/${PROJECT}/" \
  | tail -20

# Test restore of random files
# (the trailing slash makes find descend through the "latest" symlink)
find /backup-nas/hpc-backups/latest/ -name "*.h5" | shuf -n 5 | while IFS= read -r f; do
    DEST=$(mktemp -d)
    cp "$f" "$DEST/"
    if h5check "$DEST/$(basename "$f")"; then echo "OK: $f"; else echo "CORRUPT: $f"; fi
    rm -rf "$DEST"
done
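The rsync comparison above needs the live source to be present and unchanged. For tape and cloud copies, a stored checksum manifest allows verification without the source. The following is a minimal sketch, assuming GNU coreutils and findutils; write_manifest and verify_manifest are hypothetical helper names, and the manifest file must live outside the directory being hashed.

```bash
#!/bin/bash
# Record a SHA-256 checksum for every file under a directory.
# write_manifest / verify_manifest are hypothetical helper names.
write_manifest() {
    local dir=$1 manifest=$2
    # Paths are stored relative to $dir so the manifest stays valid after a restore elsewhere
    ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$manifest"
}

# Re-hash the files and compare against the manifest.
# Returns non-zero if any file is missing or differs.
verify_manifest() {
    local dir=$1 manifest=$2
    ( cd "$dir" && sha256sum --quiet -c - ) < "$manifest"
}

# Usage:
# write_manifest /project/archive/2024 /var/lib/hpc-backup/archive_2024.sha256
# verify_manifest /restore/archive_2024 /var/lib/hpc-backup/archive_2024.sha256
```

Writing the manifest at archive time and verifying it after a test restore checks the whole chain, including the tape or cloud round trip.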

Retention Policies

Data Type	Hot Copy	Warm Copy	Archive
Home directories	7 daily snapshots	4 weekly snapshots	12 monthly snapshots
Project active data	7 daily	4 weekly	2 yearly
Final results	7 daily	Indefinite	Indefinite on tape
Raw experimental data	7 daily	Indefinite	Indefinite on tape+cloud
Published dataset	N/A	7 years	Indefinite
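
For the date-named rsync snapshot directories, the "7 daily" column can be enforced with a small pruning helper. This sketch covers only the daily tier (weekly and monthly promotion would need additional logic), assumes snapshots are named YYYY-MM-DD as in the backup script, and uses the hypothetical function name snapshots_to_prune. It prints candidates rather than deleting them.

```bash
#!/bin/bash
# List snapshot directories (named YYYY-MM-DD) beyond the newest $2 entries.
# Prints candidates only; deletion is left to the caller.
# snapshots_to_prune is a hypothetical helper name.
snapshots_to_prune() {
    local dest=$1 keep=$2
    # -type d skips the "latest" symlink; ISO dates sort chronologically as text
    find "$dest" -mindepth 1 -maxdepth 1 -type d \
         -name '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]' \
        | sort -r \
        | tail -n +"$(( keep + 1 ))"
}

# Usage (review the list first, then delete):
# snapshots_to_prune /backup-nas/hpc-backups 7
# snapshots_to_prune /backup-nas/hpc-backups 7 | xargs -r rm -rf --
```

Because snapshots share hard links, deleting an old snapshot frees only the blocks that no newer snapshot still references.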

HPC backup is a risk management exercise: you are trading backup infrastructure cost against the probability and impact of data loss. The cost of losing irreplaceable research data almost always exceeds the cost of a well-designed backup system. Contact Mevasis for HPC backup architecture assessment and implementation.
