HPC Backup Strategy: Data Classification, Incremental Backup, Tape Archive, and Cloud

Comprehensive HPC backup strategy: why HPC backup differs from standard IT, data classification, rsync incremental backup scripts, BeeGFS Buddy Mirroring, LTO-9 tape archiving with Bacula/Bareos, rclone for object storage, 3-2-1 rule for HPC, and retention policies.

HPC backup is fundamentally different from standard enterprise backup. The data volumes are orders of magnitude larger (petabytes vs. terabytes), the data has heterogeneous value (scratch job files vs. irreplaceable research data), and the backup window is constrained by active job I/O. Getting HPC backup right requires a different architecture than enterprise backup tools are designed for.

Why HPC Backup Differs

| Aspect | Enterprise IT Backup | HPC Backup |
| --- | --- | --- |
| Data volume | 10s of TB | Hundreds of TB to PBs |
| Data velocity | Moderate | Very high during jobs |
| Data value | Uniformly important | Highly heterogeneous |
| Backup window | Nightly (hours) | Often no dedicated window |
| Restore pattern | Individual files/VMs | Entire project directories |
| Tool scalability | Designed for TB | Must handle PBs |
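
To make the backup-window problem concrete, the following sketch estimates how long a single full pass over a dataset takes. The function name `full_copy_hours` and the 1 GB/s sustained throughput are illustrative assumptions, not measurements from any particular system.

```shell
# full_copy_hours DATA_TB THROUGHPUT_GB_PER_S
# Hours needed to read DATA_TB once at a sustained throughput
# (decimal units, 1 TB = 1000 GB). Integer arithmetic, rounded down.
full_copy_hours() {
    data_tb=$1
    gb_per_s=$2
    echo $(( data_tb * 1000 / gb_per_s / 3600 ))
}

full_copy_hours 50 1      # 50 TB at 1 GB/s
full_copy_hours 2000 1    # 2 PB at 1 GB/s
```

Under these assumptions, 50 TB fits in a 13-hour window, while 2 PB needs about 555 hours (roughly 23 days), which is why full nightly backups stop being an option at HPC scale.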

Data Classification

The first step is classifying data by value and change rate:

| Tier | Location | Backup Priority | Retention | Method |
| --- | --- | --- | --- | --- |
| Scratch | /scratch | None (transient) | None | No backup |
| Home directories | /home | High | 90 days daily | Rsync + snapshots |
| Project/group data | /project | High | 1 year + archive | Rsync + tape |
| Results/publications | /archive | Critical | 10+ years | Tape + cloud |
| System config | /etc, /opt | Medium | Indefinite | Git + image backup |

Scratch filesystems typically hold 70–80% of total storage capacity but almost none of its value: they contain active job temporaries that are recreated on every run. Never back up scratch.
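
The classification can be encoded so that backup tooling decides per path what to do. This is a minimal sketch: the function name `backup_tier_for_path` and the returned labels are hypothetical, and the mount points are the ones from the table above.

```shell
# Hypothetical helper: map a path to the backup method from the
# classification table. The labels are illustrative, not a standard.
backup_tier_for_path() {
    case "$1" in
        /scratch|/scratch/*)     echo "none" ;;
        /home|/home/*)           echo "rsync+snapshots" ;;
        /project|/project/*)     echo "rsync+tape" ;;
        /archive|/archive/*)     echo "tape+cloud" ;;
        /etc|/etc/*|/opt|/opt/*) echo "git+image" ;;
        *)                       echo "unclassified" ;;
    esac
}
```

A backup wrapper could call this before touching a directory and refuse to proceed for `none` or `unclassified`, so that new mount points are not silently skipped or silently backed up.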

Incremental Backup with Rsync

For home directories and project storage, rsync-based incremental backup with hard links provides space-efficient daily snapshots:

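#!/bin/bash
# /usr/local/sbin/hpc-backup.sh: daily incremental backup

BACKUP_SOURCE="/project"
BACKUP_DEST="/backup-nas/hpc-backups"
DATE=$(date +%Y-%m-%d)
LATEST="${BACKUP_DEST}/latest"
TARGET="${BACKUP_DEST}/${DATE}"

# Hard-link unchanged files against the previous snapshot so they share
# inodes and consume minimal space. --link-dest is used instead of a
# separate "cp -al" step, because "cp -al" on the "latest" symlink copies
# the symlink itself rather than the snapshot it points to.
LINK_DEST=()
if [ -d "$LATEST" ]; then
    LINK_DEST=(--link-dest="$(readlink -f "$LATEST")")
fi

# rsync: transfer only changed files
# (no -z: compression costs CPU and rarely helps on a locally mounted NAS)
rsync -a \
  --delete \
  "${LINK_DEST[@]}" \
  --exclude='*.tmp' \
  --exclude='.Trash*' \
  --log-file="/var/log/hpc-backup-${DATE}.log" \
  "${BACKUP_SOURCE}/" \
  "${TARGET}/"

RSYNC_EXIT=$?

# Exit code 24 means source files vanished during the transfer, which is
# expected on a filesystem with running jobs.
if [ "$RSYNC_EXIT" -eq 0 ] || [ "$RSYNC_EXIT" -eq 24 ]; then
    # Update the "latest" symlink
    ln -snf "$TARGET" "$LATEST"
    echo "Backup completed successfully: ${TARGET}"
else
    echo "Backup FAILED with exit code: ${RSYNC_EXIT}" >&2
    # Alert monitoring system (Alertmanager v2 API; v1 is removed in current releases)
    curl -s -X POST "https://alertmanager:9093/api/v2/alerts" \
      -H "Content-Type: application/json" \
      -d "[{\"labels\":{\"alertname\":\"BackupFailed\",\"severity\":\"critical\"}}]"
    exit "$RSYNC_EXIT"
fi

# Cron job (root crontab, separate from the script): run backup nightly at 2 AM
0 2 * * * /usr/local/sbin/hpc-backup.sh >> /var/log/hpc-backup-cron.log 2>&1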

This approach creates daily snapshot directories where unchanged files share hard links — a 10 TB project directory with 5% daily change rate creates only ~500 GB of new data per snapshot rather than 10 TB.
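
The same arithmetic extends to a full snapshot set. The sketch below assumes that every changed file is stored again in full and that the daily change rate is constant; `snapshot_storage_gb` is an illustrative name, not part of the backup script.

```shell
# snapshot_storage_gb BASE_GB CHANGE_PERCENT SNAPSHOTS
# Estimated space for N hard-linked snapshots: one full copy plus
# (N - 1) increments of BASE * CHANGE%. Integer arithmetic, in GB.
snapshot_storage_gb() {
    base_gb=$1
    change_pct=$2
    snapshots=$3
    daily_delta_gb=$(( base_gb * change_pct / 100 ))
    echo $(( base_gb + (snapshots - 1) * daily_delta_gb ))
}

snapshot_storage_gb 10000 5 7    # 10 TB, 5% daily change, 7 snapshots
```

For the 10 TB example, seven daily snapshots need about 13,000 GB (13 TB) in total, compared with 70 TB for seven independent full copies.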

BeeGFS Buddy Mirroring

For the parallel filesystem layer, BeeGFS Buddy Mirroring provides real-time storage redundancy — not a backup (point-in-time copies require separate backup) but protection against storage node failure:

# Enable automatic mirror group creation
beegfs-ctl --addmirrorgroup --automatic --nodetype=storage

# Enable mirroring for critical project directories
beegfs-ctl --setpattern --pattern=buddymirror --numtargets=4 \
  /mnt/beegfs/project/critical_data

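# Check mirror group membership and target state
beegfs-ctl --listmirrorgroups --nodetype=storage
beegfs-ctl --listtargets --nodetype=storage --state

# Trigger a full resync after a node failure and recovery, then monitor progress
# (mode names differ between BeeGFS releases; confirm with beegfs-ctl --help)
beegfs-ctl --startresync --nodetype=storage --targetid=201 --timestamp=0 --restart
beegfs-ctl --resyncstats --nodetype=storage --targetid=201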

Tape Archiving with Bacula/Bareos

For long-term archive (publications, raw experimental data, final simulation results), tape (LTO-9: 18 TB native / 45 TB compressed per cartridge) provides the lowest cost per TB for cold data:

Bareos configuration (open-source Bacula fork):

# /etc/bareos/bareos-dir.conf — archive job definition

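Job {
  Name = "HPC-Archive"
  Type = Backup
  Level = Incremental
  Client = hpc-archive-client
  FileSet = "HPC-Archive-FileSet"
  Schedule = "Monthly-Archive"
  Storage = LTO-Tape-Library
  Pool = Archive-Pool
  # A Job needs a Messages resource unless a JobDefs supplies one;
  # "Standard" assumes the resource from the default installation.
  Messages = Standard
  Priority = 5
  Write Bootstrap = "/var/lib/bareos/bootstrap/%c.bsr"
}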

FileSet {
  Name = "HPC-Archive-FileSet"
  Include {
    Options {
      signature = MD5
      compression = GZIP
    }
    File = /project/archive
    File = /project/publications
  }
  Exclude {
    File = /project/archive/tmp
  }
}

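Pool {
  Name = "Archive-Pool"
  Pool Type = Backup
  # Volumes become reusable once Volume Retention expires; set Recycle = no
  # for data that must be kept indefinitely.
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 10 years
  Label Format = "Archive-"
  # The media type (tape) is defined by the Storage resource referenced
  # in the Job, not in the Pool.
}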
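
For capacity planning, the LTO-9 native figure quoted above translates directly into cartridge counts. This is a rough sketch: `cartridges_needed` is an illustrative name, and it plans with native capacity only, on the assumption that already-compressed scientific data gains little from drive compression.

```shell
# cartridges_needed DATA_TB [CAPACITY_TB]
# Number of cartridges for DATA_TB, rounded up. Defaults to the
# LTO-9 native capacity of 18 TB.
cartridges_needed() {
    data_tb=$1
    capacity_tb=${2:-18}
    echo $(( (data_tb + capacity_tb - 1) / capacity_tb ))
}

cartridges_needed 500    # 500 TB archive on LTO-9
```

A 500 TB archive needs 28 cartridges per copy under these assumptions; keeping a second copy off-site doubles that to 56.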

LTFS (Linear Tape File System) is an alternative that makes LTO tapes mountable as POSIX filesystems, eliminating the need for a tape management server for simple archive use cases:

# Format a tape with LTFS (erases any existing data on the cartridge)
mkltfs --device=/dev/st0

# Mount the tape
mkdir -p /mnt/tape
ltfs /mnt/tape -o devname=/dev/st0

# Copy data to tape
rsync -av /project/archive/2024/ /mnt/tape/archive_2024/

# Unmount
umount /mnt/tape

Cloud Object Storage with Rclone

For off-site backup and geographically distributed data protection, rclone supports S3-compatible object storage (AWS S3, MinIO, Ceph RGW, Wasabi):

# Configure rclone remote
rclone config create hpc-archive s3 \
  provider AWS \
  access_key_id AKIAIOSFODNN7EXAMPLE \
  secret_access_key wJalrXUtnFEMI/K7MDENG \
  region eu-west-1 \
  acl private

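# Sync archive tier to cloud (parallel transfers)
# Note: sync also deletes remote objects that no longer exist locally;
# use "rclone copy" or bucket versioning if deletions must not propagate.
rclone sync /project/archive hpc-archive:hpc-backup-archive \
  --transfers 16 \
  --s3-upload-concurrency 8 \
  --checksum \
  --log-level INFO \
  --log-file /var/log/rclone-archive.log

# Restore from cloud
rclone copy hpc-archive:hpc-backup-archive/2024/simulation-xyz /restore/simulation-xyz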

For very large data volumes, use lifecycle policies to tier S3 objects to S3 Glacier or Glacier Deep Archive for further cost reduction after 90 days.
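
As one possible shape for such a policy on AWS S3, the sketch below writes a lifecycle rule that transitions every object to Glacier Deep Archive after 90 days. The rule ID and the file location are hypothetical, the bucket name is the one from the rclone example, and other S3-compatible stores (MinIO, Ceph RGW, Wasabi) may support different storage classes or none at all.

```shell
# Write a lifecycle rule: move all objects to Glacier Deep Archive
# 90 days after creation. The file path and rule ID are illustrative.
LIFECYCLE_FILE="${LIFECYCLE_FILE:-${TMPDIR:-/tmp}/hpc-archive-lifecycle.json}"

cat > "$LIFECYCLE_FILE" <<'EOF'
{
  "Rules": [
    {
      "ID": "archive-to-deep-archive",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 90, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
EOF

# Apply with the AWS CLI (commented out here; requires credentials):
# aws s3api put-bucket-lifecycle-configuration \
#   --bucket hpc-backup-archive \
#   --lifecycle-configuration "file://${LIFECYCLE_FILE}"
```

Objects in Deep Archive have to be restored before they can be read, which takes hours, so restore time should be part of the recovery plan for this tier.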

Backup Verification

Backups that are never tested are not backups — they are hopes. Establish a regular verification schedule:

# Verify rsync backup integrity for a random project
PROJECT=$(ls /backup-nas/hpc-backups/latest | shuf -n 1)
# Dry run that compares file checksums and itemizes every difference.
# Files modified since the last backup run will legitimately appear here.
rsync -a --dry-run --checksum --itemize-changes \
  "/project/${PROJECT}/" \
  "/backup-nas/hpc-backups/latest/${PROJECT}/" \
  | tail -20

# Test restore of random files
# (trailing slash on latest/ so that find follows the symlink)
find /backup-nas/hpc-backups/latest/ -type f -name "*.h5" | shuf -n 5 |
while IFS= read -r f; do
    DEST=$(mktemp -d)
    cp "$f" "$DEST/"
    h5check "$DEST/$(basename "$f")" && echo "OK: $f" || echo "CORRUPT: $f"
    rm -rf "$DEST"
done
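
Format checkers such as h5check only cover file types they understand. A format-independent complement is a checksum manifest written at backup time and re-checked later. The following is a minimal sketch that assumes GNU coreutils and findutils; the function names are illustrative, and the manifest must be given as an absolute path outside the directory being hashed.

```shell
# write_manifest DIR MANIFEST
# Record SHA-256 checksums for every regular file under DIR.
write_manifest() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$2"
}

# verify_manifest DIR MANIFEST
# Re-hash the files under DIR and compare with the manifest.
# Prints only mismatches; returns non-zero if any file differs or is missing.
verify_manifest() {
    ( cd "$1" && sha256sum --quiet -c "$2" )
}
```

Running `write_manifest` against the source right before a tape or cloud upload, and `verify_manifest` against a test restore, detects silent corruption anywhere along the path.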

Retention Policies

| Data Type | Hot Copy | Warm Copy | Archive |
| --- | --- | --- | --- |
| Home directories | 7 daily snapshots | 4 weekly snapshots | 12 monthly snapshots |
| Project active data | 7 daily | 4 weekly | 2 yearly |
| Final results | 7 daily | Indefinite | Indefinite on tape |
| Raw experimental data | 7 daily | Indefinite | Indefinite on tape + cloud |
| Published dataset | N/A | 7 years | Indefinite |
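
For the rsync snapshot directories, the daily part of such a policy can be enforced with a small filter. This sketch handles only the "keep the newest N daily snapshots" rule; weekly and monthly tiers would need additional date logic. The function name `snapshots_to_prune` is illustrative, and snapshot names are assumed to be ISO dates (YYYY-MM-DD), which sort chronologically as plain text.

```shell
# snapshots_to_prune KEEP
# Reads snapshot names (YYYY-MM-DD, one per line) on stdin and prints
# every name except the newest KEEP entries.
snapshots_to_prune() {
    sort -r | tail -n +"$(( $1 + 1 ))"
}
```

A possible use is `ls /backup-nas/hpc-backups | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | snapshots_to_prune 7`, reviewing the output before deleting anything. Because snapshots share data through hard links, removing an old snapshot directory frees only the blocks that no newer snapshot still references.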

HPC backup is a risk management exercise: you are trading backup infrastructure cost against the probability and impact of data loss. The cost of losing irreplaceable research data almost always exceeds the cost of a well-designed backup system. Contact Mevasis for HPC backup architecture assessment and implementation.