Hybrid HPC Architecture Guide: Cloud Bursting, SLURM, Storage

HPC demand is rarely flat. Research projects generate compute peaks that can be 5–10× average utilization, and sizing on-premise hardware for peak demand wastes capital during the troughs. Hybrid HPC solves this by combining the predictable performance and cost of on-premise infrastructure with the elastic capacity of cloud computing — with SLURM managing both transparently.

Four Core Architecture Components

On-premise cluster: The steady-state workhorse. Handles continuous baseline workloads where InfiniBand latency, local NVMe storage, and fixed cost per compute hour matter. Tightly coupled MPI simulations that require < 2 µs inter-node latency must run here.

Cloud burst layer: Cloud compute instances that are provisioned on demand when the on-premise queue is saturated and automatically terminated when idle. Cost scales linearly with actual usage — zero cost when idle.

Unified storage: A shared storage layer accessible to both on-premise and cloud nodes. Three options: re-exporting the on-premise parallel filesystem via NFS/VPN, cloud-native object storage (S3-compatible), or a tiered hybrid. Storage choice is often the most constrained design decision.

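Central orchestration (SLURM): A single SLURM controller manages both on-premise and cloud nodes. Users submit jobs with normal sbatch commands. A job submitted to both partitions (for example --partition=onprem,cloud) starts in whichever partition can run it first, so placement on on-premise or cloud nodes is transparent to the user.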

SLURM Cloud Bursting Configuration

# /etc/slurm/slurm.conf — hybrid HPC configuration

# On-premise nodes: always running, never suspended
PartitionName=onprem \
  Nodes=cn[01-32] \
  Default=YES \
  MaxTime=INFINITE \
  State=UP

# Cloud partition: nodes start "powered down", provisioned on demand
PartitionName=cloud \
  Nodes=cloud[01-128] \
  MaxTime=08:00:00 \
  State=UP \
  OverSubscribe=NO

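# Cloud node definitions (initially powered down)
# RealMemory is in MB and must not exceed what the instance reports;
# a c6i.24xlarge has 96 vCPUs and 192 GiB, so leave headroom below that
NodeName=cloud[01-128] \
  CPUs=96 \
  RealMemory=190000 \
  State=CLOUD \
  Feature=cloud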

# Autoscaling hooks
ResumeProgram=/usr/local/sbin/slurm-cloud-resume.sh
SuspendProgram=/usr/local/sbin/slurm-cloud-suspend.sh
ResumeTimeout=360         # seconds until node must be ready
SuspendTime=300           # idle seconds before suspending node
SuspendExcNodes=cn[01-32] # never suspend on-premise nodes

The slurm-cloud-resume.sh script calls the cloud provider API to launch an instance from a pre-baked image. The image must include SLURM, MPI libraries, and application software already installed — no provisioning time at startup.

Example resume script for AWS:

#!/bin/bash
# /usr/local/sbin/slurm-cloud-resume.sh
# Called by SLURM with one argument: a hostlist expression of the nodes
# to resume (e.g. cloud[01-04]). scontrol expands it to one name per line.
set -euo pipefail

REGION="eu-west-1"
AMI_ID="ami-0abcdef1234567890"   # pre-baked image with all software
INSTANCE_TYPE="c6i.24xlarge"

for node in $(scontrol show hostnames "$1"); do
    # Get the IP that SLURM expects for this node
    # (pre-assigned in DNS or /etc/hosts)
    PRIVATE_IP=$(getent hosts "$node" | awk '{print $1}')

    aws ec2 run-instances \
        --region "$REGION" \
        --image-id "$AMI_ID" \
        --instance-type "$INSTANCE_TYPE" \
        --private-ip-address "$PRIVATE_IP" \
        --subnet-id subnet-burst-01 \
        --iam-instance-profile Name=hpc-burst-profile \
        --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$node}]"
done
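
The guide does not show the matching SuspendProgram. The following is a minimal sketch of one possible counterpart, written in Python with boto3 rather than as a shell script. It assumes instances carry the Name tag set by the resume script above and run in the same region; the function names (expand_hostlist, build_filters, terminate_nodes) are hypothetical, and pagination of describe_instances is omitted for brevity.

```python
#!/usr/bin/env python3
"""Hypothetical SuspendProgram sketch: terminate cloud nodes by Name tag."""
import subprocess
import sys

REGION = "eu-west-1"  # assumption: same region as the resume script


def expand_hostlist(hostlist):
    """Expand a Slurm hostlist expression (e.g. cloud[01-03]) via scontrol."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", hostlist],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def build_filters(node_names):
    """Build EC2 filters selecting live instances tagged with these names."""
    return [
        {"Name": "tag:Name", "Values": list(node_names)},
        {"Name": "instance-state-name", "Values": ["pending", "running"]},
    ]


def terminate_nodes(node_names):
    """Terminate the instances backing the given Slurm node names."""
    import boto3  # imported lazily so the helpers above work without it

    ec2 = boto3.client("ec2", region_name=REGION)
    response = ec2.describe_instances(Filters=build_filters(node_names))
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        ec2.terminate_instances(InstanceIds=instance_ids)
    return instance_ids


# Slurm calls SuspendProgram with a single hostlist argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    terminate_nodes(expand_hostlist(sys.argv[1]))
```

Selecting by tag rather than by private IP keeps the suspend path independent of DNS, at the cost of relying on the tag being set correctly at launch.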

Three Storage Strategies

Storage is the most challenging hybrid HPC design decision. Three patterns serve different workload characteristics:

Strategy 1: BeeGFS/Lustre with NFS Re-Export

The on-premise parallel filesystem is re-exported via NFS or SSHFS to cloud nodes over a VPN or dedicated interconnect (AWS Direct Connect, Azure ExpressRoute).

# On on-premise NFS gateway server:
# Export a directory from BeeGFS for cloud use
echo "/mnt/beegfs/cloud-shared  10.100.0.0/16(rw,sync,no_subtree_check)" >> /etc/exports
exportfs -ra

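# On each cloud node at boot (for example from the image's startup script or instance user data):
# timeo is in tenths of a second, so timeo=60 waits 6 s before a retry
mount -t nfs -o rw,soft,timeo=60 \
  10.0.1.100:/mnt/beegfs/cloud-shared \
  /shared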

Best for: Small to medium datasets (< 10 TB per job), latency-tolerant workloads, organizations with Direct Connect or ExpressRoute.

Limitation: WAN bandwidth (1–10 Gbps) is typically one to two orders of magnitude below local BeeGFS bandwidth (100+ Gbps). Data-intensive workloads will saturate the WAN link.

Strategy 2: S3-Compatible Object Storage

Large or infrequently updated datasets are staged to S3 (or compatible: MinIO, Ceph RGW, Wasabi) before cloud jobs run. Jobs read from S3 and write results to S3. No VPN required.

# Job startup: download data from S3 to local NVMe
import glob
import os
import subprocess

import boto3

s3 = boto3.client('s3', region_name='eu-west-1')
bucket = 'hpc-burst-data'
job_id = os.environ['SLURM_JOB_ID']

# Download input data (create the work directory first)
os.makedirs('/local/work', exist_ok=True)
s3.download_file(bucket, f'inputs/simulation_{job_id}.tar.gz', '/local/input.tar.gz')
# check=True aborts the job if extraction fails
subprocess.run(['tar', '-xzf', '/local/input.tar.gz', '-C', '/local/work/'], check=True)

# ... run simulation ...

# Upload results to S3 (regular files only)
for result_file in glob.glob('/local/work/results/*'):
    if os.path.isfile(result_file):
        s3.upload_file(result_file, bucket, f'results/{job_id}/{os.path.basename(result_file)}')

Best for: Large datasets, embarrassingly parallel workloads, workloads that can pre-stage data before compute starts.

Limitation: Transfer time to/from S3 must be budgeted as part of job walltime.
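
To make that budgeting concrete, here is a small sketch that turns an estimated compute time plus stage-in and stage-out volume into a value for sbatch --time. The function names, the sustained throughput figure, and the 20% safety margin are illustrative assumptions, not values from a measured system.

```python
import math


def staging_seconds(size_gb, throughput_gbit_s):
    """Seconds needed to move size_gb gigabytes at throughput_gbit_s Gbit/s."""
    return size_gb * 8 / throughput_gbit_s


def slurm_time(seconds):
    """Format a duration as HH:MM:SS, rounded up to a whole minute."""
    minutes = math.ceil(seconds / 60)
    return f"{minutes // 60:02d}:{minutes % 60:02d}:00"


def walltime_request(compute_s, input_gb, output_gb, throughput_gbit_s, margin=1.2):
    """Compute time plus stage-in and stage-out, padded by an assumed margin."""
    total = compute_s + staging_seconds(input_gb + output_gb, throughput_gbit_s)
    return slurm_time(total * margin)


# 4 h of compute, 500 GB in, 100 GB out, 5 Gbit/s sustained (all assumed)
print(walltime_request(4 * 3600, 500, 100, 5))  # prints 05:08:00
```

A result above the cloud partition's MaxTime (08:00:00 in the configuration earlier) is a signal to split the job or pre-stage its data.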

Strategy 3: Tiered Storage (Intelligent)

Hot data (active project files) lives on on-premise NVMe scratch. Cold data (previous results, reference datasets) lives in cloud object storage. A data management policy (dCache, Spectrum Scale HSM, or custom scripts) automatically migrates data between tiers based on access patterns.

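# Policy: move files not accessed in 7 days from scratch to S3
# /usr/local/sbin/tiering-policy.sh (run daily)
# Skip existing stubs; -print0 keeps unusual file names intact
find /mnt/scratch -type f -atime +7 ! -name '*.stub' -print0 |
while IFS= read -r -d '' file; do
    # Upload to object storage; copyto writes to the exact destination path
    if rclone copyto "$file" "s3:hpc-cold-tier/${file#/mnt/scratch/}" \
         --ignore-existing; then
        # Remove the local copy only after a successful upload,
        # then leave a stub that marks the file as migrated
        rm -f -- "$file"
        touch -- "$file.stub"
    else
        echo "upload failed, keeping $file" >&2
    fi
done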

Best for: Organizations with complex data workflows, where some data is genuinely hot and others genuinely cold.

Network Latency Management

Three strategies for managing the WAN latency between on-premise and cloud:

Dedicated interconnect: AWS Direct Connect, Azure ExpressRoute, or Google Cloud Interconnect provides consistent 1–5 ms latency and predictable bandwidth. Required for workloads that read on-premise storage from cloud nodes.

Workload locality analysis: Quantify the ratio of transfer time to compute time. If transferring 10 TB to the cloud takes 2 hours and the job runs for 10 hours, transfer adds 20% on top of compute time (about 17% of total wall time), which is acceptable. If transfer takes 8 hours and compute takes 2 hours, cloud bursting is counterproductive.
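
The ratio can be computed directly from the two durations. The sketch below uses the two scenarios from the paragraph above; the function names and the 25% cut-off in worth_bursting are hypothetical choices that each site would tune.

```python
def transfer_overhead(transfer_hours, compute_hours):
    """Return transfer time relative to compute time and to total wall time."""
    total = transfer_hours + compute_hours
    return {
        "vs_compute": transfer_hours / compute_hours,
        "of_total": transfer_hours / total,
    }


def worth_bursting(transfer_hours, compute_hours, max_fraction=0.25):
    """Assumed rule: burst only if transfer is a small share of wall time."""
    return transfer_overhead(transfer_hours, compute_hours)["of_total"] <= max_fraction


for transfer_h, compute_h in [(2, 10), (8, 2)]:
    ratios = transfer_overhead(transfer_h, compute_h)
    print(transfer_h, compute_h,
          round(ratios["vs_compute"], 2),
          round(ratios["of_total"], 2),
          worth_bursting(transfer_h, compute_h))
# prints:
# 2 10 0.2 0.17 True
# 8 2 4.0 0.8 False
```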

MPI workload separation: Tightly coupled MPI jobs (where every MPI synchronization crosses the WAN link) should never burst to cloud. Only loosely coupled workloads (independent replicas, parameter sweeps) should be routed to the cloud partition.

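# slurm.conf: features are node attributes, so set them on the NodeName
# lines (merge into the existing node definitions), not on the partitions
NodeName=cloud[01-128] Feature=cloud,loosely-coupled
# Jobs requiring InfiniBand stay on-premise: sbatch --constraint=infiniband
NodeName=cn[01-32] Feature=infiniband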

Implementation Phases

Phase 1 — Workload analysis: Profile existing jobs by type (MPI coupling, data size, WAN latency tolerance). Identify which jobs are burst candidates. Measure current peak-to-average utilization ratio to quantify burst potential.

Phase 2 — On-premise cluster optimization: Update SLURM for cloud partition support. Optimize network and storage for hybrid operation. Select the storage strategy based on workload data requirements.

Phase 3 — Cloud integration and automation: Set up cloud account, VPC/VNet, security groups. Build and test pre-baked images. Write and test ResumeProgram / SuspendProgram scripts. Configure monitoring to include cloud nodes. Set cost limits and alerts.

Phase 4 — Testing, validation, and go-live: Run end-to-end burst test with real workloads. Verify cloud node registration in SLURM, job execution, and automatic termination after idle timeout. Validate cost monitoring. Open cloud partition to users with documented guidelines on which jobs are burst-appropriate.
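
For the Phase 1 measurement, the peak-to-average ratio and the demand that exceeds on-premise capacity can be estimated from a series of demanded-core samples. This is a rough sketch: the sample values and function names are made up for illustration, and in practice the series would come from accounting or monitoring data (running plus pending cores per interval).

```python
def peak_to_average(samples):
    """Ratio of the highest demand sample to the mean demand."""
    if not samples:
        raise ValueError("need at least one sample")
    average = sum(samples) / len(samples)
    if average == 0:
        raise ValueError("average demand is zero")
    return max(samples) / average


def burst_core_hours(samples, onprem_cores, hours_per_sample=1.0):
    """Core-hours of demand above on-premise capacity (the burst potential)."""
    excess = sum(max(0, s - onprem_cores) for s in samples)
    return excess * hours_per_sample


# Hypothetical hourly demand in cores, against a 1024-core cluster
demand = [400, 600, 800, 3000, 200]
print(peak_to_average(demand))         # prints 3.0
print(burst_core_hours(demand, 1024))  # prints 1976.0
```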

Common Problems and Solutions

Node startup exceeds ResumeTimeout: Cloud instances take longer than expected to boot and register. Build more comprehensive images (no package installation at startup), use instance store NVMe for faster boot, or increase ResumeTimeout if boot times consistently run just over the limit.

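NFS mount hangs on WAN disconnect: When the VPN or Direct Connect to on-premise NFS drops, cloud compute nodes with hard NFS mounts hang indefinitely. Use the soft and timeo=60 NFS mount options so that requests time out instead of blocking forever (timeo is in tenths of a second, so 60 means 6 s per retry). The intr option is accepted but ignored on Linux kernels after 2.6.25, where only SIGKILL can interrupt a pending NFS operation. Soft mounts can return I/O errors to applications, so jobs should check their writes.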

Uncontrolled cloud spend: A cloud partition with an unlimited node count (MaxCount) and no cost controls can generate unexpected bills. Always set:

SLURM GrpTRESMins (set per account or user with sacctmgr) to cap the total CPU-minutes that can be consumed
Cloud provider budget alerts at 50%, 80%, 100% of monthly budget
Instance tag policies for cost attribution per SLURM job ID
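
As an illustration of the first two controls, the sketch below converts a core-hour budget into the cpu=<minutes> form that GrpTRESMins expects and reports which alert thresholds a given spend has crossed. The function names and figures are hypothetical; real alerts would normally be configured in the cloud provider's budgeting service.

```python
def grp_tres_mins(core_hours):
    """Express a core-hour budget as a GrpTRESMins value (CPU-minutes)."""
    return f"cpu={int(core_hours * 60)}"


def crossed_thresholds(spend, budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget fractions that the current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [t for t in thresholds if spend >= t * budget]


print(grp_tres_mins(10000))           # prints cpu=600000
print(crossed_thresholds(850, 1000))  # prints [0.5, 0.8]
```

The first value could then be applied with a command such as sacctmgr modify account name=burst set GrpTRESMins=cpu=600000, where the account name burst is an assumption.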

MPI jobs failing on cloud nodes: Cloud instances may use different network interface names than expected by OpenMPI. Set --mca btl_tcp_if_include eth0 (or the correct interface) in the mpirun command, or configure via ~/.openmpi/mca-params.conf.

Best Practices

Set MaxTime on the cloud partition (e.g., 8 hours). Jobs that run indefinitely in the cloud will generate unexpectedly large bills.
Keep cloud node images identical to on-premise compute node images. Version skew causes subtle job failures.
Display cost in Grafana alongside performance metrics, normalized to USD per core-hour, so researchers can see their burst spending.
Pre-stage large input files to cloud object storage before peak periods (not during the burst).

Hybrid HPC makes the on-premise/cloud choice a policy decision rather than a hardware decision. For hybrid HPC architecture design, SLURM cloud integration, and storage strategy consulting, contact Mevasis.
