Cloud Bursting for HPC: Architecture, SLURM Configuration, and Cost Control
How to implement cloud bursting for HPC clusters: SLURM scheduler configuration, network connectivity options, spot/preemptible instances, and integration with AWS, Azure, and Google Cloud.
Research computing demand is inherently bursty. A typical HPC cluster runs at 60–70% average utilization punctuated by episodes where every user simultaneously needs maximum resources. Purchasing hardware to cover peak demand leaves expensive capacity idle the rest of the time. Cloud bursting solves this by extending the on-premise cluster into the cloud only when demand exceeds local capacity.
What Is Cloud Bursting?
Cloud bursting is the practice of running HPC jobs on dynamically provisioned cloud instances when the on-premise cluster queue is full. From the user perspective, nothing changes: jobs are submitted to SLURM with the same sbatch commands. The scheduler transparently routes excess demand to cloud-provisioned nodes.
The on-premise cluster handles the steady-state workload, where InfiniBand interconnects and local NVMe storage deliver full HPC performance. Cloud nodes absorb burst demand for loosely-coupled workloads that tolerate higher inter-node latency and shared network storage.
SLURM Configuration for Cloud Bursting
SLURM’s burst capability relies on the ResumeProgram and SuspendProgram hooks. Define a cloud partition separate from the on-premise partition:
# /etc/slurm/slurm.conf
# On-premise nodes — always running
PartitionName=onprem Nodes=cn[01-32] Default=YES MaxTime=INFINITE State=UP
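# Cloud nodes: declared with State=CLOUD so they exist only while launched on demand
NodeName=cloud[01-128] State=CLOUD # add CPUs= and RealMemory= to match the instance type
PartitionName=cloud Nodes=cloud[01-128] MaxTime=08:00:00 State=UP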
# Autoscaling hooks
ResumeProgram=/usr/local/sbin/resume-cloud-node.sh
SuspendProgram=/usr/local/sbin/suspend-cloud-node.sh
ResumeTimeout=360 # 6 minutes for cloud instance boot
SuspendTime=300 # idle time before terminating node
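SLURM passes the names of the nodes to start to resume-cloud-node.sh as a hostlist argument; the script calls the cloud API to launch a pre-configured instance for each one. The suspend-cloud-node.sh script terminates them. Each node must register with SLURM before ResumeTimeout expires, otherwise SLURM marks it down, so pre-baked images (AMIs or the equivalent) with all software pre-installed are essential.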
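As a rough sketch of the first thing a resume program has to do, the Python below expands the hostlist it receives and calls a launch function once per node. The `expand_hostlist` helper is a simplified, hypothetical stand-in for `scontrol show hostnames`, and `launch` is a placeholder for whatever cloud API call your site uses; neither comes from the original scripts.

```python
import re


def expand_hostlist(hostlist):
    """Expand a simple Slurm hostlist such as 'cloud[01-03,07]' into node names.

    Simplified stand-in for `scontrol show hostnames`: handles comma-separated
    names with at most one bracket group each, nothing more exotic.
    """
    names = []
    # Split on commas that are not inside a bracket group.
    for part in re.split(r",(?![^\[]*\])", hostlist):
        match = re.fullmatch(r"([^\[\]]+)\[([^\]]+)\]", part)
        if not match:
            names.append(part)
            continue
        prefix, ranges = match.groups()
        for item in ranges.split(","):
            lo, _, hi = item.partition("-")
            hi = hi or lo
            width = len(lo)  # preserve zero padding, e.g. 01 stays 01
            for n in range(int(lo), int(hi) + 1):
                names.append(f"{prefix}{n:0{width}d}")
    return names


def resume(hostlist, launch):
    """Call launch(node_name) for every node Slurm asked to resume."""
    nodes = expand_hostlist(hostlist)
    for node in nodes:
        launch(node)  # hypothetical hook: start one cloud instance named `node`
    return nodes
```

In a real script, `launch` would wrap the provider SDK or CLI and tag the instance with the node name so the suspend program can find it again.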
Network Connectivity
Three connectivity patterns are used in practice:
IPsec VPN: Lower cost, higher latency (~5–20 ms). Suitable for loosely-coupled jobs (embarrassingly parallel, parameter sweeps). Inadequate for tight MPI workloads.
Dedicated interconnect (AWS Direct Connect / Azure ExpressRoute): Lower and more consistent latency (~1–3 ms, depending on distance to the provider's edge location), predictable bandwidth. Required when cloud nodes need to read from on-premise parallel storage. Higher monthly cost.
Cloud-native storage only: The cloud burst jobs read input data from S3-compatible object storage and write results back to object storage. No dedicated interconnect needed. Suitable for large-scale, data-independent simulations.
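The three patterns above can be read as a small decision rule. The sketch below is one possible encoding of it, not a definitive policy; the function name, its parameters, and the "keep on-premise" outcome for tightly coupled jobs are illustrative assumptions.

```python
def choose_connectivity(tightly_coupled, reads_onprem_storage, data_in_object_storage):
    """Map workload traits to one of the connectivity patterns described above.

    All three arguments are booleans describing the burst workload.
    """
    if tightly_coupled:
        # Assumption: tight MPI jobs stay local rather than spanning the WAN link.
        return "keep on-premise"
    if reads_onprem_storage:
        return "dedicated interconnect"
    if data_in_object_storage:
        return "cloud-native storage only"
    return "IPsec VPN"
```

For example, a parameter sweep whose inputs are already staged in object storage maps to "cloud-native storage only", which avoids paying for a dedicated link.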
Spot and Preemptible Instances
Spot Instances (AWS), Spot VMs (Azure), and Spot or Preemptible VMs (GCP) can cut the cost of cloud instances to roughly 10–30% of the on-demand price. The trade-off is that the provider can reclaim the instance at short notice: about 2 minutes on AWS and about 30 seconds on Azure and GCP.
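To make the trade-off concrete, the following sketch estimates the cost of a burst when some work has to be repeated after preemptions. The formula and every number in the usage note are illustrative assumptions, not measurements.

```python
def effective_spot_cost(on_demand_hourly, spot_fraction, rerun_fraction, node_hours):
    """Estimate the cost of running `node_hours` of useful work on spot capacity.

    spot_fraction  -- spot price as a fraction of on-demand (e.g. 0.3 for 30%)
    rerun_fraction -- extra work repeated because of preemptions (e.g. 0.2 for 20%)
    """
    billed_hours = node_hours * (1 + rerun_fraction)
    return on_demand_hourly * spot_fraction * billed_hours
```

With a hypothetical $2.00/hour on-demand price, a spot price at 30% of that, and 20% of the work repeated, 100 node-hours would cost about $72 instead of $200, so spot remains cheaper as long as checkpointing keeps the repeated fraction modest.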
To handle preemption gracefully:
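# slurm.conf: requeue batch jobs when their node fails (for example, a reclaimed Spot instance)
JobRequeue=1
InactiveLimit=0
# Optional: also let SLURM's own scheduler preempt cloud jobs by requeueing them.
# Add PreemptMode to the existing cloud partition line instead of defining the partition twice;
# this also needs a cluster-wide PreemptType (for example preempt/partition_prio).
PartitionName=cloud Nodes=cloud[01-128] MaxTime=08:00:00 State=UP PreemptMode=REQUEUE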
Applications that write periodic checkpoints can survive preemption by restarting from the last checkpoint. DMTCP (Distributed MultiThreaded CheckPointing) or application-native checkpointing (NAMD, OpenFOAM) are appropriate for long-running burst jobs.
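For applications without built-in checkpointing, the pattern can be sketched in a few lines of Python. This is a minimal illustration, assuming the job receives SIGTERM shortly before the instance disappears; the function names, the JSON state layout, and the placeholder "work" are all hypothetical.

```python
import json
import os
import signal


def install_sigterm_flag():
    """Return a callable that reports whether SIGTERM has been received."""
    flag = {"stop": False}

    def handler(signum, frame):
        flag["stop"] = True

    signal.signal(signal.SIGTERM, handler)
    return lambda: flag["stop"]


def save_checkpoint(path, state):
    tmp = f"{path}.tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)  # atomic rename: never leaves a half-written checkpoint


def load_checkpoint(path):
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as fh:
        return json.load(fh)


def run(total_steps, path, should_stop=lambda: False, checkpoint_every=10):
    """Run up to total_steps, resuming from `path` if a checkpoint exists.

    Returns the result, or None if the run stopped early after checkpointing.
    """
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["total"] += state["step"]  # placeholder for one unit of real work
        state["step"] += 1
        stopping = should_stop()
        if stopping or state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stopping:
            return None  # exit cleanly; the requeued job resumes from `path`
    return state["total"]
```

A job script would call `run(total_steps, path, should_stop=install_sigterm_flag())` with the checkpoint on storage that outlives the node, so the requeued job finds it when it restarts elsewhere.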
AWS ParallelCluster Integration
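Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: burst
      CapacityType: SPOT # required for SpotPrice to take effect
      ComputeResources:
        - Name: hpc6a-48xlarge
          # Verify that the chosen type is offered as Spot in your Region;
          # HPC-family instance types may be On-Demand only.
          InstanceType: hpc6a.48xlarge
          MinCount: 0
          MaxCount: 64
          SpotPrice: 0.60 # maximum Spot price in $/hr
      Networking:
        SubnetIds:
          - subnet-burst-az1
          - subnet-burst-az2
Image:
  Os: alinux2
  CustomAmi: ami-0abcdef1234567890 # pre-baked AMI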
Azure CycleCloud Integration
# cluster template excerpt
[[nodearray burst]]
    MachineType = Standard_HB120rs_v3
    Azure.MaxScaleSetSize = 200
    InitialCount = 0
    MaxCount = 64
    Interruptible = true # Spot instances
    MaxPrice = 1.50 # $/hour ceiling
    [[[configuration]]]
    slurm.autoscale = true
    slurm.partition = cloud
Google Cloud HPC Toolkit
- id: slurm_partition_burst
  source: community/modules/scheduler/SchedMD-slurm-on-gcp-partition
  settings:
    partition_name: cloud
    machine_type: c2-standard-60
    max_node_count: 48
    enable_spot_vm: true
    spot_instance_config:
      termination_action: STOP
Cost Control Mechanisms
Uncontrolled cloud bursting can generate unexpected bills. Implement multiple guardrails:
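# SLURM: cap per-job size, walltime, and total usage with a QOS.
# GrpTRESMins is measured in CPU-minutes, so cpu=100000 is roughly 1,667 CPU-hours.
sacctmgr add qos cloud_burst \
    MaxTRESPerJob=cpu=480 \
    MaxWall=08:00:00 \
    Priority=5 \
    GrpTRESMins=cpu=100000
# slurm.conf: attach the QOS to the cloud partition (append QOS=cloud_burst to its
# PartitionName line) and enforce limits with AccountingStorageEnforce=limits,qos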
SLURM limits cap usage, not spend, so also configure cloud-native budget alerts: AWS Budgets, Azure Cost Management budgets, or Google Cloud Billing budgets. Set alerts at 50%, 80%, and 100% of the monthly budget.
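The same thresholds can also be checked from a site-side script, for instance one that reads month-to-date spend from a billing export. The sketch below shows only the arithmetic; the function names and the price used in the usage note are assumptions for illustration.

```python
ALERT_THRESHOLDS = (0.5, 0.8, 1.0)  # 50%, 80%, 100% of the monthly budget


def crossed_thresholds(spend, budget, thresholds=ALERT_THRESHOLDS):
    """Return the alert thresholds that month-to-date spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spend / budget
    return [t for t in thresholds if used >= t]


def cpu_minutes_to_cost(cpu_minutes, price_per_cpu_hour):
    """Translate a SLURM CPU-minute limit into an approximate spend ceiling."""
    return cpu_minutes / 60 * price_per_cpu_hour
```

At a hypothetical $0.03 per CPU-hour, a GrpTRESMins limit of 100000 CPU-minutes corresponds to about $50 of compute, which helps in choosing a SLURM limit that agrees with the cloud budget.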
Workload Selection
Not all HPC jobs are suitable for cloud bursting:
| Workload Type | Cloud Burst Suitable? | Reason |
|---|---|---|
| Embarrassingly parallel | Yes | No inter-node communication |
| Monte Carlo simulations | Yes | Independent replicas |
| Parameter sweeps | Yes | Tasks are independent |
| Tight MPI simulations | Limited | Requires low-latency interconnect |
| Large memory jobs (> 512 GB) | Depends | Cloud instance memory limits |
| Licensed software | Check | Per-socket licensing may be expensive |
Monitoring Burst Activity
# Show cloud nodes and their status
sinfo -p cloud -o "%N %T %C"
# Show pending jobs that will trigger burst
squeue --state=PD -p cloud --format="%.10i %R"
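# Usage report: jobs run on the cloud partition in the last 7 days
# (AccountUtilizationByUser has no partition filter, and Cluster= selects a SLURM cluster name)
sacct --allusers --allocations --partition=cloud \
    --starttime=$(date -d "7 days ago" +%Y-%m-%d) \
    --format=JobID,User,Account,Elapsed,AllocCPUS,CPUTimeRAW,State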
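For dashboards or cron-based checks, the sinfo output above is easy to post-process. The sketch below, with hypothetical function names, totals CPUs per node state from `%N %T %C` output and counts the CPUs on nodes that are not powered down. It assumes that `%C` prints allocated/idle/other/total and that a trailing `~` on the state marks a powered-down cloud node.

```python
def summarize_sinfo(text):
    """Aggregate CPU counts per node state from `sinfo -o "%N %T %C"` output."""
    summary = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        _nodes, state, cpus = fields
        try:
            alloc, idle, other, total = (int(x) for x in cpus.split("/"))
        except ValueError:
            continue  # header line or unexpected format
        entry = summary.setdefault(
            state, {"allocated": 0, "idle": 0, "other": 0, "total": 0}
        )
        entry["allocated"] += alloc
        entry["idle"] += idle
        entry["other"] += other
        entry["total"] += total
    return summary


def powered_up_cpus(summary):
    """Total CPUs on nodes that are not powered down (state not ending in '~')."""
    return sum(v["total"] for state, v in summary.items() if not state.endswith("~"))
```

Feeding it the captured output of the sinfo command gives a quick figure for how much cloud capacity is currently running, and therefore being billed.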
Cloud bursting extends the return on your on-premise HPC investment while eliminating the need to provision for peak demand. The key design decisions (network connectivity, storage strategy, spot instance policy, and workload selection) determine both the performance and the cost of burst operations.