Cloud Bursting for HPC: Architecture, SLURM Configuration, and Cost Control

How to implement cloud bursting for HPC clusters: SLURM scheduler configuration, network connectivity options, spot/preemptible instances, and integration with AWS, Azure, and Google Cloud.

Research computing demand is inherently bursty. A typical HPC cluster runs at 60–70% average utilization punctuated by episodes where every user simultaneously needs maximum resources. Purchasing hardware to cover peak demand leaves expensive capacity idle the rest of the time. Cloud bursting solves this by extending the on-premise cluster into the cloud only when demand exceeds local capacity.

What Is Cloud Bursting?

Cloud bursting is the practice of running HPC jobs on dynamically provisioned cloud instances when the on-premise cluster queue is full. From the user's perspective, little changes: jobs are submitted to SLURM with the same sbatch commands, and the scheduler routes excess demand to cloud-provisioned nodes.

The on-premise cluster handles the steady-state workload, where InfiniBand interconnects and local NVMe storage deliver full HPC performance. Cloud nodes absorb burst demand from loosely coupled workloads that tolerate higher inter-node latency and shared network storage.

SLURM Configuration for Cloud Bursting

SLURM’s burst capability relies on the ResumeProgram and SuspendProgram hooks. Define a cloud partition separate from the on-premise partition:

# /etc/slurm/slurm.conf

# On-premise nodes — always running
PartitionName=onprem Nodes=cn[01-32] Default=YES MaxTime=INFINITE State=UP

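# Cloud nodes: State=CLOUD means a node exists only after ResumeProgram launches it
# (add CPUs= and RealMemory= to match your instance type)
NodeName=cloud[01-128] State=CLOUD
PartitionName=cloud Nodes=cloud[01-128] MaxTime=08:00:00 State=UP

# Autoscaling hooks
ResumeProgram=/usr/local/sbin/resume-cloud-node.sh
SuspendProgram=/usr/local/sbin/suspend-cloud-node.sh
ResumeTimeout=360        # seconds allowed for instance boot and slurmd registration
SuspendTime=300          # seconds a node may sit idle before SuspendProgram runs
SuspendExcParts=onprem   # never power down the on-premise nodes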

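SLURM invokes resume-cloud-node.sh with a hostlist expression (for example cloud[01-04]) as its first argument; the script calls the cloud API to launch a pre-configured instance for each node. suspend-cloud-node.sh receives the same kind of argument and terminates the instances. Each node must start slurmd and register with the controller before ResumeTimeout expires, otherwise SLURM marks it DOWN, so pre-baked images (AMIs or their equivalents) with all software installed are essential.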
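As an illustration of what a resume hook has to do with its argument, here is a minimal Python sketch that expands a simple hostlist expression and calls a launcher once per node. The function names and the `launch` callable are hypothetical; a production script would normally let `scontrol show hostnames` do the expansion and would call the provider's SDK or CLI in place of `launch`.

```python
import re
from typing import Callable, List


def expand_hostlist(expr: str) -> List[str]:
    """Expand a simple hostlist such as 'cloud[01-03,07],cn05'.

    Only one bracket group per name is handled; anything more exotic
    should be delegated to `scontrol show hostnames`.
    """
    hosts: List[str] = []
    # Split on commas that are not inside a bracket group.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", expr):
        match = re.fullmatch(r"([^\[]+)\[([^\]]+)\]", part)
        if not match:
            hosts.append(part)
            continue
        prefix, ranges = match.groups()
        for item in ranges.split(","):
            if "-" in item:
                lo, hi = item.split("-")
                width = len(lo)  # preserve zero padding, e.g. 01
                hosts.extend(
                    f"{prefix}{n:0{width}d}" for n in range(int(lo), int(hi) + 1)
                )
            else:
                hosts.append(f"{prefix}{item}")
    return hosts


def resume_nodes(expr: str, launch: Callable[[str], None]) -> List[str]:
    """Call `launch` (a hypothetical cloud-API wrapper) once per node name."""
    hosts = expand_hostlist(expr)
    for host in hosts:
        launch(host)
    return hosts
```

Naming each instance after its SLURM node name keeps the mapping between scheduler state and cloud resources easy to audit when a node fails to register in time.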

Network Connectivity

Three connectivity patterns are used in practice:

IPsec VPN: Lower cost, higher latency (~5–20 ms). Suitable for loosely-coupled jobs (embarrassingly parallel, parameter sweeps). Inadequate for tight MPI workloads.

Dedicated interconnect (AWS Direct Connect / Azure ExpressRoute / Google Cloud Interconnect): Consistent low latency (~1–3 ms) and predictable bandwidth. Required when cloud nodes need to read from on-premise parallel storage. Higher monthly cost.

Cloud-native storage only: The cloud burst jobs read input data from S3-compatible object storage and write results back to object storage. No dedicated interconnect needed. Suitable for large-scale, data-independent simulations.
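When choosing between these patterns, it helps to estimate how long input data would take to cross the link. The sketch below is plain arithmetic; the dataset size, link speeds, and the 70% efficiency factor are illustrative assumptions, not measurements.

```python
def staging_seconds(dataset_gb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Seconds needed to move `dataset_gb` gigabytes over a `link_gbps` link.

    `efficiency` is an assumed fraction of nominal bandwidth actually achieved
    (protocol overhead, encryption, contention).
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    bits = dataset_gb * 8e9  # decimal gigabytes to bits
    return bits / (link_gbps * 1e9 * efficiency)


# Hypothetical comparison: 500 GB of input over a 1 Gbit/s VPN
# versus a 10 Gbit/s dedicated link.
vpn_hours = staging_seconds(500, 1.0) / 3600
dedicated_hours = staging_seconds(500, 10.0) / 3600
```

If staging takes longer than the job would have waited in the on-premise queue, bursting that job is unlikely to pay off unless the data already lives in object storage.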

Spot and Preemptible Instances

Compute costs can drop to roughly 10–30% of the on-demand price with Spot Instances (AWS), Spot VMs (Azure), or Spot/Preemptible VMs (GCP). The trade-off is that the provider can reclaim the instance at short notice: 2 minutes on AWS, and as little as 30 seconds on Azure and GCP.

To handle preemption gracefully:

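# slurm.conf: requeue batch jobs whose node fails (e.g. a reclaimed spot instance)
JobRequeue=1
InactiveLimit=0          # default: never kill jobs over an unresponsive srun/salloc

# Optional: also let higher-priority partitions preempt cloud jobs by requeueing them.
# PreemptMode governs scheduler-side preemption, not provider reclaims. Add it to the
# existing cloud partition line instead of defining the partition a second time.
PreemptType=preempt/partition_prio
PartitionName=cloud Nodes=cloud[01-128] MaxTime=08:00:00 State=UP PreemptMode=REQUEUE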

Applications that write periodic checkpoints can survive preemption by restarting from the last checkpoint. DMTCP (Distributed MultiThreaded CheckPointing) or application-native checkpointing (NAMD, OpenFOAM) are appropriate for long-running burst jobs.
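For applications without native checkpointing, the pattern can be sketched in a few lines of Python. This is a minimal illustration, not a substitute for DMTCP: the state layout, file name, and checkpoint interval are assumptions, and the SIGTERM handler only helps when the job receives a signal before the node disappears.

```python
import json
import os
import signal
import tempfile


class Checkpointer:
    """Persist a small JSON state so a requeued job can resume where it stopped."""

    def __init__(self, path: str):
        self.path = path
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None):
        # Signal handler: ask the main loop to checkpoint and exit cleanly.
        self.stop_requested = True

    def load(self) -> dict:
        if os.path.exists(self.path):
            with open(self.path) as fh:
                return json.load(fh)
        return {"step": 0, "total": 0}

    def save(self, state: dict) -> None:
        # Write to a temp file, then rename atomically, so a kill mid-write
        # never leaves a truncated checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as fh:
            json.dump(state, fh)
        os.replace(tmp, self.path)


def run(ckpt: Checkpointer, n_steps: int, every: int = 10) -> dict:
    """Resume from the last checkpoint and advance up to `n_steps`."""
    state = ckpt.load()
    while state["step"] < n_steps and not ckpt.stop_requested:
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % every == 0:
            ckpt.save(state)
    ckpt.save(state)
    return state


def install_sigterm_handler(ckpt: Checkpointer) -> None:
    """Register the stop request for SIGTERM (call once at job start)."""
    signal.signal(signal.SIGTERM, ckpt.request_stop)
```

The checkpoint file should live on storage that outlives the instance (shared file system or object storage), since local disks are lost when a spot node is reclaimed.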

AWS ParallelCluster Integration

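# ParallelCluster 3 cluster configuration excerpt
Image:
  Os: alinux2                             # Os is set once, at the top level
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: burst
      CapacityType: SPOT                  # needed for SpotPrice to take effect
      ComputeResources:
        - Name: hpc6a-48xlarge
          InstanceType: hpc6a.48xlarge    # verify this type is offered on Spot in your region
          MinCount: 0
          MaxCount: 64
          SpotPrice: 0.60                 # maximum Spot price in $/hr
      Networking:
        SubnetIds:
          - subnet-burst-az1
          - subnet-burst-az2
      Image:
        CustomAmi: ami-0abcdef1234567890  # pre-baked AMI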

Azure CycleCloud Integration

# cluster template excerpt
[[nodearray burst]]
  MachineType = Standard_HB120rs_v3
  Azure.MaxScaleSetSize = 200
  InitialCount = 0
  MaxCount = 64
  Interruptible = true           # Spot instances
  MaxPrice = 1.50                # $/hour ceiling

  [[[configuration]]]
    slurm.autoscale = true
    slurm.partition = cloud

Google Cloud HPC Toolkit

- id: slurm_partition_burst
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  settings:
    partition_name: cloud
    machine_type: c2-standard-60
    max_node_count: 48
    enable_spot_vm: true
    spot_instance_config:
      termination_action: STOP

Cost Control Mechanisms

Uncontrolled cloud bursting can generate unexpected bills. Implement multiple guardrails:

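# SLURM: create a QOS that caps per-job size and wall time on cloud nodes
sacctmgr add qos cloud_burst \
  MaxTRESPerJob=cpu=480 \
  MaxWall=08:00:00 \
  Priority=5

# SLURM: cap total usage under that QOS. GrpTRESMins is in CPU-minutes,
# so 100000 is about 1,667 CPU-hours. (sacctmgr has no "partition" entity.)
sacctmgr modify qos cloud_burst set GrpTRESMins=cpu=100000

# slurm.conf: attach the QOS to the cloud partition and enforce limits
#   PartitionName=cloud ... QOS=cloud_burst
#   AccountingStorageEnforce=limits,qos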

In addition, configure cloud-native budget alerts (AWS Budgets, Azure Cost Management, or Google Cloud Billing budgets) at 50%, 80%, and 100% of the monthly budget.
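To keep the scheduler limit and the billing budget in agreement, the CPU-minute cap can be derived from the monthly budget. A small sketch of the arithmetic follows; the budget, hourly price, and core count are hypothetical inputs that you would replace with your own figures.

```python
def cpu_minutes_for_budget(
    monthly_budget_usd: float,
    usd_per_instance_hour: float,
    cores_per_instance: int,
) -> int:
    """CPU-minutes that a monthly budget buys at a given instance price.

    The result is a candidate value for a GrpTRESMins=cpu=<N> limit.
    Storage, data egress, and interconnect charges are not included.
    """
    if usd_per_instance_hour <= 0 or cores_per_instance <= 0:
        raise ValueError("price and core count must be positive")
    cpu_hours = monthly_budget_usd * cores_per_instance / usd_per_instance_hour
    return round(cpu_hours * 60)


# Hypothetical example: $5,000/month, $0.60 per instance-hour, 96 cores per instance.
limit = cpu_minutes_for_budget(5000, 0.60, 96)
print(f"sacctmgr modify qos cloud_burst set GrpTRESMins=cpu={limit}")
```

Because spot prices fluctuate, it is safer to compute the limit from the price ceiling than from the average price actually paid.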

Workload Selection

Not all HPC jobs are suitable for cloud bursting:

| Workload Type | Cloud Burst Suitable? | Reason |
| --- | --- | --- |
| Embarrassingly parallel | Yes | No inter-node communication |
| Monte Carlo simulations | Yes | Independent replicas |
| Parameter sweeps | Yes | Tasks are independent |
| Tightly coupled MPI simulations | Limited | Requires low-latency interconnect |
| Large memory jobs (> 512 GB) | Depends | Cloud instance memory limits |
| Licensed software | Check | Per-socket licensing may be expensive |

Monitoring Burst Activity

# Show cloud nodes and their status
sinfo -p cloud -o "%N %T %C"

# Show pending jobs that will trigger burst
squeue --state=PD -p cloud --format="%.10i %R"

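# Usage report: jobs run on the cloud partition in the last 7 days
# (sreport's Cluster= option filters by cluster name, not partition, so use sacct)
sacct --allusers --allocations --partition=cloud \
  --starttime=$(date -d "7 days ago" +%Y-%m-%d) \
  --format=JobID,User,Account,AllocCPUS,Elapsed,CPUTimeRAW,State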
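Accounting output can be turned into a rough per-user cost estimate. The sketch below assumes the text was produced by `sacct --parsable2 --noheader --format=User,CPUTimeRAW` for the cloud partition, where CPUTimeRAW is in CPU-seconds; the price per CPU-hour is a hypothetical input, and the function name is illustrative.

```python
from collections import defaultdict
from typing import Dict


def cost_by_user(sacct_output: str, usd_per_cpu_hour: float) -> Dict[str, float]:
    """Sum CPU-seconds per user from 'user|cpu_seconds' lines and price them."""
    cpu_seconds: Dict[str, int] = defaultdict(int)
    for line in sacct_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, raw = line.split("|")[:2]
        cpu_seconds[user] += int(raw)
    return {
        user: seconds / 3600 * usd_per_cpu_hour
        for user, seconds in cpu_seconds.items()
    }


# Hypothetical sample in the assumed format.
sample = "alice|7200\nbob|3600\nalice|3600\n"
report = cost_by_user(sample, usd_per_cpu_hour=0.01)
```

This counts compute time only; comparing it with the provider's billing export shows how much storage, egress, and idle boot time add on top.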

Cloud bursting extends the return on your on-premise HPC investment while eliminating the need to provision for peak demand. The key design decisions — network connectivity, storage strategy, spot instance policy, and workload selection — determine both the performance and the cost of burst operations. Contact Mevasis for a cloud bursting architecture review tailored to your workload profile.