HPC Autoscaling: Dynamic Node Management with SLURM and Cloud Platforms
How to configure SLURM autoscaling for HPC clusters using AWS ParallelCluster, Azure CycleCloud, and Google Cloud HPC Toolkit. Covers ResumeProgram, SuspendProgram, and cloud integration.
Static HPC cluster sizing always involves a trade-off: provision for peak load and you pay for idle capacity most of the time; provision for average load and users wait during peak periods. Autoscaling resolves this dilemma by dynamically adding and removing compute nodes based on actual queue demand.
How SLURM Autoscaling Works
SLURM’s autoscaling builds on its power-saving mechanism, which is driven by two hook scripts: ResumeProgram and SuspendProgram. When the scheduler assigns a pending job to nodes that are currently powered down, SLURM calls ResumeProgram with the list of nodes to bring online. When a node has been idle for longer than SuspendTime seconds, SLURM calls SuspendProgram, which in a cloud setup terminates the instance.
# /etc/slurm/slurm.conf — autoscaling parameters
ResumeProgram=/usr/local/sbin/slurm-resume.sh
SuspendProgram=/usr/local/sbin/slurm-suspend.sh
ResumeTimeout=300 # seconds to wait for node to become ready
SuspendTime=120 # seconds of idle time before suspending
SuspendExcNodes=cn[01-04] # nodes to never suspend (always-on baseline)
The slurm-resume.sh script calls the cloud provider API to launch instances using a pre-configured image. The slurm-suspend.sh script terminates them. The cloud provider and instance type are determined at design time; only the node count varies at runtime.
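To make the hook contract concrete, here is a minimal sketch of a resume hook written in Python rather than shell. SLURM passes the nodes to start as a single hostlist expression in the first argument. The `expand_hostlist` helper handles only the simple forms shown and is an illustration; real scripts usually run `scontrol show hostnames` instead. The `launch_instance` function is a hypothetical placeholder for the provider API call.

```python
#!/usr/bin/env python3
"""Sketch of a ResumeProgram: expand SLURM's hostlist, then launch each node."""
import re
import sys


def expand_hostlist(expr):
    """Expand a simple hostlist such as 'cn[05-07,10],gpu1' into node names.

    Handles at most one bracket group per name and no suffix after it.
    `scontrol show hostnames <expr>` covers the full syntax.
    """
    names = []
    for prefix, ranges in re.findall(r"([^,\[\]]+)(?:\[([^\]]+)\])?", expr):
        if not ranges:
            names.append(prefix)
            continue
        for item in ranges.split(","):
            if "-" in item:
                low, high = item.split("-")
                width = len(low)  # keep zero padding: 05 -> cn05
                names.extend(
                    f"{prefix}{i:0{width}d}" for i in range(int(low), int(high) + 1)
                )
            else:
                names.append(prefix + item)
    return names


def launch_instance(node_name):
    """Hypothetical placeholder: call the cloud provider's API here."""
    print(f"would launch an instance for {node_name}")


def main(argv):
    # argv[1] is the hostlist expression supplied by slurmctld.
    for node in expand_hostlist(argv[1]):
        launch_instance(node)


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv)
```

A suspend hook has the same shape: it receives a hostlist and terminates the matching instances instead of launching them.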
AWS ParallelCluster
AWS ParallelCluster is the reference implementation for SLURM-based autoscaling on AWS. It provisions head nodes, compute nodes, and shared storage automatically.
# config.yaml — AWS ParallelCluster configuration
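Region: eu-west-1
# The cluster name (hpc-prod) is not a key in the ParallelCluster 3 config file; pass it on the command line:
#   pcluster create-cluster --cluster-name hpc-prod --cluster-configuration config.yaml
HeadNode:
  InstanceType: c6i.2xlarge
  Networking:
    SubnetId: subnet-abc123
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: c6i-32xlarge
          InstanceType: c6i.32xlarge
          MinCount: 0
          MaxCount: 64
      Networking:
        SubnetIds:
          - subnet-abc123
Image:
  Os: alinux2
SharedStorage:
  - MountDir: /shared
    Name: shared-efs  # required label for the storage entry; placeholder name
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-0123456789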
ParallelCluster supplies its own ResumeProgram and SuspendProgram scripts. Administrators set MinCount and MaxCount on each compute resource within a queue; instances are then launched and terminated automatically as SLURM powers nodes up and down.
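The effect of the two bounds can be pictured as a clamp on demand. The function below is an illustration of that arithmetic only, not ParallelCluster's actual scaling logic, which works from the nodes SLURM allocates to each job. The function name and the figure of 128 CPUs per node are assumptions made for the example.

```python
import math


def target_node_count(pending_cpus, busy_nodes, cpus_per_node, min_count, max_count):
    """Nodes the queue should have running, clamped to [min_count, max_count]."""
    demand = busy_nodes + math.ceil(pending_cpus / cpus_per_node)
    return max(min_count, min(demand, max_count))


# Hypothetical load figures; MinCount=0 and MaxCount=64 as in the config above.
modest = target_node_count(
    pending_cpus=500, busy_nodes=10, cpus_per_node=128, min_count=0, max_count=64
)
burst = target_node_count(
    pending_cpus=20000, busy_nodes=10, cpus_per_node=128, min_count=0, max_count=64
)
print(modest, burst)
```

With these inputs the modest backlog asks for 14 nodes, while the burst is capped at 64 no matter how much work is queued. That cap is what keeps a runaway submission from turning into a runaway bill.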
Azure CycleCloud
Azure CycleCloud provides a web-based interface for defining cluster templates and integrates tightly with Azure Spot instances for cost optimization.
# cyclecloud_cluster.yaml — Azure CycleCloud template
[cluster HPC]

    [[node defaults]]
    Credentials = azure-credentials
    SubnetId = /subscriptions/.../subnets/compute

    [[nodearray compute]]
    MachineType = Standard_HB120rs_v3
    Azure.MaxScaleSetSize = 300
    InitialCount = 0
    MaxCount = 128
    Interruptible = true # use Spot instances
    MaxPrice = 0.5

        [[[configuration]]]
        slurm.autoscale = true
        slurm.default_partition = true
Azure CycleCloud’s autoscale daemon monitors SLURM’s queue and scales node arrays up or down. The Interruptible = true setting enables Azure Spot instances, which can cut compute costs by up to 90% relative to pay-as-you-go pricing for fault-tolerant workloads; MaxPrice caps the hourly price you are willing to pay.
Google Cloud HPC Toolkit
Google Cloud HPC Toolkit (since renamed Cluster Toolkit) provides Terraform-based deployment of SLURM clusters on Google Cloud.
# hpc_cluster.yaml — Google Cloud HPC Toolkit blueprint
blueprint_name: hpc-cluster-gcp
vars:
  project_id: my-hpc-project
  region: europe-west4
  zone: europe-west4-a
deployment_groups:
  - group: primary
    modules:
      - id: network
        source: modules/network/vpc
      - id: slurm_cluster
        source: community/modules/scheduler/SchedMD-slurm-on-gcp-controller
        settings:
          machine_type: n2-standard-4
          partitions:
            - name: compute
              machine_type: c2-standard-60
              max_node_count: 50
              enable_placement: true
The enable_placement: true setting requests compact placement, so that nodes are located physically close to one another. This is critical for MPI workloads where inter-node latency matters.
Autoscaling Design Considerations
Node startup time: Cloud instances typically take 3–5 minutes to boot and register with SLURM. Set ResumeTimeout to at least 300 seconds and use pre-baked images (with SLURM, MPI libraries, and application software already installed) to minimize startup time.
Always-on baseline: Keep a small number of nodes always running (SuspendExcNodes or MinCount > 0) to handle small interactive jobs that cannot wait for instance startup.
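Spot/Preemptible instances: When the provider reclaims an instance, SLURM sees a node failure, so make sure affected jobs can be requeued (JobRequeue=1 in slurm.conf, or sbatch --requeue per job). PreemptMode=REQUEUE additionally requeues jobs that SLURM itself preempts in favor of higher-priority work. Combine both with application-level checkpointing for long-running jobs.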
Cost controls: Set hard MaxCount limits per queue. On AWS, configure budget alerts; on Azure, use MaxPrice; on GCP, use committed use discounts for baseline nodes and Spot for burst.
Storage: Shared storage (EFS on AWS, Azure Files on Azure, Filestore on GCP) must be accessible to all dynamically provisioned nodes without manual mount configuration. Include the mount command in the instance startup script.
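The application-level checkpointing mentioned under Spot/Preemptible instances can be sketched as follows. This is a minimal illustration under stated assumptions: the state is a small JSON-serializable dict, the checkpoint lives on the shared filesystem, and the function names and the summing "work" are hypothetical stand-ins for a real solver loop.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a kill mid-write leaves the old file intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)  # atomic when source and target share a filesystem


def run(path, n_steps, every=10):
    """Run up to n_steps, resuming from the checkpoint at `path` if one exists."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        state["total"] += step  # stand-in for one unit of real work
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

When a requeued job starts again on a fresh node, calling run with the same path picks up from the last saved step instead of from zero. The checkpoint interval trades lost work against I/O load on the shared filesystem.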
Monitoring Autoscaling Behavior
# Check current node states (state suffix "~" = powered down, "#" = powering up, "%" = powering down)
sinfo -o "%N %T %C"
# Show pending jobs waiting for resources
squeue --state=PD --format="%.10i %.9P %.8j %.8u %R"
# Check autoscale events in the slurmctld log (the path is set by SlurmctldLogFile)
grep -E "resume|suspend|power" /var/log/slurmctld.log | tail -50
A Prometheus metric from the SLURM exporter (slurm_nodes_state) combined with a Grafana panel showing node count over time gives a clear picture of autoscaling activity and helps tune SuspendTime and MaxCount for your workload profile.
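Where no exporter is available yet, the same picture can be approximated by summarizing sinfo output directly. The sketch below assumes the text comes from `sinfo -h -N -o "%N %T"` (one node per line, no header) and relies on the state suffixes `~`, `#`, and `%` that SLURM appends for power-saving states. The function name and sample node names are illustrative.

```python
from collections import Counter

# Suffixes SLURM appends to a node state to show its power-saving status.
POWER_SUFFIXES = {"~": "powered_down", "#": "powering_up", "%": "powering_down"}


def power_summary(sinfo_text):
    """Count nodes per power state from `sinfo -h -N -o "%N %T"` output."""
    counts = Counter()
    for line in sinfo_text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        state = fields[1]
        label = next(
            (name for mark, name in POWER_SUFFIXES.items() if mark in state),
            "powered_up",
        )
        counts[label] += 1
    return counts


# Illustrative sample, not captured from a real cluster.
sample = """cn01 idle
cn02 allocated
cn05 idle~
cn06 allocated#
cn07 idle%
"""
print(dict(power_summary(sample)))
```

Logging these counts once a minute gives a rough time series of how many nodes are up, starting, or stopping, which is usually enough to judge whether SuspendTime is too aggressive.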
Autoscaling eliminates the binary choice between performance and cost, but it requires careful design of the hook scripts, instance images, and storage layers. Contact Mevasis to design an autoscaling HPC architecture tailored to your workload and cloud environment.