HPC Autoscaling with SLURM, AWS ParallelCluster, Azure CycleCloud

Static HPC cluster sizing always involves a trade-off: provision for peak load and you pay for idle capacity most of the time; provision for average load and users wait during peak periods. Autoscaling resolves this dilemma by dynamically adding and removing compute nodes based on actual queue demand.

How SLURM Autoscaling Works

SLURM’s autoscaling builds on its power-saving mechanism, which relies on two hook scripts: ResumeProgram and SuspendProgram. When a job is pending and no suitable node is powered up, slurmctld calls ResumeProgram with a hostlist expression naming the nodes to bring online. When a node has been idle for longer than SuspendTime seconds, slurmctld calls SuspendProgram with the nodes to power down, which in the cloud means terminating the instances.

# /etc/slurm/slurm.conf — autoscaling parameters
ResumeProgram=/usr/local/sbin/slurm-resume.sh
SuspendProgram=/usr/local/sbin/slurm-suspend.sh
ResumeTimeout=300        # seconds to wait for node to become ready
SuspendTime=120          # seconds of idle time before suspending
SuspendExcNodes=cn[01-04] # nodes to never suspend (always-on baseline)

The slurm-resume.sh script calls the cloud provider API to launch instances using a pre-configured image. The slurm-suspend.sh script terminates them. The cloud provider and instance type are determined at design time; only the node count varies at runtime.
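
The core of a resume hook can be sketched in a few lines of Python. This is a minimal illustration, not the script any particular product ships: expand_hostlist is a simplified stand-in for `scontrol show hostnames` (it handles one bracket group per name and no suffix after the bracket), and the launch callable is a hypothetical placeholder for the cloud provider API call.

```python
import re
import sys
from typing import Callable, List


def expand_hostlist(expr: str) -> List[str]:
    """Expand a simple hostlist such as 'cn[05-07,12],gpu1' into node names.

    Simplified: one bracket group per name, nothing after the bracket.
    On a real cluster, `scontrol show hostnames <expr>` does this properly.
    """
    names: List[str] = []
    # Split on commas that are not inside a bracket group.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", expr):
        match = re.fullmatch(r"([^\[]+)\[([^\]]*)\]", part)
        if not match:
            names.append(part)
            continue
        prefix, ranges = match.groups()
        for item in ranges.split(","):
            lo, _, hi = item.partition("-")
            hi = hi or lo
            width = len(lo)  # preserve zero padding, e.g. 05 -> cn05
            for n in range(int(lo), int(hi) + 1):
                names.append(f"{prefix}{n:0{width}d}")
    return names


def resume(hostlist: str, launch: Callable[[str], None]) -> List[str]:
    """Launch one instance per node named in the hostlist.

    `launch` is a hypothetical hook: in practice it would call the
    cloud provider API with the pre-configured image.
    """
    nodes = expand_hostlist(hostlist)
    for node in nodes:
        launch(node)
    return nodes


if __name__ == "__main__" and len(sys.argv) > 1:
    # slurmctld passes the hostlist expression as the first argument.
    resume(sys.argv[1], lambda node: print(f"would launch instance for {node}"))
```

Passing the launch function in as a parameter keeps the node-name handling testable without cloud credentials.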

AWS ParallelCluster

AWS ParallelCluster is AWS's open-source cluster management tool and the usual route to SLURM-based autoscaling on AWS. It provisions a head node, compute nodes, and shared storage automatically.

# config.yaml — AWS ParallelCluster configuration
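# The cluster name is not a configuration key; pass it on the command line:
#   pcluster create-cluster --cluster-name hpc-prod --cluster-configuration config.yaml
Region: eu-west-1

# The operating system is set once, at the top level
Image:
  Os: alinux2

HeadNode:
  InstanceType: c6i.2xlarge
  Networking:
    SubnetId: subnet-abc123

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: c6i-32xlarge
          InstanceType: c6i.32xlarge
          MinCount: 0     # no always-on nodes in this queue
          MaxCount: 64    # hard ceiling for autoscaling
      Networking:
        SubnetIds:
          - subnet-abc123

SharedStorage:
  - Name: shared-efs    # a Name is required; this value is illustrative
    MountDir: /shared
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-0123456789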

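ParallelCluster supplies the ResumeProgram and SuspendProgram scripts itself. Administrators set MinCount and MaxCount on each compute resource within a queue: MinCount nodes stay running permanently, and ParallelCluster launches and terminates the remaining nodes, up to MaxCount, as SLURM requests them.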
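
The effect of MinCount and MaxCount can be modeled as a simple clamp on demand. The sketch below is a simplified mental model, not ParallelCluster's actual scaling code; the function and parameter names are chosen for illustration.

```python
def target_node_count(busy_nodes: int, pending_node_demand: int,
                      min_count: int, max_count: int) -> int:
    """Return how many nodes should be running for one compute resource.

    busy_nodes:          nodes currently running jobs
    pending_node_demand: nodes that queued jobs are waiting for
    min_count/max_count: the MinCount and MaxCount limits
    """
    if min_count < 0 or max_count < min_count:
        raise ValueError("require 0 <= min_count <= max_count")
    wanted = busy_nodes + pending_node_demand
    # Never drop below the always-on baseline, never exceed the ceiling.
    return max(min_count, min(wanted, max_count))


if __name__ == "__main__":
    # 10 busy nodes, jobs waiting for 70 more, limits 0..64
    print(target_node_count(10, 70, 0, 64))
```

With the limits from the configuration above, demand beyond 64 nodes stays in the queue as pending jobs rather than launching more instances.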

Azure CycleCloud

Azure CycleCloud provides a web-based interface for defining cluster templates and integrates tightly with Azure Spot VMs for cost optimization.

# cyclecloud_cluster.yaml — Azure CycleCloud template
[cluster HPC]
  [[node defaults]]
    Credentials = azure-credentials
    SubnetId = /subscriptions/.../subnets/compute

  [[nodearray compute]]
    MachineType = Standard_HB120rs_v3
    Azure.MaxScaleSetSize = 300
    InitialCount = 0
    MaxCount = 128
    Interruptible = true         # use Spot instances
    MaxPrice = 0.5
    [[[configuration]]]
      slurm.autoscale = true
      slurm.default_partition = true

Azure CycleCloud’s autoscale daemon monitors SLURM’s queue and scales node arrays up or down. The Interruptible = true setting requests Azure Spot VMs, which can reduce compute costs by up to 90% compared with pay-as-you-go pricing. Spot VMs can be evicted at short notice, so the setting suits fault-tolerant workloads only.

Google Cloud HPC Toolkit

Google Cloud HPC Toolkit (since renamed Cluster Toolkit) provides Terraform-based deployment of SLURM clusters on Google Cloud from YAML blueprints.

# hpc_cluster.yaml — Google Cloud HPC Toolkit blueprint
blueprint_name: hpc-cluster-gcp

vars:
  project_id: my-hpc-project
  region: europe-west4
  zone: europe-west4-a

deployment_groups:
  - group: primary
    modules:
      - id: network
        source: modules/network/vpc
      - id: slurm_cluster
        source: community/modules/scheduler/SchedMD-slurm-on-gcp-controller
        settings:
          machine_type: n2-standard-4
          partitions:
            - name: compute
              machine_type: c2-standard-60
              max_node_count: 50
              enable_placement: true

The enable_placement: true setting applies a compact placement policy, which places nodes physically close to one another. This is critical for MPI workloads where inter-node latency matters.

Autoscaling Design Considerations

Node startup time: Cloud instances typically take 3–5 minutes to boot and register with SLURM. Set ResumeTimeout to at least 300 seconds and use pre-baked images (with SLURM, MPI libraries, and application software already installed) to minimize startup time.
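
One way to choose ResumeTimeout is to derive it from measured boot times rather than guessing. The helper below is a sketch under stated assumptions: the function name, the 20% margin, and the sample measurements are illustrative, and the 300-second floor follows the recommendation above.

```python
import math
from typing import Sequence


def suggest_resume_timeout(boot_seconds: Sequence[int],
                           margin_percent: int = 20,
                           floor: int = 300) -> int:
    """Suggest a ResumeTimeout from observed boot-to-registration times.

    Takes the slowest observed boot, adds a safety margin, and never
    goes below the floor. All defaults are illustrative assumptions.
    """
    if not boot_seconds:
        return floor
    padded = math.ceil(max(boot_seconds) * (100 + margin_percent) / 100)
    return max(floor, padded)


if __name__ == "__main__":
    # Hypothetical measurements, in seconds
    print(suggest_resume_timeout([180, 240, 290]))
```

A timeout that is too short causes SLURM to mark slow-booting nodes as down, so erring on the high side is usually the safer choice.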

Always-on baseline: Keep a small number of nodes always running (SuspendExcNodes or MinCount > 0) to handle small interactive jobs that cannot wait for instance startup.

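Spot/Preemptible instances: When the provider reclaims an instance, SLURM sees a node failure. Batch jobs on that node are requeued if JobRequeue=1 (the default) or if they were submitted with --requeue. PreemptMode=REQUEUE covers a different case: scheduler-driven preemption of one job by another. In both cases, combine requeueing with application-level checkpointing for long-running jobs, because a requeued job otherwise restarts from the beginning.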
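
Application-level checkpointing can be as simple as periodically writing the job's state to shared storage and reloading it at startup. The following is a minimal sketch, assuming the state fits in a small JSON document; the file name, the state layout, and the toy workload (summing step numbers) are all illustrative.

```python
import json
import os
import tempfile
from typing import Callable, Dict


def save_checkpoint(path: str, state: Dict[str, int]) -> None:
    """Write the state atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic on POSIX file systems


def load_checkpoint(path: str, default: Dict[str, int]) -> Dict[str, int]:
    """Return the saved state, or the default on a fresh start."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path: str, total_steps: int,
        should_stop: Callable[[], bool] = lambda: False,
        checkpoint_every: int = 100) -> Dict[str, int]:
    """Run a toy workload that resumes from the last checkpoint."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps and not should_stop():
        state["acc"] += state["step"]  # stand-in for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final save on completion or graceful stop
    return state


if __name__ == "__main__":
    print(run("checkpoint.json", 1000))
```

The should_stop hook is where a SIGTERM handler would set a flag in a real job. Because a reclaimed instance may disappear without a clean signal, the periodic save is the part that matters most: at worst, the job loses the work done since the last checkpoint.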

Cost controls: Set a hard MaxCount limit on every queue or compute resource. On AWS, configure budget alerts; on Azure, cap the Spot price with MaxPrice; on GCP, use committed use discounts for baseline nodes and Spot VMs for burst.
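
A MaxCount limit translates directly into a worst-case spend, which is worth computing before a cluster goes live. The sketch below shows the arithmetic; the prices and the discount are placeholder assumptions, not quoted provider rates, and the class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class QueueLimit:
    name: str
    max_count: int               # the queue's MaxCount
    price_per_node_hour: float   # assumed on-demand price, placeholder value
    spot_discount: float = 0.0   # 0.5 means Spot costs half of on-demand

    def max_hourly_cost(self) -> float:
        """Cost per hour if every node up to MaxCount is running."""
        return self.max_count * self.price_per_node_hour * (1 - self.spot_discount)


def monthly_ceiling(queues: Iterable[QueueLimit], hours: int = 730) -> float:
    """Worst-case monthly spend across all queues (730 h is an average month)."""
    return sum(q.max_hourly_cost() for q in queues) * hours


if __name__ == "__main__":
    queues = [
        QueueLimit("compute", max_count=64, price_per_node_hour=5.0),
        QueueLimit("burst", max_count=64, price_per_node_hour=5.0, spot_discount=0.5),
    ]
    print(monthly_ceiling(queues))
```

Comparing this ceiling with the budget alert threshold shows whether MaxCount is actually acting as a cost control or is set too high to matter.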

Storage: Shared storage (EFS on AWS, Azure Files on Azure, Filestore on GCP) must be accessible to all dynamically provisioned nodes without manual mount configuration. Include the mount command in the instance startup script.

Monitoring Autoscaling Behavior

# Check current node states (suffixes: "idle~" = powered down, "idle#" = powering up, "idle%" = powering down)
sinfo -o "%N %T %C"

# Show pending jobs waiting for resources
squeue --state=PD --format="%.10i %.9P %.8j %.8u %R"

# Check autoscale events in SLURM log
grep -E "resume|suspend|power" /var/log/slurmctld.log | tail -50

A node-state metric from a Prometheus SLURM exporter (slurm_nodes_state here; the exact metric name depends on the exporter), combined with a Grafana panel showing node count over time, gives a clear picture of autoscaling activity and helps tune SuspendTime and MaxCount for your workload profile.
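
Where no exporter is available, the same picture can be approximated by tallying the power-state suffixes that sinfo reports. This is a small sketch that assumes node-oriented input in the form produced by `sinfo -N -h -o "%N %T"`; the category names are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable

# SLURM appends these suffixes to the node state for power-saving nodes.
POWER_SUFFIXES = {
    "~": "powered_down",
    "#": "powering_up",
    "%": "powering_down",
}


def count_power_states(lines: Iterable[str]) -> Dict[str, int]:
    """Count nodes per power state from 'nodename state' lines.

    A node listed in several partitions appears more than once in
    node-oriented sinfo output, so nodes are deduplicated by name.
    Any state without a power suffix is counted as 'active'.
    """
    seen: Dict[str, str] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        node, state = fields
        seen[node] = POWER_SUFFIXES.get(state[-1], "active")
    return dict(Counter(seen.values()))


if __name__ == "__main__":
    sample = ["cn01 idle", "cn05 idle~", "cn06 allocated#"]
    print(count_power_states(sample))
```

Logging these counts at a fixed interval gives a rough time series of scale-up and scale-down activity.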

Autoscaling eliminates the binary choice between performance and cost, but it requires careful design of the hook scripts, instance images, and storage layers. Contact Mevasis to design an autoscaling HPC architecture tailored to your workload and cloud environment.
