SLURM Installation and Configuration: Complete Setup Guide
End-to-end SLURM HPC cluster setup: pre-installation planning, MUNGE key generation and distribution, partition architecture, cgroup-based CPU and memory isolation, GPU GRES configuration, accounting and fairshare with sacctmgr, common issues table, and Prometheus monitoring integration.
SLURM (Simple Linux Utility for Resource Management) is the de facto standard job scheduler for HPC clusters. Installing SLURM correctly from the beginning saves many hours of debugging later. This guide covers the complete process from pre-installation planning through production monitoring.
Pre-Installation Planning
Before running any install commands, answer these questions. The answers directly shape your configuration:
How many nodes and what types? Knowing the count and node roles (login, compute, GPU, high-memory) determines partition structure and resource labels.
What is the authentication/LDAP environment? SLURM identifies users by system UID, so every user must exist on all nodes with the same UID/GID. On multi-node clusters a central directory (LDAP, NIS, or similar) is the practical way to guarantee that.
What accounting and fairshare policy is needed? Department-level fairshare requires setting up the slurmdbd database service. Without slurmdbd, per-user resource tracking is not available.
What monitoring tools will be used? Prometheus + Grafana integration should be planned from day one, not retrofitted later.
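The UID/GID requirement above is easy to verify before installing anything. The following is a minimal sketch, not a definitive tool: the function name `check_id_consistency` and the "node uid gid" input layout are assumptions made for illustration, and the node names and the user `alice` in the usage comment are placeholders.

```shell
# Sketch: flag nodes whose UID/GID for one account differs from the first node listed.
# Input on stdin: one "node uid gid" record per line (this layout is an assumption).
check_id_consistency() {
    awk '
        NR == 1 { ref = $2 ":" $3; next }          # first record is the reference
        ($2 ":" $3) != ref { print $1; bad = 1 }   # report every node that disagrees
        END { exit bad }                           # non-zero exit if any mismatch
    '
}

# Hypothetical usage (node names and the user are placeholders):
#   for n in node01 node02 gpu01; do
#       echo "$n $(ssh "$n" id -u alice) $(ssh "$n" id -g alice)"
#   done | check_id_consistency
```

The function prints only the nodes that disagree with the first record and returns a non-zero status when it finds any, so it can gate a provisioning script.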
Step 1: MUNGE Authentication
SLURM uses MUNGE (MUNGE Uid ‘N’ Gid Emporium) for secure inter-daemon authentication. Every node in the cluster must share the same MUNGE key.
# Install MUNGE on ALL nodes
apt-get install -y munge libmunge-dev # Debian/Ubuntu
dnf install -y munge munge-libs munge-devel # RHEL/Rocky
# Generate key on the management node (once only)
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
# Distribute key to ALL compute nodes
for node in node{01..32} gpu{01..08}; do
    scp /etc/munge/munge.key "${node}:/etc/munge/"
    ssh "${node}" "chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key"
done
# Enable and start MUNGE (run this on every node)
systemctl enable --now munge
# Verify MUNGE end to end: encode a credential locally, decode it on a compute node
munge -n | ssh node01 unmunge
The MUNGE key must be identical on every node. If any node has a different key, that node will fail to communicate with slurmctld and all jobs to it will fail.
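One way to confirm that every node holds the same key is to compare checksums rather than the keys themselves. This is a sketch under stated assumptions: `key_digest` and `diff_digests` are hypothetical helper names, and the "node digest" input format is chosen here for illustration.

```shell
# Sketch: compare key digests instead of copying keys around to diff them.
# key_digest prints the SHA-256 of a file.
key_digest() {
    sha256sum "$1" | awk '{ print $1 }'
}

# diff_digests takes the reference digest as $1, reads "node digest" lines
# on stdin, and prints every node whose digest differs from the reference.
diff_digests() {
    awk -v ref="$1" '($2 "") != (ref "") { print $1 }'   # force string comparison
}

# Hypothetical usage, run as root on the management node:
#   ref=$(key_digest /etc/munge/munge.key)
#   for n in node{01..32} gpu{01..08}; do
#       echo "$n $(ssh "$n" sha256sum /etc/munge/munge.key | awk '{ print $1 }')"
#   done | diff_digests "$ref"
```

Empty output means all nodes match the management node. Any node name printed needs the key copied again and munge restarted.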
Step 2: SLURM Package Installation
# Ubuntu 22.04 / Debian 12
apt-get install -y slurm-wlm slurmdbd slurmctld slurmd
# RHEL 8 / Rocky Linux 8
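# SLURM is not in the standard repos; build from source or use a package from schedmd.com
# mariadb-devel is needed to build the slurmdbd MySQL plugin; --with-pmix below also needs the PMIx development headers
dnf install -y gcc make bzip2 munge-devel mariadb-devel perl perl-ExtUtils-MakeMaker pam-devel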
wget https://download.schedmd.com/slurm/slurm-23.11.1.tar.bz2
tar xjf slurm-23.11.1.tar.bz2 && cd slurm-23.11.1
./configure --prefix=/usr --sysconfdir=/etc/slurm --with-pmix --with-pam_dir=/lib64/security
make -j$(nproc) && make install
Create the SLURM user (same UID on ALL nodes):
# Use a consistent UID across the cluster — here 992
groupadd -g 992 slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm \
-u 992 -g slurm -s /bin/bash slurm
mkdir -p /var/spool/slurm /var/log/slurm /etc/slurm
chown slurm:slurm /var/spool/slurm /var/log/slurm
Step 3: Partition Architecture
Design your partition structure before writing slurm.conf. A typical research cluster has these partitions:
# /etc/slurm/slurm.conf (partial — node and partition definitions)
# === NODE DEFINITIONS ===
NodeName=node[01-32] \
CPUs=128 Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 \
RealMemory=512000 State=UNKNOWN
NodeName=gpu[01-08] \
CPUs=128 Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 \
RealMemory=512000 \
Gres=gpu:a100:4 State=UNKNOWN
NodeName=bigmem[01-04] \
CPUs=256 Sockets=4 CoresPerSocket=32 ThreadsPerCore=2 \
RealMemory=3000000 State=UNKNOWN
# === PARTITION DEFINITIONS ===
PartitionName=debug \
Nodes=node[01-02] Default=NO MaxTime=00:30:00 \
MaxCPUsPerNode=16 State=UP Priority=100
PartitionName=short \
Nodes=node[01-32] Default=YES MaxTime=1-00:00:00 \
State=UP Priority=50 DefMemPerCPU=4096
PartitionName=long \
Nodes=node[01-32] Default=NO MaxTime=7-00:00:00 \
State=UP Priority=20 DefMemPerCPU=4096
PartitionName=gpu \
Nodes=gpu[01-08] Default=NO MaxTime=2-00:00:00 \
State=UP Priority=50 DefMemPerCPU=8192
PartitionName=highmem \
Nodes=bigmem[01-04] Default=NO MaxTime=2-00:00:00 \
State=UP Priority=50
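In each node definition above, CPUs should equal Sockets × CoresPerSocket × ThreadsPerCore (for example 2 × 32 × 2 = 128). A small check like the following can catch arithmetic slips before slurmctld does. It is a sketch only: `check_node_topology` is a hypothetical helper, and it assumes simple `Key=Value` tokens with backslash line continuations as used in this guide.

```shell
# Sketch: check that CPUs equals Sockets x CoresPerSocket x ThreadsPerCore
# for each NodeName definition read from stdin.
check_node_topology() {
    # Join backslash-continued lines into one logical line each.
    sed -e ':a' -e '/\\$/N; s/\\\n/ /; ta' |
    awk '
        /^NodeName=/ {
            split("", v)                       # reset the key/value map
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")
                v[kv[1]] = kv[2]
            }
            expected = v["Sockets"] * v["CoresPerSocket"] * v["ThreadsPerCore"]
            status = (v["CPUs"] == expected) ? "ok" : "MISMATCH"
            printf "%s CPUs=%s expected=%d %s\n", v["NodeName"], v["CPUs"], expected, status
        }
    '
}

# Hypothetical usage:
#   check_node_topology < /etc/slurm/slurm.conf
```

Comparing the result with the output of `slurmd -C` on the node itself is a useful second check, since that command reports the hardware the daemon actually detects.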
Key slurm.conf global settings:
ClusterName=mycluster
SlurmctldHost=slurm-master
SlurmUser=slurm
AuthType=auth/munge
CredType=cred/munge
MpiDefault=pmix
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurm-master
JobAcctGatherType=jobacct_gather/cgroup
Step 4: cgroup-Based Resource Isolation
Without cgroup enforcement, SLURM’s memory and CPU limits are advisory only — a job that exceeds its allocation can impact neighboring jobs. Enable strict enforcement:
# /etc/slurm/cgroup.conf
CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes
AllowedRAMSpace=100
AllowedSwapSpace=0
MemorySwappiness=0
With ConstrainRAMSpace=yes, if a job requests --mem=64G and its processes exceed 64 GB, the kernel’s OOM killer terminates the offending processes within the job — not other users’ jobs.
Test cgroup enforcement after enabling:
# Submit a memory stress test
sbatch --mem=2G --wrap="stress-ng --vm 1 --vm-bytes 4G --timeout 60s"
# Job should be killed by OOM before 60 seconds; check with `sacct -j <jobid>`
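To turn the manual `sacct` check into a pass/fail test, the job state can be inspected in a script. The sketch below assumes `sacct` is called with `-n -P -o JobID,State` so that each line has the form `JobID|State`. `job_hit_oom` is a hypothetical helper name and `12345` is a placeholder job ID.

```shell
# Sketch: decide from sacct output whether any step of a job hit the memory limit.
# Reads "JobID|State" lines (sacct parsable format) on stdin.
job_hit_oom() {
    awk -F'|' '$2 == "OUT_OF_MEMORY" { found = 1 } END { exit !found }'
}

# Hypothetical usage (replace 12345 with the real job ID):
#   sacct -j 12345 -n -P -o JobID,State | job_hit_oom && echo "memory limit enforced"
```

If the stress job instead finishes as COMPLETED, the limit was not enforced and cgroup.conf should be rechecked.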
Step 5: GPU GRES Configuration
For GPU nodes, SLURM requires GRES (Generic RESource) configuration to track GPU allocation:
# /etc/slurm/gres.conf (on GPU nodes ONLY)
AutoDetect=nvml
# If AutoDetect doesn't work, specify explicitly:
# Name=gpu Type=a100 File=/dev/nvidia0
# Name=gpu Type=a100 File=/dev/nvidia1
# Name=gpu Type=a100 File=/dev/nvidia2
# Name=gpu Type=a100 File=/dev/nvidia3
In slurm.conf, add:
GresTypes=gpu
Test GPU GRES is working:
# Submit a GPU job
sbatch --gres=gpu:1 --partition=gpu --wrap="nvidia-smi"
# Check GPU was actually allocated
squeue -j <jobid> -o "%i %b" # should list the GPU request, e.g. "gpu:1" (the exact prefix varies by SLURM version)
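Inside a running GPU job, SLURM exports `CUDA_VISIBLE_DEVICES` with the devices it allocated, which gives a second way to confirm the allocation. A minimal sketch, with `count_visible_gpus` as a hypothetical helper name:

```shell
# Sketch: count the GPUs a job can see from a CUDA_VISIBLE_DEVICES-style list.
# An empty or missing value is treated as "no GPU exposed to the job".
count_visible_gpus() {
    if [ -z "${1:-}" ]; then
        echo 0
    else
        # The list is comma-separated, so the field count is the GPU count.
        printf '%s\n' "$1" | awk -F',' '{ print NF }'
    fi
}

# Hypothetical usage inside a job script submitted with --gres=gpu:2:
#   count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}"
```

A job submitted with `--gres=gpu:2` should report 2. A result of 0 suggests the GRES configuration or device constraint is not taking effect on that node.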
Step 6: Accounting and Fairshare with sacctmgr
slurmdbd stores all job accounting data in a MySQL/MariaDB database. Configure slurmdbd.conf and create the database first:
# Install and configure MariaDB
apt-get install -y mariadb-server
mysql -u root -e "
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'strongpassword';
CREATE DATABASE slurm_acct_db;
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;"
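# /etc/slurm/slurmdbd.conf (must be owned by the slurm user with mode 600)
AuthType=auth/munge
DbdHost=slurm-master
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=strongpassword
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm/slurmdbd.log
# Lock down the file, then start the daemon
chown slurm:slurm /etc/slurm/slurmdbd.conf && chmod 600 /etc/slurm/slurmdbd.conf
systemctl enable --now slurmdbd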
Set up accounting hierarchy:
# Add the cluster to accounting
sacctmgr add cluster mycluster
# Create department accounts with fairshare weights
sacctmgr add account bioinformatics Description="BioInfo Lab" Fairshare=40
sacctmgr add account ml_research Description="ML Research" Fairshare=35
sacctmgr add account engineering Description="Engineering" Fairshare=25
# Add users to accounts
sacctmgr add user alice Account=bioinformatics DefaultAccount=bioinformatics
sacctmgr add user bob Account=ml_research DefaultAccount=ml_research
# Set resource limits at account level (TRES memory is in MB: 2097152 MB = 2048 GB)
sacctmgr modify account bioinformatics set GrpTRES=cpu=512,mem=2097152
# Enforce limits and enable multi-factor priority (add to slurm.conf)
# AccountingStorageEnforce=associations,limits
# PriorityType=priority/multifactor
# PriorityWeightFairshare=10000
# PriorityWeightAge=1000
# PriorityDecayHalfLife=14-0
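The Fairshare values are relative weights, not percentages: each account's share is its raw value divided by the sum across its sibling accounts. The sketch below shows only that normalization step. SLURM's actual fairshare factor also depends on decayed historical usage, which is not modeled here. `normalize_shares` is a hypothetical helper name.

```shell
# Sketch: normalize raw Fairshare values among sibling accounts.
# Reads "account raw_shares" lines on stdin; prints "account fraction".
normalize_shares() {
    awk '
        { name[NR] = $1; raw[NR] = $2; total += $2 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s %.2f\n", name[i], raw[i] / total
        }
    '
}

# Usage with the three accounts defined above:
#   printf 'bioinformatics 40\nml_research 35\nengineering 25\n' | normalize_shares
```

With the weights used in this guide the fractions come out to 0.40, 0.35, and 0.25. Because the values are relative, a fourth account added later with Fairshare=25 would lower every existing fraction.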
Common Issues and Solutions
| Problem | Symptom | Solution |
|---|---|---|
| MUNGE key mismatch | Jobs submitted to specific nodes fail with auth error | Re-copy /etc/munge/munge.key to all nodes; restart munge everywhere |
| Node shows drain | squeue shows drained reason for pending jobs | Check the reason with sinfo -R, resolve the underlying issue, then run scontrol update nodename=nodeXX state=resume |
| Jobs stuck PD with Resources reason | Resources are available but job won’t start | Check scontrol show node for CPUAlloc and AllocMem; look for ghost allocations with squeue -a |
| GPU not allocated | --gres=gpu:1 job runs but nvidia-smi shows no GPU | Verify gres.conf on the GPU node and that NVML autodetect is working: slurmd -G |
| Memory limit ignored | OOM doesn’t kill over-limit jobs | Verify cgroup v2 is mounted: grep cgroup2 /proc/mounts; check cgroup.conf |
| slurmdbd unreachable | sacctmgr hangs or returns empty | Check MariaDB is running; verify AccountingStorageHost matches in slurm.conf |
| License conflict | Jobs pending with Licenses reason | Check scontrol show lic; release stale licenses with scontrol update LicenseName=X Count=N |
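Several of the issues above start with a node that is down or drained, so it helps to list those nodes in one step. A minimal sketch, assuming `sinfo` is called with `-N -h -o "%N %T"` so that each line is "node state". `unavailable_nodes` is a hypothetical helper name.

```shell
# Sketch: list nodes that are not schedulable, from "node state" lines on stdin.
# Matches states beginning with down, drain (drained/draining), or fail.
unavailable_nodes() {
    awk '$2 ~ /^(down|drain|fail)/ { print $1, $2 }' | sort -u
}

# Hypothetical usage:
#   sinfo -N -h -o "%N %T" | unavailable_nodes
```

The `sort -u` removes duplicates, since a node that belongs to several partitions appears once per partition in node-oriented `sinfo` output.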
Prometheus Monitoring Integration
Add SLURM metrics to Prometheus using the SLURM exporter:
# Install slurm-exporter
wget https://github.com/vpenso/prometheus-slurm-exporter/releases/download/0.19/prometheus-slurm-exporter_0.19_linux_amd64.tar.gz
tar xzf prometheus-slurm-exporter_0.19_linux_amd64.tar.gz
mv prometheus-slurm-exporter /usr/local/bin/
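# Assumes a prometheus-slurm-exporter.service unit has been installed; moving the binary alone does not create one
systemctl enable --now prometheus-slurm-exporter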
Add to prometheus.yml:
scrape_configs:
  - job_name: slurm
    scrape_interval: 30s
    static_configs:
      - targets: ["slurm-master:8080"]
Critical metrics to alert on:
| Metric | Alert Condition | Severity |
|---|---|---|
| slurm_nodes_down | > 0 for 5 minutes | Warning |
| slurm_queue_pending | > 100 for 30 minutes | Info |
| slurm_cpus_idle / slurm_cpus_total | < 5% for 1 hour | Warning (cluster saturated) |
| slurm_jobs_failed rate | > 10/minute | Critical |
| slurm_nodes_drain | > 0 for 1 hour | Warning |
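Before wiring these thresholds into alert rules, the idle-CPU ratio can be spot-checked directly against the exporter endpoint. This is a sketch under assumptions: it expects the metric names `slurm_cpus_idle` and `slurm_cpus_total` in Prometheus text format, and `idle_cpu_percent` is a hypothetical helper name.

```shell
# Sketch: compute the idle-CPU percentage from Prometheus text-format metrics on stdin.
# The metric names slurm_cpus_idle and slurm_cpus_total are assumed; adjust to your exporter.
idle_cpu_percent() {
    awk '
        $1 == "slurm_cpus_idle"  { idle = $2 }
        $1 == "slurm_cpus_total" { total = $2 }
        END {
            if (total > 0) printf "%.1f\n", 100 * idle / total
            else exit 1          # metrics missing or total is zero
        }
    '
}

# Hypothetical usage against the scrape target configured above:
#   curl -s http://slurm-master:8080/metrics | idle_cpu_percent
```

A value that stays below 5 for an hour corresponds to the saturation warning in the table.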
Starting Services
# Start in order: munge → slurmdbd → slurmctld (on management node)
systemctl enable --now munge
systemctl enable --now slurmdbd
systemctl enable --now slurmctld
# On compute/GPU nodes: munge → slurmd
systemctl enable --now munge
systemctl enable --now slurmd
# Verify cluster is up
sinfo -l # should show all partitions and nodes
scontrol show config # confirm config was parsed correctly
sacctmgr show cluster # confirm slurmdbd connection
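The `sinfo` check can also be scripted so that a missing partition fails loudly, for example at the end of a provisioning run. A minimal sketch, assuming `sinfo -h -o "%P"` output with one partition name per line. `missing_partitions` is a hypothetical helper name.

```shell
# Sketch: report expected partitions that do not appear in sinfo output.
# Arguments: expected partition names. Stdin: one partition name per line
# (sinfo marks the default partition with a trailing "*", stripped here).
missing_partitions() {
    local seen p
    seen=$(sed 's/\*$//' | sort -u)
    for p in "$@"; do
        printf '%s\n' "$seen" | grep -qxF "$p" || echo "$p"
    done
}

# Hypothetical usage with the partitions defined in Step 3:
#   sinfo -h -o "%P" | missing_partitions debug short long gpu highmem
```

Empty output means every expected partition is visible to the controller.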
A properly configured SLURM cluster is the foundation of a productive HPC environment. Taking the time to set up cgroup isolation, accounting, and monitoring from the start pays dividends in operational stability and user satisfaction. For turnkey SLURM installation and configuration services, contact Mevasis.