SLURM Installation and Configuration: Complete Setup Guide

End-to-end SLURM HPC cluster setup: pre-installation planning, MUNGE key generation and distribution, partition architecture, cgroup-based CPU and memory isolation, GPU GRES configuration, accounting and fairshare with sacctmgr, common issues table, and Prometheus monitoring integration.

SLURM (Simple Linux Utility for Resource Management) is the de facto standard job scheduler for HPC clusters. Installing SLURM correctly from the beginning saves many hours of debugging later. This guide covers the complete process from pre-installation planning through production monitoring.

Pre-Installation Planning

Before running any install commands, answer these questions. The answers directly shape your configuration:

How many nodes and what types? Knowing the count and node roles (login, compute, GPU, high-memory) determines partition structure and resource labels.

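What is the authentication/LDAP environment? SLURM identifies users and job owners by system UID, so every user must exist on all nodes with the same UID/GID. A central directory (LDAP, NIS, or similar) is the practical way to guarantee this on multi-node clusters; synchronizing /etc/passwd by hand works but is error-prone.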

What accounting and fairshare policy is needed? Department-level fairshare requires setting up the slurmdbd database service. Without slurmdbd there is no persistent accounting database, so fairshare and per-account resource limits are not available.

What monitoring tools will be used? Prometheus + Grafana integration should be planned from day one, not retrofitted later.
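The UID/GID requirement above can be checked mechanically before SLURM is installed. The following is a minimal sketch, assuming passwordless SSH to the nodes; the helper name uid_mismatches and its input format (one "node uid gid" line per node) are illustrative choices, not part of SLURM.

```shell
#!/usr/bin/env bash
# uid_mismatches: read "node uid gid" lines on stdin and print every node
# whose uid/gid pair differs from the first line (treated as the reference).
# Hypothetical helper for illustration; not a SLURM tool.
uid_mismatches() {
    awk 'NR == 1 { ref = $2 ":" $3; next }
         ($2 ":" $3) != ref { print $1 }'
}

# Example (the user name alice and the node names are placeholders):
#   for n in node01 node02 gpu01; do
#       echo "$n $(ssh "$n" id -u alice) $(ssh "$n" id -g alice)"
#   done | uid_mismatches
```

Empty output means every node agrees with the first one; any node name printed needs its account fixed before jobs are scheduled there.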

Step 1: MUNGE Authentication

SLURM uses MUNGE (MUNGE Uid ‘N’ Gid Emporium) for secure inter-daemon authentication. Every node in the cluster must share the same MUNGE key.

# Install MUNGE on ALL nodes
apt-get install -y munge libmunge-dev    # Debian/Ubuntu
dnf install -y munge munge-libs munge-devel   # RHEL/Rocky

# Generate key on the management node (once only)
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key

# Distribute key to ALL compute nodes
for node in node{01..32} gpu{01..08}; do
    scp /etc/munge/munge.key ${node}:/etc/munge/
    ssh ${node} "chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key"
done

# Enable and start MUNGE (run this on every node, for example via ssh in the loop above)
systemctl enable --now munge

# Verify: encode a credential locally and decode it on a compute node
munge -n | ssh node01 unmunge

The MUNGE key must be identical on every node. If any node has a different key, that node will fail to communicate with slurmctld and all jobs to it will fail.
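One way to confirm the keys really are identical is to compare checksums rather than trusting the copy step. This is a sketch under assumptions: the helper name key_mismatches is hypothetical, and collecting the checksums requires root (or munge) access on each node because of the 400 permissions set above.

```shell
#!/usr/bin/env bash
# key_mismatches REF_SUM: read "checksum node" lines on stdin and print the
# nodes whose checksum differs from REF_SUM. Hypothetical helper.
key_mismatches() {
    local ref="$1" sum node
    while read -r sum node; do
        [ "$sum" = "$ref" ] || printf '%s\n' "$node"
    done
}

# Example, run as root on the management node:
#   ref=$(sha256sum /etc/munge/munge.key | cut -d' ' -f1)
#   for n in node{01..32} gpu{01..08}; do
#       echo "$(ssh "$n" sha256sum /etc/munge/munge.key | cut -d' ' -f1) $n"
#   done | key_mismatches "$ref"
```

Any node printed holds a different key and should receive a fresh copy followed by a munge restart.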

Step 2: SLURM Package Installation

# Ubuntu 22.04 / Debian 12
apt-get install -y slurm-wlm slurmdbd slurmctld slurmd

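# RHEL 8 / Rocky Linux 8
# SLURM is not in the standard repos; build from source or use a package from schedmd.com
# mariadb-devel is needed to build the MySQL accounting plugin used by slurmdbd;
# --with-pmix below additionally requires the PMIx development headers.
dnf install -y gcc make munge-devel mariadb-devel perl perl-ExtUtils-MakeMaker pam-devel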
wget https://download.schedmd.com/slurm/slurm-23.11.1.tar.bz2
tar xjf slurm-23.11.1.tar.bz2 && cd slurm-23.11.1
./configure --prefix=/usr --sysconfdir=/etc/slurm --with-pmix --with-pam_dir=/lib64/security
make -j$(nproc) && make install

Create the SLURM user (same UID on ALL nodes):

# Use a consistent UID across the cluster — here 992
groupadd -g 992 slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm \
    -u 992 -g slurm -s /bin/bash slurm
mkdir -p /var/spool/slurm /var/log/slurm /etc/slurm
chown slurm:slurm /var/spool/slurm /var/log/slurm

Step 3: Partition Architecture

Design your partition structure before writing slurm.conf. A typical research cluster has these partitions:

# /etc/slurm/slurm.conf (partial — node and partition definitions)

# === NODE DEFINITIONS ===
NodeName=node[01-32] \
    CPUs=128 Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 \
    RealMemory=512000 State=UNKNOWN

NodeName=gpu[01-08] \
    CPUs=128 Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 \
    RealMemory=512000 \
    Gres=gpu:a100:4 State=UNKNOWN

NodeName=bigmem[01-04] \
    CPUs=256 Sockets=4 CoresPerSocket=32 ThreadsPerCore=2 \
    RealMemory=3000000 State=UNKNOWN

# === PARTITION DEFINITIONS ===
PartitionName=debug \
    Nodes=node[01-02] Default=NO MaxTime=00:30:00 \
    MaxCPUsPerNode=16 State=UP Priority=100

PartitionName=short \
    Nodes=node[01-32] Default=YES MaxTime=1-00:00:00 \
    State=UP Priority=50 DefMemPerCPU=4096

PartitionName=long \
    Nodes=node[01-32] Default=NO MaxTime=7-00:00:00 \
    State=UP Priority=20 DefMemPerCPU=4096

PartitionName=gpu \
    Nodes=gpu[01-08] Default=NO MaxTime=2-00:00:00 \
    State=UP Priority=50 DefMemPerCPU=8192

PartitionName=highmem \
    Nodes=bigmem[01-04] Default=NO MaxTime=2-00:00:00 \
    State=UP Priority=50
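In the node definitions above, CPUs equals Sockets × CoresPerSocket × ThreadsPerCore (for example 2 × 32 × 2 = 128). A mismatch between these fields is an easy typo to make, so the sketch below checks that arithmetic on slurm.conf text. The function name check_node_cpus is hypothetical, and the check assumes CPUs is meant to count hardware threads, as it does in this guide.

```shell
#!/usr/bin/env bash
# check_node_cpus: read slurm.conf text on stdin, join backslash-continued
# lines, and report NodeName entries where
# CPUs != Sockets * CoresPerSocket * ThreadsPerCore.
check_node_cpus() {
    # Join lines ending in a backslash with the line that follows them.
    sed -e ':a' -e '/\\$/N; s/\\\n//; ta' |
    awk '/^NodeName=/ {
        split("", kv)                      # clear the key/value table
        for (i = 1; i <= NF; i++) {
            n = split($i, p, "=")
            if (n == 2) kv[p[1]] = p[2]
        }
        want = kv["Sockets"] * kv["CoresPerSocket"] * kv["ThreadsPerCore"]
        if (kv["CPUs"] != want)
            printf "%s: CPUs=%s, expected %d\n", kv["NodeName"], kv["CPUs"], want
    }'
}

# Example:
#   check_node_cpus < /etc/slurm/slurm.conf
```

Running slurmd -C on a compute node prints the hardware values that node actually detects, which is a useful cross-check against what the file declares.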

Key slurm.conf global settings:

ClusterName=mycluster
SlurmctldHost=slurm-master
SlurmUser=slurm
AuthType=auth/munge
CredType=cred/munge
MpiDefault=pmix
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurm-master
JobAcctGatherType=jobacct_gather/cgroup

Step 4: cgroup-Based Resource Isolation

Without cgroup enforcement, SLURM’s memory and CPU limits are advisory only — a job that exceeds its allocation can impact neighboring jobs. Enable strict enforcement:

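# /etc/slurm/cgroup.conf
CgroupMountpoint=/sys/fs/cgroup
# CgroupAutomount is for older releases; recent versions may report it as
# defunct, in which case remove the line
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes
AllowedRAMSpace=100
AllowedSwapSpace=0
# MemorySwappiness applies to cgroup v1 only; cgroup v2 has no per-cgroup swappiness
MemorySwappiness=0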

With ConstrainRAMSpace=yes, a job that requests --mem=64G and then exceeds 64 GB has its own processes terminated by the kernel’s OOM killer; other users’ jobs on the node are unaffected.

Test cgroup enforcement after enabling:

# Submit a memory stress test: request 2 GB, then try to allocate 4 GB
sbatch --mem=2G --wrap="stress-ng --vm 1 --vm-bytes 4G --timeout 60s"
# The job should be OOM-killed well before 60 seconds; `sacct -j <jobid>` should report the OUT_OF_MEMORY state
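To make the result of that test explicit rather than reading it by eye, the accounting output can be filtered for the OUT_OF_MEMORY state. This is a small sketch that assumes accounting (Step 6) is already running; oom_steps is a hypothetical helper name, and it expects the pipe-separated output produced by sacct with -n -P.

```shell
#!/usr/bin/env bash
# oom_steps: read `sacct -n -P -o JobID,State` output (pipe-separated) on
# stdin and print the job and step IDs that ended in OUT_OF_MEMORY.
oom_steps() {
    awk -F'|' '$2 == "OUT_OF_MEMORY" { print $1 }'
}

# Example (needs a live cluster with slurmdbd; substitute the real job ID):
#   sacct -j 1234 -n -P -o JobID,State | oom_steps
```

If the stress job's ID is not printed, the job was not OOM-killed, which points at cgroup.conf or the cgroup mount on that node.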

Step 5: GPU GRES Configuration

For GPU nodes, SLURM requires GRES (Generic RESource) configuration to track GPU allocation:

# /etc/slurm/gres.conf (on GPU nodes ONLY)
AutoDetect=nvml

# If AutoDetect doesn't work, specify explicitly:
# Name=gpu Type=a100 File=/dev/nvidia0
# Name=gpu Type=a100 File=/dev/nvidia1
# Name=gpu Type=a100 File=/dev/nvidia2
# Name=gpu Type=a100 File=/dev/nvidia3

In slurm.conf, add:

GresTypes=gpu

Test GPU GRES is working:

# Submit a GPU job
sbatch --gres=gpu:1 --partition=gpu --wrap="nvidia-smi"
# Check the GPU request while the job is still queued or running
squeue -j <jobid> -o "%i %b"   # should list gpu:1 (the exact prefix varies by SLURM version)
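Inside a GPU job, SLURM exports CUDA_VISIBLE_DEVICES with the devices it allocated, so a job script can confirm how many GPUs it actually received. The helper below is a sketch; the name count_visible_gpus is hypothetical.

```shell
#!/usr/bin/env bash
# count_visible_gpus LIST: count the entries in a comma-separated device
# list such as the CUDA_VISIBLE_DEVICES value set inside a GPU job.
count_visible_gpus() {
    local list="$1"
    [ -z "$list" ] && { echo 0; return; }
    local IFS=','
    # Word-split the list on commas into positional parameters, then count.
    set -- $list
    echo "$#"
}

# Example, inside a job script submitted with --gres=gpu:2:
#   count_visible_gpus "$CUDA_VISIBLE_DEVICES"
```

A count of 0 inside a job that requested GPUs suggests gres.conf was not picked up on that node.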

Step 6: Accounting and Fairshare with sacctmgr

slurmdbd stores all job accounting data in a MySQL/MariaDB database. Configure slurmdbd.conf and create the database first:

# Install and configure MariaDB
apt-get install -y mariadb-server
mysql -u root -e "
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'strongpassword';
CREATE DATABASE slurm_acct_db;
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;"

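# /etc/slurm/slurmdbd.conf (must be owned by the slurm user with mode 600)
AuthType=auth/munge
DbdHost=slurm-master
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=strongpassword
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm/slurmdbd.log

# Lock down the file (it contains the database password), then start the daemon
chown slurm:slurm /etc/slurm/slurmdbd.conf && chmod 600 /etc/slurm/slurmdbd.conf
systemctl enable --now slurmdbd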

Set up accounting hierarchy:

# Add the cluster to accounting
sacctmgr add cluster mycluster

# Create department accounts with fairshare weights
sacctmgr add account bioinformatics Description="BioInfo Lab" Fairshare=40
sacctmgr add account ml_research    Description="ML Research"  Fairshare=35
sacctmgr add account engineering    Description="Engineering"   Fairshare=25

# Add users to accounts
sacctmgr add user alice Account=bioinformatics DefaultAccount=bioinformatics
sacctmgr add user bob   Account=ml_research    DefaultAccount=ml_research

# Set resource limits at account level (mem is in MB: 2097152 MB = 2048 GB)
sacctmgr modify account bioinformatics set GrpTRES=cpu=512,mem=2097152

# Enable multi-factor priority and limit enforcement (add to slurm.conf)
# PriorityType=priority/multifactor
# PriorityWeightFairshare=10000
# PriorityWeightAge=1000
# PriorityDecayHalfLife=14-0
# AccountingStorageEnforce=associations,limits
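Fairshare values are relative weights among sibling accounts, so the 40/35/25 split above gives each account that fraction of the total. The sketch below computes those fractions from "account weight" pairs. It is plain arithmetic on the weights, not SLURM's full fairshare calculation, which also factors in decayed historical usage; normalized_shares is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# normalized_shares: read "account weight" lines on stdin and print each
# account's fraction of the sibling total (weight / sum of weights).
normalized_shares() {
    awk '{ name[NR] = $1; w[NR] = $2; total += $2 }
         END { for (i = 1; i <= NR; i++)
                   printf "%s %.2f\n", name[i], w[i] / total }'
}

# Example with the weights used in this guide:
#   printf 'bioinformatics 40\nml_research 35\nengineering 25\n' | normalized_shares
```

Because only the ratios matter, weights of 8/7/5 would produce the same fractions as 40/35/25.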

Common Issues and Solutions

| Problem | Symptom | Solution |
|---|---|---|
| MUNGE key mismatch | Jobs submitted to specific nodes fail with auth error | Re-copy /etc/munge/munge.key to all nodes; restart munge everywhere |
| Node shows drain | sinfo shows the node as drained and jobs that need it stay pending | Find the cause with sinfo -R, resolve it, then run scontrol update nodename=nodeXX state=resume |
| Jobs stuck PD with Resources reason | Resources are available but job won’t start | Check scontrol show node for AllocMem, AllocCPUs; look for ghost allocations with squeue -a |
| GPU not allocated | --gres=gpu:1 job runs but nvidia-smi shows no GPU | Verify gres.conf on the GPU node and that NVML autodetect is working: slurmd -G prints the GRES the node detects |
| Memory limit ignored | OOM doesn’t kill over-limit jobs | Verify cgroup v2 is mounted: grep cgroup2 /proc/mounts; check cgroup.conf |
| slurmdbd unreachable | sacctmgr hangs or returns empty | Check MariaDB is running; verify AccountingStorageHost matches in slurm.conf |
| License conflict | Jobs pending with Licenses reason | Check scontrol show lic; release stale licenses with scontrol update LicenseName=X Count=N |
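Several of the rows above start from spotting which nodes are unhealthy. A minimal sketch for that, assuming the two-column "node state" output of sinfo with a custom format; problem_nodes is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# problem_nodes: read `sinfo -h -N -o "%N %T"` output ("node state" per
# line) on stdin and print the nodes that are down, drained, or draining.
# sinfo -N repeats a node once per partition, so duplicates are removed.
problem_nodes() {
    awk '$2 ~ /^(down|drain)/ { print $1 " " $2 }' | sort -u
}

# Example (needs a live cluster):
#   sinfo -h -N -o "%N %T" | problem_nodes
```

For each node listed, sinfo -R shows the recorded reason, which usually indicates which row of the table applies.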

Prometheus Monitoring Integration

Add SLURM metrics to Prometheus using the SLURM exporter:

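# Install slurm-exporter
wget https://github.com/vpenso/prometheus-slurm-exporter/releases/download/0.19/prometheus-slurm-exporter_0.19_linux_amd64.tar.gz
tar xzf prometheus-slurm-exporter_0.19_linux_amd64.tar.gz
mv prometheus-slurm-exporter /usr/local/bin/
# The next command assumes a systemd unit named prometheus-slurm-exporter.service exists;
# create one that runs /usr/local/bin/prometheus-slurm-exporter if it does not
systemctl enable --now prometheus-slurm-exporter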

Add to prometheus.yml:

scrape_configs:
  - job_name: slurm
    scrape_interval: 30s
    static_configs:
      - targets: ["slurm-master:8080"]

Critical metrics to alert on:

| Metric | Alert Condition | Severity |
|---|---|---|
| slurm_nodes_down | > 0 for 5 minutes | Warning |
| slurm_queue_pending | > 100 for 30 minutes | Info |
| slurm_cpus_idle / total | < 5% for 1 hour | Warning (cluster saturated) |
| slurm_jobs_failed rate | > 10/minute | Critical |
| slurm_nodes_drain | > 0 for 1 hour | Warning |
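The idle-CPU condition in the table is a ratio, so it helps to see the arithmetic on real exporter output before writing an alert rule. The sketch below reads Prometheus text-format metrics and prints the idle percentage. It assumes the exporter exposes a slurm_cpus_total metric alongside slurm_cpus_idle; that name and the helper name idle_cpu_pct are assumptions, so check the exporter's /metrics output first.

```shell
#!/usr/bin/env bash
# idle_cpu_pct: read Prometheus text-format metrics on stdin and print the
# idle CPU percentage, 100 * slurm_cpus_idle / slurm_cpus_total.
# The metric name slurm_cpus_total is an assumption about the exporter.
idle_cpu_pct() {
    awk '$1 == "slurm_cpus_idle"  { idle = $2 }
         $1 == "slurm_cpus_total" { total = $2 }
         END { if (total > 0) printf "%.1f\n", 100 * idle / total
               else { print "no slurm_cpus_total metric" > "/dev/stderr"; exit 1 } }'
}

# Example against the scrape target configured above:
#   curl -s http://slurm-master:8080/metrics | idle_cpu_pct
```

A value under 5 that persists for an hour means nearly every CPU is allocated, which is the condition the warning is meant to surface.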

Starting Services

# Start in order: munge → slurmdbd → slurmctld (on management node)
systemctl enable --now munge
systemctl enable --now slurmdbd
systemctl enable --now slurmctld

# On compute/GPU nodes: munge → slurmd
systemctl enable --now munge
systemctl enable --now slurmd

# Verify cluster is up
sinfo -l            # should show all partitions and nodes
scontrol show config # confirm config was parsed correctly
sacctmgr show cluster # confirm slurmdbd connection

A properly configured SLURM cluster is the foundation of a productive HPC environment. Taking the time to set up cgroup isolation, accounting, and monitoring from the start pays dividends in operational stability and user satisfaction. For turnkey SLURM installation and configuration services, contact Mevasis.