SLURM (Simple Linux Utility for Resource Management) is the workload manager running on the majority of HPC clusters worldwide — more than 60% of TOP500 systems depend on it. This guide covers the essential commands, job script writing, partition management, and troubleshooting from beginner to intermediate level.
How SLURM Works: Core Architecture
SLURM consists of three primary components:
- slurmctld (Controller Daemon): Makes central scheduling decisions, manages job queues
- slurmd (Node Daemon): Runs on each compute node, launches and monitors jobs
- slurmdbd (Database Daemon): Stores accounting data, job history, and usage records
When you submit a job with sbatch, it is handed to slurmctld, which queues it and reserves appropriate nodes; slurmd on those nodes then launches and monitors the job.
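A quick sanity check from any login node: scontrol ping reports whether the controller daemon is responding (no cluster-specific assumptions needed).
scontrol ping                      # Reports whether the primary (and backup) slurmctld responds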
Command Reference
Job Submission
# Submit a batch script
sbatch myjob.sh
# Inline without a script
sbatch --ntasks=8 --time=02:00:00 --mem=32G --wrap="python train.py"
# Interactive session
srun --ntasks=4 --time=01:00:00 --pty bash
# Interactive session with GPU allocation
srun --partition=gpu --gres=gpu:h100:2 --ntasks=1 --pty bash
Job Status Monitoring
squeue # All queued jobs
squeue -u $USER # Your own jobs
squeue -j <job_id> # Specific job
squeue -p gpu # Jobs in a specific partition
squeue --format="%i %j %T %R" # Custom format: ID, name, state, reason
scontrol show job <job_id> # Detailed job information
sinfo # Cluster node status
sinfo -p compute # Specific partition status
Cancellation and Control
scancel <job_id> # Cancel specific job
scancel -u $USER # Cancel all your jobs
scancel -u $USER -t PENDING # Cancel only pending jobs
scontrol hold <job_id> # Hold a job
scontrol release <job_id> # Release the hold
Accounting and Reporting
sacct -j <job_id>
sacct -j <job_id> --format=JobID,JobName,State,CPUTime,MaxRSS,Elapsed
sacct -u $USER --starttime=2026-01-01 --format=JobID,State,Elapsed
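Many clusters also install the contributed seff utility; if your site provides it, it condenses the same accounting data into a CPU- and memory-efficiency summary for a finished job:
seff <job_id>                      # CPU/memory efficiency summary (availability depends on your site)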
Writing Job Scripts: Core Structure
SLURM job scripts use #SBATCH directives to specify resources. Here is a baseline template:
#!/bin/bash
#SBATCH --job-name=my_simulation # Job name
#SBATCH --output=logs/%j.out # Stdout: %j = job ID
#SBATCH --error=logs/%j.err # Stderr
#SBATCH --partition=compute # Target partition
#SBATCH --nodes=4 # Number of nodes
#SBATCH --ntasks-per-node=64 # Tasks per node
#SBATCH --cpus-per-task=1 # CPUs per task
#SBATCH --mem=256G # Memory per node
#SBATCH --time=12:00:00 # Max wall time (HH:MM:SS)
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@institution.edu
# Environment setup
module load openmpi/4.1.5 gcc/12.3
echo "Job ID: $SLURM_JOB_ID"
echo "Nodes: $SLURM_JOB_NODELIST"
echo "Start: $(date)"
srun ./my_mpi_program input.dat
echo "End: $(date)"
GPU Job Scripts
GPU workloads require additional directives:
#!/bin/bash
#SBATCH --job-name=deep_learning
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1 # One launcher task per node; torchrun spawns one process per GPU
#SBATCH --gres=gpu:h100:4 # 4× H100 per node
#SBATCH --cpus-per-task=64 # 16 CPUs per GPU × 4 GPUs
#SBATCH --mem=320G
#SBATCH --time=48:00:00
module load cuda/12.3 pytorch/2.2
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=12355
# bash -c defers expansion so each node's launcher task sees its own SLURM_NODEID
srun bash -c 'python -m torch.distributed.run \
    --nproc_per_node=4 \
    --nnodes=$SLURM_NNODES \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train.py --epochs 100 --batch-size 256'
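Before launching training, it can help to confirm that the allocation looks right. A minimal check, run as one task per node, lists the GPUs SLURM actually granted:
srun --ntasks-per-node=1 nvidia-smi -L   # Should list the four allocated H100s on each node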
Array Jobs: Parametric Studies
Job arrays dramatically improve efficiency for parametric sweeps:
#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --array=1-100%10 # 100 parameter sets; %10 caps concurrent tasks at 10
#SBATCH --ntasks=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00
PARAM=$(sed -n "${SLURM_ARRAY_TASK_ID}p" parameters.txt)
srun ./simulation --param "$PARAM" --output results/run_${SLURM_ARRAY_TASK_ID}.dat
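Here, parameters.txt simply holds one parameter value per line, and SLURM_ARRAY_TASK_ID selects line N for array task N. The contents below are purely illustrative:
# parameters.txt — one value per line (lines 1–100)
0.001
0.005
0.010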
Job arrays significantly reduce scheduler overhead compared to submitting individual jobs.
Job Dependencies: Workflow Chaining
# Stage 1: preprocessing
jid1=$(sbatch --parsable preprocess.sh)
# Stage 2: run only if stage 1 succeeded
jid2=$(sbatch --parsable --dependency=afterok:$jid1 simulate.sh)
# Stage 3: analysis
sbatch --dependency=afterok:$jid2 analyze.sh
Dependency types:
| Type | Meaning |
|---|---|
| afterok | Begin after the specified job completes successfully |
| afternotok | Begin after the specified job terminates in a failed state |
| afterany | Begin after the specified job ends, regardless of state |
| after | Begin after the specified job starts execution |
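For example, a cleanup or notification step that should run however the simulation ends can hang off afterany (cleanup.sh here is a hypothetical script):
sbatch --dependency=afterany:$jid2 cleanup.sh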
Partitions and QOS
Cluster administrators define partitions for different job types:
sinfo -o "%P %D %C %l" # Partition name, node count, CPU state, max time
Typical partition layout:
| Partition | Purpose | Max Time | Priority |
|---|---|---|---|
| debug | Short test jobs | 1 hour | High |
| compute | General CPU workloads | 72 hours | Normal |
| gpu | GPU computation | 48 hours | Normal |
| bigmem | High-memory jobs | 24 hours | Normal |
| long | Extended simulations | 7 days | Low |
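QOS (Quality of Service) levels layer additional limits and priorities on top of partitions. If the accounting database is configured, you can list them with sacctmgr and request one explicitly; the QOS name debug below is illustrative:
sacctmgr show qos format=Name,Priority,MaxWall,MaxTRESPU   # List configured QOS levels
#SBATCH --qos=debug                                        # Request a QOS in a job script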
Resource Optimization
Right-Sizing Memory Requests
Requesting far more memory than the job actually uses works against you: fewer nodes can accommodate the job, so it waits longer in the queue.
# Check actual memory usage of completed job
sacct -j <job_id> --format=MaxRSS
# Next submission: request roughly MaxRSS × 1.2 (about a 20% safety margin)
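As a worked example: if sacct reports a MaxRSS of 26214400K (25 GB), a request of about 30 GB keeps the 20% margin:
#SBATCH --mem=30G                  # ≈25 GB peak usage × 1.2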
Checking Expected Start Time
sbatch --test-only myjob.sh # Show estimated start without actually submitting
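For a job that is already queued, squeue can report the scheduler's current estimate (it shifts as other jobs finish early or are cancelled):
squeue -j <job_id> --start         # Estimated start time of a pending job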
Troubleshooting: Why Is My Job PENDING?
squeue -j <job_id> -o "%R" # Show reason for pending
Common reasons:
| Reason | Meaning and Resolution |
|---|---|
| Resources | Insufficient resources currently free; request less or wait |
| Priority | Lower priority than competing jobs; wait or check your fairshare usage |
| QOSMaxJobsPerUser | Per-user job limit reached; wait for your running jobs to finish |
| ReqNodeNotAvail | Requested nodes are down or reserved (e.g., for maintenance); adjust the request or choose another partition |
| DependencyNeverSatisfied | Dependency can no longer be met; inspect the parent job and resubmit |
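When the reason is Priority, sprio breaks down how the job's priority is computed from its weighted factors (age, fairshare, partition, and so on):
sprio -j <job_id>                  # Priority factor breakdown for one pending job
sprio -l                           # Long listing for all pending jobs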
Memory Errors (OOM Killed)
sacct -j <job_id> --format=State,ExitCode,MaxRSS
# Exit code 137 (SIGKILL, 128+9) or State=OUT_OF_MEMORY usually means the job hit its memory limit → increase --mem
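Comparing the requested memory with the peak actually used confirms whether the limit was the problem:
sacct -j <job_id> --format=JobID,ReqMem,MaxRSS,State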
Mevasis HPC Training Services
Mevasis offers HPC training programs covering SLURM administration, job script optimization, and HPC software stack management. Programs are available for both cluster administrators and end users.
Frequently Asked Questions
Should I use SLURM or PBS/Torque? For new deployments, SLURM is strongly recommended. Active development, a large community, and solid cloud integration make it the de facto standard. Migrating existing PBS installations typically requires a 6–12 month project.
How do I submit a single command without a script?
Use sbatch --wrap="command" or srun command. For repeated work, job scripts are preferred for reproducibility and easier debugging.
What is the largest scale SLURM can handle? SLURM theoretically scales to 100,000+ nodes. Practical limits are typically network fabric and parallel filesystem bandwidth — MPI efficiency degrades when these saturate.
How do I run Jupyter Notebook on a compute node? Start a Jupyter Server on a compute node and use SSH port forwarding to access it from your workstation via the login node. The Open OnDemand platform provides a browser-based interface that automates this process.
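A typical manual setup looks like the sketch below; the host names, user name, and port are purely illustrative:
# On the compute node, inside an interactive job
jupyter notebook --no-browser --ip=0.0.0.0 --port=8888
# On your workstation: forward the port through the login node, then open http://localhost:8888
ssh -L 8888:node042:8888 user@login.cluster.edu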