SLURM (Simple Linux Utility for Resource Management) is the workload manager running on the majority of HPC clusters worldwide — more than 60% of TOP500 systems depend on it. This guide covers the essential commands, job script writing, partition management, and troubleshooting from beginner to intermediate level.
How SLURM Works: Core Architecture
SLURM consists of three primary components:
- slurmctld (Controller Daemon): Makes central scheduling decisions, manages job queues
- slurmd (Node Daemon): Runs on each compute node, launches and monitors jobs
- slurmdbd (Database Daemon): Stores accounting data, job history, and usage records
When you submit a job with sbatch, it is handed to slurmctld, which queues it and reserves appropriate nodes; slurmd on those nodes then launches and monitors the job.
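A quick sanity check from any login node: scontrol ping reports whether the controller daemon is responding (no cluster-specific assumptions needed).
scontrol ping                      # Reports whether the primary (and backup) slurmctld responds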
Command Reference
Job Submission
# Submit a batch script
sbatch myjob.sh
# Inline without a script
sbatch --ntasks=8 --time=02:00:00 --mem=32G --wrap="python train.py"
# Interactive session
srun --ntasks=4 --time=01:00:00 --pty bash
# Interactive session with GPU allocation
srun --partition=gpu --gres=gpu:h100:2 --ntasks=1 --pty bash
Job Status Monitoring
squeue # All queued jobs
squeue -u $USER # Your own jobs
squeue -j <job_id> # Specific job
squeue -p gpu # Jobs in a specific partition
squeue --format="%i %j %T %R" # Custom format: ID, name, state, reason
scontrol show job <job_id> # Detailed job information
sinfo # Cluster node status
sinfo -p compute # Specific partition status
Cancellation and Control
scancel <job_id> # Cancel specific job
scancel -u $USER # Cancel all your jobs
scancel -u $USER -t PENDING # Cancel only pending jobs
scontrol hold <job_id> # Hold a job
scontrol release <job_id> # Release the hold
Accounting and Reporting
sacct -j <job_id>
sacct -j <job_id> --format=JobID,JobName,State,CPUTime,MaxRSS,Elapsed
sacct -u $USER --starttime=2026-01-01 --format=JobID,State,Elapsed
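Many clusters also install the contributed seff utility; if your site provides it, it condenses the same accounting data into a CPU- and memory-efficiency summary for a finished job:
seff <job_id>                      # CPU/memory efficiency summary (availability depends on your site)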
Writing Job Scripts: Core Structure
SLURM job scripts use #SBATCH directives to specify resources. Here is a baseline template:
#!/bin/bash
#SBATCH --job-name=my_simulation # Job name
#SBATCH --output=logs/%j.out # Stdout: %j = job ID
#SBATCH --error=logs/%j.err # Stderr
#SBATCH --partition=compute # Target partition
#SBATCH --nodes=4 # Number of nodes
#SBATCH --ntasks-per-node=64 # Tasks per node
#SBATCH --cpus-per-task=1 # CPUs per task
#SBATCH --mem=256G # Memory per node
#SBATCH --time=12:00:00 # Max wall time (HH:MM:SS)
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@institution.edu
# Environment setup
module load openmpi/4.1.5 gcc/12.3
echo "Job ID: $SLURM_JOB_ID"
echo "Nodes: $SLURM_JOB_NODELIST"
echo "Start: $(date)"
srun ./my_mpi_program input.dat
echo "End: $(date)"
GPU Job Scripts
GPU workloads require additional directives:
#!/bin/bash
#SBATCH --job-name=deep_learning
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1 # One launcher task per node; torchrun spawns one process per GPU
#SBATCH --gres=gpu:h100:4 # 4× H100 per node
#SBATCH --cpus-per-task=64 # 16 CPUs per GPU × 4 GPUs
#SBATCH --mem=320G
#SBATCH --time=48:00:00
module load cuda/12.3 pytorch/2.2
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=12355
# bash -c defers expansion so each node's launcher task sees its own SLURM_NODEID
srun bash -c 'python -m torch.distributed.run \
    --nproc_per_node=4 \
    --nnodes=$SLURM_NNODES \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train.py --epochs 100 --batch-size 256'
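Before launching training, it can help to confirm that the allocation looks right. A minimal check, run as one task per node, lists the GPUs SLURM actually granted:
srun --ntasks-per-node=1 nvidia-smi -L   # Should list the four allocated H100s on each node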
Array Jobs: Parametric Studies
Job arrays dramatically improve efficiency for parametric sweeps:
#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --array=1-100%10 # 100 parameter sets; %10 caps concurrent tasks at 10
#SBATCH --ntasks=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00
PARAM=$(sed -n "${SLURM_ARRAY_TASK_ID}p" parameters.txt)
srun ./simulation --param "$PARAM" --output results/run_${SLURM_ARRAY_TASK_ID}.dat
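Here, parameters.txt simply holds one parameter value per line, and SLURM_ARRAY_TASK_ID selects line N for array task N. The contents below are purely illustrative:
# parameters.txt — one value per line (lines 1–100)
0.001
0.005
0.010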
Job arrays significantly reduce scheduler overhead compared to submitting individual jobs.
Job Dependencies: Workflow Chaining
# Stage 1: preprocessing
jid1=$(sbatch --parsable preprocess.sh)
# Stage 2: run only if stage 1 succeeded
jid2=$(sbatch --parsable --dependency=afterok:$jid1 simulate.sh)
# Stage 3: analysis
sbatch --dependency=afterok:$jid2 analyze.sh
Dependency types:
| Type | Meaning |
|---|---|
| afterok | Begin after the specified job completes successfully |
| afternotok | Begin after the specified job terminates in a failed state |
| afterany | Begin after the specified job ends, regardless of state |
| after | Begin after the specified job starts execution |
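For example, a cleanup or notification step that should run however the simulation ends can hang off afterany (cleanup.sh here is a hypothetical script):
sbatch --dependency=afterany:$jid2 cleanup.sh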
Partitions and QOS
Cluster administrators define partitions for different job types:
sinfo -o "%P %D %C %l" # Partition name, node count, CPU state, max time
Typical partition layout:
| Partition | Purpose | Max Time | Priority |
|---|---|---|---|
| debug | Short test jobs | 1 hour | High |
| compute | General CPU workloads | 72 hours | Normal |
| gpu | GPU computation | 48 hours | Normal |
| bigmem | High-memory jobs | 24 hours | Normal |
| long | Extended simulations | 7 days | Low |
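QOS (Quality of Service) levels layer additional limits and priorities on top of partitions. If the accounting database is configured, you can list them with sacctmgr and request one explicitly; the QOS name debug below is illustrative:
sacctmgr show qos format=Name,Priority,MaxWall,MaxTRESPU   # List configured QOS levels
#SBATCH --qos=debug                                        # Request a QOS in a job script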
Resource Optimization
Right-Sizing Memory Requests
Requesting far more memory than the job actually uses works against you: fewer nodes can accommodate the job, so it waits longer in the queue.
# Check actual memory usage of completed job
sacct -j <job_id> --format=MaxRSS
# Next submission: request roughly MaxRSS × 1.2 (about a 20% safety margin)
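As a worked example: if sacct reports a MaxRSS of 26214400K (25 GB), a request of about 30 GB keeps the 20% margin:
#SBATCH --mem=30G                  # ≈25 GB peak usage × 1.2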
Checking Expected Start Time
sbatch --test-only myjob.sh # Show estimated start without actually submitting
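For a job that is already queued, squeue can report the scheduler's current estimate (it shifts as other jobs finish early or are cancelled):
squeue -j <job_id> --start         # Estimated start time of a pending job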
Troubleshooting: Why Is My Job PENDING?
squeue -j <job_id> -o "%R" # Show reason for pending
Common reasons:
| Reason | Meaning and Resolution |
|---|---|
| Resources | Insufficient resources currently free; request less or wait |
| Priority | Lower priority than competing jobs; wait or check your fairshare usage |
| QOSMaxJobsPerUser | Per-user job limit reached; wait for your running jobs to finish |
| ReqNodeNotAvail | Requested nodes are down or reserved (e.g., for maintenance); adjust the request or choose another partition |
| DependencyNeverSatisfied | Dependency can no longer be met; inspect the parent job and resubmit |
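When the reason is Priority, sprio breaks down how the job's priority is computed from its weighted factors (age, fairshare, partition, and so on):
sprio -j <job_id>                  # Priority factor breakdown for one pending job
sprio -l                           # Long listing for all pending jobs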
Memory Errors (OOM Killed)
sacct -j <job_id> --format=State,ExitCode,MaxRSS
# Exit code 137 (SIGKILL, 128+9) or State=OUT_OF_MEMORY usually means the job hit its memory limit → increase --mem
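Comparing the requested memory with the peak actually used confirms whether the limit was the problem:
sacct -j <job_id> --format=JobID,ReqMem,MaxRSS,State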
Mevasis HPC Training Services
Mevasis offers HPC training programs covering SLURM administration, job script optimization, and HPC software stack management. Programs are available for both cluster administrators and end users.
Frequently Asked Questions
Should I use SLURM or PBS/Torque? For new deployments, SLURM is strongly recommended. Active development, a large community, and solid cloud integration make it the de facto standard. Migrating existing PBS installations typically requires a 6–12 month project.
How do I submit a single command without a script?
Use sbatch --wrap="command" or srun command. For repeated work, job scripts are preferred for reproducibility and easier debugging.
What is the largest scale SLURM can handle? SLURM theoretically scales to 100,000+ nodes. Practical limits are typically network fabric and parallel filesystem bandwidth — MPI efficiency degrades when these saturate.
How do I run Jupyter Notebook on a compute node? Start a Jupyter Server on a compute node and use SSH port forwarding to access it from your workstation via the login node. The Open OnDemand platform provides a browser-based interface that automates this process.
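A typical manual setup looks like the sketch below; the host names, user name, and port are purely illustrative:
# On the compute node, inside an interactive job
jupyter notebook --no-browser --ip=0.0.0.0 --port=8888
# On your workstation: forward the port through the login node, then open http://localhost:8888
ssh -L 8888:node042:8888 user@login.cluster.edu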