Job Scheduler Technical Guide: SLURM Architecture, PBS, LSF, and Configuration
HPC job scheduler technical guide: SLURM architecture (slurmctld, slurmd, slurmdbd, MUNGE), PBS Pro and LSF comparison, prerequisites, partition and QOS design, fairshare configuration, common problems (pending jobs, drain nodes, MPI performance), and best practices.
An HPC job scheduler is the software layer that transforms a collection of independent compute nodes into a shared, multi-user computing service. Without a scheduler, cluster access degrades to manual coordination — which breaks down quickly beyond a handful of users. This guide covers the technical architecture of the three major schedulers and the key configuration decisions that determine how well a scheduler serves its user community.
SLURM Architecture
SLURM (Simple Linux Utility for Resource Management) is the most widely deployed HPC scheduler, present on over 60% of the Top500 supercomputers. Its architecture has four components:
slurmctld — The Controller
The central scheduling daemon. Runs on the management node (or a pair of nodes for HA). All scheduling decisions originate here:
- Accepts job submissions via sbatch, salloc, srun
- Maintains the state of all nodes and jobs
- Runs the backfill scheduler to maximize utilization
- Calls ResumeProgram/SuspendProgram for cloud bursting
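slurmctld is stateful: it holds the full cluster state in memory and periodically writes it to state files under StateSaveLocation. While the controller is down, no jobs can be submitted or scheduled, and if the management node is lost together with its state files, all queued jobs are lost with it. HA mode (active/backup controller with the state directory on shared storage) is essential for production.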
slurmd — The Node Daemon
Runs on every compute node. Receives job launch instructions from slurmctld, starts the user’s processes, monitors resource usage, enforces limits via cgroups, and reports node health back to the controller.
slurmdbd — The Database Daemon
The accounting daemon. Stores in a MySQL/MariaDB database:
- All submitted, running, completed, and failed jobs
- User accounts and group allocations
- Fairshare usage history
- QOS definitions
Without slurmdbd, fairshare is unavailable and job history is not retained across slurmctld restarts.
MUNGE — Authentication
MUNGE provides authentication tokens that allow SLURM daemons on different hosts to verify each other’s identity. A shared MUNGE key (generated once and distributed to all nodes) is the security foundation. If nodes have mismatched MUNGE keys, slurmd cannot communicate with slurmctld.
# Generate MUNGE key on controller
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chmod 400 /etc/munge/munge.key && chown munge:munge /etc/munge/munge.key
# Distribute to all compute nodes (example with parallel-ssh)
parallel-ssh -h /etc/cluster/nodes \
"scp controller:/etc/munge/munge.key /etc/munge/munge.key && \
chmod 400 /etc/munge/munge.key && \
chown munge:munge /etc/munge/munge.key && \
systemctl restart munge"
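# Verify MUNGE works locally
munge -n | unmunge # should decode successfully
# Verify a credential created here decodes on a compute node (proves both sides share the key)
munge -n | ssh cn01 unmunge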
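Because a single mismatched key is enough to cut a node off from the controller, it can help to compare key checksums across the cluster before restarting any daemons. The following is a minimal sketch, not something shipped with MUNGE or SLURM: munge_key_mismatches is a hypothetical helper that reads "host checksum" pairs on standard input and prints the hosts whose checksum differs from a reference value. The checksums in the sample run are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MUNGE or SLURM).
# Reads "host checksum" pairs on stdin and prints every host whose
# checksum differs from the reference checksum passed as $1.
munge_key_mismatches() {
    local reference="$1"
    awk -v ref="$reference" 'NF >= 2 && $2 != ref { print $1 }'
}

# Sample data standing in for checksums collected from real nodes.
printf '%s\n' \
    "cn01 aaa111" \
    "cn02 aaa111" \
    "cn03 bbb222" | munge_key_mismatches "aaa111"   # prints cn03
```

On a real cluster the input could be produced by running sha256sum /etc/munge/munge.key over ssh on each node as root and printing the node name next to the first field of the result.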
PBS Pro and OpenPBS
PBS (Portable Batch System) has a longer history than SLURM, with deployments dating to the early 1990s. Altair’s commercial product PBS Pro remains in wide use, particularly in aerospace, automotive, and EDA/semiconductor industries.
Key PBS concepts that differ from SLURM:
| SLURM Concept | PBS Equivalent |
|---|---|
| Partition | Queue |
| sbatch | qsub |
| squeue | qstat |
| scancel | qdel |
| Node (resource) | Node or chunk |
| sinfo | pbsnodes -a |
| sacctmgr | pbs_account |
PBS’s selection syntax is more expressive for complex resource requests:
#!/bin/bash
# PBS job script example
#PBS -N my_simulation
#PBS -q gpu_queue
#PBS -l select=4:ncpus=64:mem=256gb:ngpus=4
#PBS -l walltime=12:00:00
cd "$PBS_O_WORKDIR"   # directory the job was submitted from
mpirun ./simulation
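In a select statement, the leading number is the count of chunks and the rest describes what each chunk needs: select=4:ncpus=64:mem=256gb:ngpus=4 requests 4 chunks, each with 64 cores, 256 GB of memory, and 4 GPUs. Chunk types with different requirements can be joined with +, which is useful for heterogeneous jobs.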
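To make the chunk arithmetic concrete, the sketch below totals one resource across all chunks of a select statement. pbs_select_total is a hypothetical helper written for this guide, not a PBS command, and it assumes plain integer resources such as ncpus or ngpus; values with units (mem=256gb) are out of scope. It also assumes the PBS Pro convention that several chunk types are joined with "+".

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum one resource over every chunk of a PBS select
# statement. Usage: pbs_select_total RESOURCE SELECT_SPEC
# Only plain integer resources (ncpus, ngpus, mpiprocs) are handled.
pbs_select_total() {
    local resource="$1" spec="${2#select=}"
    awk -v res="$resource" -v spec="$spec" 'BEGIN {
        total = 0
        nchunks = split(spec, chunks, "[+]")      # "+" joins chunk types
        for (i = 1; i <= nchunks; i++) {
            nfields = split(chunks[i], fields, ":")
            count = 1; first = 1
            # A leading bare integer is the chunk count; default is 1.
            if (fields[1] ~ /^[0-9]+$/) { count = fields[1]; first = 2 }
            for (j = first; j <= nfields; j++) {
                split(fields[j], kv, "=")
                if (kv[1] == res) total += count * kv[2]
            }
        }
        print total
    }'
}

pbs_select_total ncpus "select=4:ncpus=64:mem=256gb:ngpus=4"    # 4 x 64 = 256
pbs_select_total ngpus "select=2:ncpus=64+1:ncpus=16:ngpus=4"   # 2 x 0 + 1 x 4 = 4
```

The second call shows a heterogeneous request: two CPU-only chunks plus one smaller chunk that carries the GPUs.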
IBM Spectrum LSF
LSF (Load Sharing Facility) is the dominant scheduler in financial services HPC and large enterprise environments. Key differentiators:
- License-aware scheduling: LSF can monitor FlexLM license servers and hold jobs that would exceed license count limits
- Fairshare at department/project level: Hierarchical fairshare with multiple levels of quota
- Resource borrowing: Unused quota from one group can be temporarily lent to another
- Advanced features: Application-level policies, SLA enforcement, job requeue with backfill
LSF is a commercial product with mandatory per-socket licensing that makes it significantly more expensive than SLURM for large clusters.
Prerequisites for SLURM Deployment
# All nodes must have:
# 1. Synchronized time (NTP/Chrony)
timedatectl set-ntp true
chronyc tracking # verify offset < 1 second
# 2. Consistent user/group IDs (from LDAP or /etc/passwd sync)
id slurm # verify same UID on all nodes
# 3. Shared filesystem for the slurmctld state directory
# (controller nodes only; NFS or other shared storage so a backup controller can take over)
mount -t nfs nfs-server:/slurm-state /var/spool/slurmctld
# 4. MUNGE running on all nodes
systemctl is-active munge
# 5. Firewall rules allowing SLURM ports
# slurmctld: 6817, slurmd: 6818, slurmdbd: 6819
firewall-cmd --add-port=6817-6819/tcp --permanent && firewall-cmd --reload
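The time check above is easy to automate across many nodes. As a rough sketch, the hypothetical helper clock_offset_ok below reads the output of chronyc tracking on standard input and succeeds only when the reported "System time" offset is under a limit (1 second by default). The sample input imitates two lines of chronyc tracking output; the values are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when the "System time" offset in
# `chronyc tracking` output is below the limit in seconds (default 1).
# Fails when no "System time" line is present at all.
clock_offset_ok() {
    local limit="${1:-1}"
    awk -v limit="$limit" '
        /^System time/ { offset = $4; found = 1 }   # 4th field is the offset
        END { exit !(found && offset + 0 < limit + 0) }
    '
}

# Sample input standing in for real `chronyc tracking` output.
printf '%s\n' \
    "Reference ID    : C0A80101 (ntp1)" \
    "System time     : 0.000012345 seconds fast of NTP time" \
    | clock_offset_ok && echo "clock offset within limit"
```

On a node, the same check would be run as chronyc tracking | clock_offset_ok, optionally with a stricter limit as the first argument.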
Partition and QOS Design
# /etc/slurm/slurm.conf — production example
ClusterName=hpc-prod
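# Controllers: the first SlurmctldHost is the primary, the second is the HA backup
# (SlurmctldHost replaces the deprecated ControlMachine/ControlAddr/BackupController)
SlurmctldHost=mgmt01(10.0.1.10)
SlurmctldHost=mgmt02
StateSaveLocation=/var/spool/slurmctld   # must be on storage shared by both controllers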
# Authentication
AuthType=auth/munge
CryptoType=crypto/munge
# Resource isolation via cgroups
TaskPlugin=task/cgroup,task/affinity
ProctrackType=proctrack/cgroup
# Scheduling
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_test=2000,bf_resolution=300
# Priority
PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PriorityWeightAge=1000
PriorityDecayHalfLife=14-0
# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=mgmt01
AccountingStorageEnforce=associations,limits,qos   # without this, QOS limits are not enforced
JobAcctGatherType=jobacct_gather/cgroup
# Node definitions (GresTypes must list every GRES name used on a node line)
GresTypes=gpu
NodeName=cn[01-64] CPUs=128 RealMemory=512000 State=UNKNOWN
NodeName=gpu[01-08] CPUs=64 RealMemory=512000 Gres=gpu:h100:8 State=UNKNOWN
# Partitions
PartitionName=compute Nodes=cn[01-64] Default=YES MaxTime=7-00:00:00 State=UP
PartitionName=gpu Nodes=gpu[01-08] Default=NO MaxTime=48:00:00 State=UP
PartitionName=debug Nodes=cn[01-02] Default=NO MaxTime=00:30:00 MaxNodes=2 State=UP
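With priority/multifactor, a job's priority is a weighted sum of factors that each range from 0 to 1, and recorded usage decays with the configured half-life. The sketch below works through that arithmetic for the weights in the example above (fairshare 10000, age 1000, half-life 14 days). It is a simplified illustration, not SLURM's implementation: job_priority and decayed_usage are hypothetical helpers, only the two configured factors are included, and the factor values passed in are made up.

```shell
#!/usr/bin/env bash
# Simplified illustration of multifactor priority (not SLURM's own code).
# job_priority FAIRSHARE_FACTOR AGE_FACTOR   (both factors in the range 0..1)
# Weights mirror PriorityWeightFairshare=10000 and PriorityWeightAge=1000.
job_priority() {
    awk -v fs="$1" -v age="$2" -v w_fs=10000 -v w_age=1000 \
        'BEGIN { printf "%.0f\n", w_fs * fs + w_age * age }'
}

# decayed_usage USAGE DAYS [HALF_LIFE_DAYS]
# Usage halves every HALF_LIFE_DAYS (default 14, as in PriorityDecayHalfLife=14-0).
decayed_usage() {
    awk -v usage="$1" -v days="$2" -v half="${3:-14}" \
        'BEGIN { printf "%.0f\n", usage * 0.5 ^ (days / half) }'
}

job_priority 0.50 0.25    # 10000*0.50 + 1000*0.25 = 5250
decayed_usage 1000 14     # one half-life:  500
decayed_usage 1000 28     # two half-lives: 250
```

With these weights, fairshare dominates: a job from a lightly used account outranks an older job from a heavily used one unless the fairshare factors are within about a tenth of each other.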
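# QOS configuration via sacctmgr
# "normal" is created automatically by slurmdbd, so modify it instead of adding it
sacctmgr modify qos normal set Priority=10 MaxWall=7-00:00:00 MaxTRESPerUser=cpu=1024
# Preempt=normal only takes effect with PreemptType=preempt/qos in slurm.conf
sacctmgr add qos high Priority=50 MaxWall=12:00:00 MaxTRESPerUser=cpu=512 Preempt=normal
sacctmgr add account all_users Fairshare=1000
sacctmgr add user alice Account=all_users QOS=normal,high DefaultQOS=normal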
Common Problems
Jobs Pending Indefinitely
# Diagnose why a job is pending
squeue -j 12345 --start
scontrol show job 12345 | grep -E "Reason|ReqNodeList|NumNodes|MinMemory"
# Most common PENDING reasons:
# Resources: Not enough free nodes right now → wait
# Priority: Resources available but a higher-priority job is in line
# QOSMaxCpuPerUserLimit: User hit their CPU cap → reduce parallel jobs
# ReqNodeNotAvail: Specific nodes requested are drained/down
# Check node availability
sinfo --state=idle --format="%N %C"
# %C = CPUs as allocated/idle/other/total
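When a job is pending on Resources, the quickest sanity check is to compare its CPU request against the idle CPUs reported by sinfo. As a small sketch, the hypothetical helper sum_idle_cpus below adds up the idle column of the allocated/idle/other/total field. The sample lines imitate sinfo output and the node names and counts are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total the idle CPUs from `sinfo --format="%N %C"`.
# The last field of each data line is allocated/idle/other/total.
sum_idle_cpus() {
    awk '
        # Only lines whose last field looks like A/I/O/T; this skips the header.
        $NF ~ /^[0-9]+\/[0-9]+\/[0-9]+\/[0-9]+$/ {
            split($NF, c, "/")
            idle += c[2]
        }
        END { print idle + 0 }
    '
}

# Sample input standing in for real sinfo output.
printf '%s\n' \
    "NODELIST CPUS(A/I/O/T)" \
    "cn[03-10] 0/1024/0/1024" \
    "gpu[01-02] 64/64/0/128" | sum_idle_cpus   # 1024 + 64 = 1088
```

Dropping the --state=idle filter when feeding this helper also counts free CPUs on partially allocated nodes, which matters when jobs do not need whole nodes.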
Drained and Down Nodes
# See all non-healthy nodes with reasons (%E prints the reason; %R would print the partition name)
sinfo --state=drain,down,fail --format="%N %T %E"
# Common drain reasons and remedies:
# "kill task failed" → slurmd couldn't clean up after last job; check /proc
# "Low socket buffer" → kernel network buffer exhausted; check sysctl
# "slurmd failure" → slurmd crashed; restart with systemctl restart slurmd
# "Not responding" → node unreachable; check network and IPMI
# Resume a node after investigation/fix
scontrol update NodeName=cn05 State=resume
# Drain a node for planned maintenance
scontrol update NodeName=cn05 State=drain Reason="Scheduled maintenance"
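The reason-to-remedy list above can be turned into a first-pass triage helper for on-call staff. This is only a sketch: drain_remedy is a hypothetical function, the suggested actions simply restate the list above, and any reason it does not recognize is passed back for manual investigation.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper: map a drain/down reason string to the
# first action to try. Matching is case-insensitive on a substring.
drain_remedy() {
    local reason
    reason=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$reason" in
        *"kill task failed"*)  echo "check /proc for processes left by the last job" ;;
        *"low socket buffer"*) echo "check kernel network buffer sysctl settings" ;;
        *"slurmd failure"*)    echo "restart slurmd with systemctl restart slurmd" ;;
        *"not responding"*)    echo "check network and IPMI" ;;
        *)                     echo "investigate manually" ;;
    esac
}

drain_remedy "Kill task failed"
drain_remedy "Not responding"
```

The reason text for a given node can be taken from the sinfo output and passed in as a single quoted argument.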
MPI Performance Lower Than Expected on SLURM
# Common cause: MPI processes binding to wrong CPU cores or using wrong network
# Check binding with verbose srun
srun --cpu-bind=verbose --ntasks=64 ./my_mpi_job 2>&1 | head -10
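# Force InfiniBand (not TCP) for MPI
# (applies to Open MPI 4.x and earlier; the openib BTL was removed in Open MPI 5, where UCX handles InfiniBand)
export OMPI_MCA_btl=^tcp
export OMPI_MCA_btl_openib_allow_ib=1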
# Check InfiniBand usage during job
mpirun -np 64 --mca pml ob1 --mca btl openib,self --mca btl_base_verbose 30 \
./my_mpi_job 2>&1 | grep -c "openib"
# Run IMB-MPI1 PingPong within SLURM to baseline
sbatch --nodes=2 --ntasks-per-node=1 --wrap="mpirun ./IMB-MPI1 PingPong"
Best Practices
- Enable cgroup integration from day one. Without ConstrainRAMSpace=yes in cgroup.conf, a job that requests 64 GB but allocates 256 GB can exhaust the node's memory, crashing the node and killing other users’ jobs.
- Deploy slurmdbd before first production use. Retroactively importing historical accounting data is painful; starting with accounting from day one is free.
- Test HA failover regularly. SLURM’s HA controller failover is not automatic in all versions — test the manual procedure quarterly so that when it happens in production, it is not the first time.
- Document partition rationale. When there are 6 partitions, new sysadmins need to understand why each exists. Write it down.