SLURM Job Queue Management: Partition, QOS, Fairshare Guide

SLURM’s job queue management is the mechanism through which an HPC cluster allocates shared resources fairly and efficiently among competing users and projects. Getting it right requires understanding three interacting systems: partitions (resource pools with associated policies), QOS (per-user/account constraints and privileges), and the fairshare priority mechanism.

SLURM Architecture Overview

SLURM has three primary daemons:

slurmctld: The controller, runs on the management node. Makes all scheduling decisions.
slurmd: Runs on every compute node. Executes jobs assigned by slurmctld.
slurmdbd: The accounting daemon. Stores job history, user accounts, fairshare data.

All communicate via MUNGE-authenticated connections. Without slurmdbd, fairshare and QOS accounting are unavailable — it is not optional for production clusters.

Partition Design

Partitions are SLURM’s primary mechanism for segmenting resources with different policies. Good partition design matches the cluster’s actual workload profiles:

# /etc/slurm/slurm.conf — partition definitions

# Small, fast-turnaround queue for debugging (two nodes, short time limit)
# cn[01-02] also belong to compute below, so they are shared rather than reserved
PartitionName=debug \
  Nodes=cn[01-02] \
  MaxTime=00:30:00 \
  MaxNodes=2 \
  MaxCPUsPerNode=128 \
  State=UP \
  Priority=200 \
  Default=NO

# Default for most users: medium batch jobs
PartitionName=compute \
  Nodes=cn[01-64] \
  MaxTime=7-00:00:00 \
  DefMemPerCPU=4096 \
  State=UP \
  Priority=50 \
  Default=YES

# Dedicated GPU partition
PartitionName=gpu \
  Nodes=gpu[01-08] \
  MaxTime=2-00:00:00 \
  DefMemPerCPU=8192 \
  State=UP \
  Priority=50 \
  Default=NO

# High-memory nodes for genomics/large FEM
PartitionName=highmem \
  Nodes=bigmem[01-04] \
  MaxTime=3-00:00:00 \
  DefMemPerCPU=32768 \
  State=UP \
  Priority=50 \
  Default=NO

# Low-priority, preemptable — uses all idle capacity
PartitionName=preemptable \
  Nodes=cn[01-64],gpu[01-08] \
  MaxTime=UNLIMITED \
  State=UP \
  Priority=1 \
  PreemptMode=REQUEUE \
  Default=NO

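Jobs in the preemptable partition soak up idle capacity that would otherwise sit unused, but they can be evicted when jobs in higher-priority partitions need those resources. With PreemptMode=REQUEUE, evicted jobs re-enter the queue rather than being cancelled. Partition-based preemption only takes effect when slurm.conf also sets PreemptType=preempt/partition_prio. A cluster runs a single PreemptType, so this cannot be combined with QOS-based preemption (PreemptType=preempt/qos).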

Partition design principles:

Fewer partitions is better — complexity accumulates. Start with 3–4, add only when justified.
Short-time partitions improve scheduling efficiency via backfill.
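Avoid mixing incompatible hardware (different CPU generations, GPU and non-GPU nodes) in a general-purpose partition. A scavenger partition such as preemptable above is the usual exception, since its jobs take whatever is idle.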

QOS: Fine-Grained Resource Control

Quality of Service (QOS) adds per-user, per-account constraints on top of partition-level limits. A user can be assigned multiple QOS options; they select which to use at submission time.

# Create QOS levels with sacctmgr

# Standard QOS: default limits for most users
# (GPU limits are TRES limits; tracking them assumes AccountingStorageTRES=gres/gpu in slurm.conf)
sacctmgr add qos standard \
  Priority=10 \
  MaxWall=7-00:00:00 \
  MaxTRESPerUser=cpu=512,gres/gpu=4 \
  MaxJobsPerUser=20

# Burst QOS: short high-priority jobs, preempts standard
# (QOS preemption only takes effect when slurm.conf sets PreemptType=preempt/qos)
sacctmgr add qos burst \
  Priority=50 \
  MaxWall=04:00:00 \
  MaxCPUsPerUser=1024 \
  MaxJobsPerUser=5 \
  Preempt=standard \
  PreemptMode=requeue

# Background QOS: long-running low-priority jobs
sacctmgr add qos background \
  Priority=1 \
  MaxWall=30-00:00:00 \
  MaxCPUsPerUser=2048 \
  GraceTime=00:30:00   # 30-min warning before preemption

# Assign QOS to accounts
sacctmgr modify account researchgroup set QOS=standard,burst,background DefaultQOS=standard
sacctmgr modify account collab_project set QOS=standard DefaultQOS=standard

To submit with a specific QOS:

sbatch --qos=burst --time=02:00:00 --nodes=4 heavy_analysis.sh
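
A submission is only eligible under a QOS if its requested --time fits within that QOS's MaxWall. The sketch below checks this before submitting. The helper names slurm_time_to_seconds and time_fits_limit are hypothetical (they are not SLURM commands), and the parser assumes the usual SLURM time formats without handling UNLIMITED.

```shell
# Hypothetical helper (not part of SLURM): convert a SLURM time string to seconds.
# Handles M, M:S, H:M:S, D-H, D-H:M and D-H:M:S; does not handle UNLIMITED.
slurm_time_to_seconds() {
  echo "$1" | awk '{
    days = 0; h = 0; m = 0; s = 0
    rest = $0
    if (index(rest, "-") > 0) {
      # "days-" prefix: remaining fields are hours[:minutes[:seconds]]
      split(rest, d, "-"); days = d[1]; rest = d[2]
      n = split(rest, f, ":")
      h = f[1]
      if (n >= 2) m = f[2]
      if (n >= 3) s = f[3]
    } else {
      # No day prefix: minutes, minutes:seconds, or hours:minutes:seconds
      n = split(rest, f, ":")
      if (n == 3)      { h = f[1]; m = f[2]; s = f[3] }
      else if (n == 2) { m = f[1]; s = f[2] }
      else             { m = f[1] }
    }
    print days * 86400 + h * 3600 + m * 60 + s
  }'
}

# Succeeds when the requested time ($1) is no longer than the limit ($2).
time_fits_limit() {
  [ "$(slurm_time_to_seconds "$1")" -le "$(slurm_time_to_seconds "$2")" ]
}

# The sbatch example requests 02:00:00 against the burst MaxWall of 04:00:00
if time_fits_limit 02:00:00 04:00:00; then
  echo "request fits within MaxWall"
fi
```

A request of 05:00:00 would fail the same check, and SLURM would typically leave such a job pending or reject it, depending on how limit enforcement is configured.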

Multi-Factor Priority System

SLURM’s priority calculation combines several factors with configurable weights:

# slurm.conf — priority configuration
PriorityType=priority/multifactor
PriorityWeightFairshare=10000   # dominant factor
PriorityWeightAge=1000
PriorityWeightJobSize=100
PriorityWeightQOS=5000
PriorityWeightPartition=1000

# Fairshare decay: recorded usage loses half its weight every PriorityDecayHalfLife
PriorityDecayHalfLife=14-0      # 14-day half-life
PriorityMaxAge=7-0              # age factor saturates after 7 days
PriorityUsageResetPeriod=NONE   # don't reset usage data on schedule
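
The half-life setting can be made concrete with a little arithmetic. Assuming simple exponential decay, usage recorded t days ago counts with weight 0.5^(t / half-life). The helper name decay_weight is hypothetical; it only illustrates the 14-day setting and does not query SLURM.

```shell
# Hypothetical helper: weight of usage that is $1 days old, given a half-life of $2 days.
decay_weight() {
  awk -v age="$1" -v hl="$2" 'BEGIN { printf "%.3f\n", 0.5 ^ (age / hl) }'
}

decay_weight 0 14    # 1.000 (usage from right now counts in full)
decay_weight 7 14    # 0.707
decay_weight 14 14   # 0.500 (one half-life ago)
decay_weight 28 14   # 0.250 (two half-lives ago)
```

A shorter half-life forgives heavy usage sooner, while a longer one makes past consumption weigh on priority for longer.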

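Priority formula (simplified; each score is a normalized factor between 0.0 and 1.0, and the association, TRES, site, and nice terms are omitted):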

Job Priority = (PriorityWeightFairshare × Fairshare Score)
             + (PriorityWeightAge × Age Score)
             + (PriorityWeightJobSize × Job Size Score)
             + (PriorityWeightQOS × QOS Score)
             + (PriorityWeightPartition × Partition Score)

The fairshare weight is set highest (10000) in this example, making historical resource usage the dominant priority factor. A user or account that has consumed less than their fair share has a high fairshare score and thus higher job priority.
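
A worked example shows how the weights interact. The sketch below plugs the weights from the configuration above into the formula. The factor values passed in are illustrative assumptions, not output from a real cluster, and job_priority is a hypothetical helper name.

```shell
# Hypothetical helper: combine normalized factors (0.0 to 1.0) with the example weights.
# Arguments: fairshare age jobsize qos partition
job_priority() {
  awk -v fair="$1" -v age="$2" -v size="$3" -v qos="$4" -v part="$5" 'BEGIN {
    printf "%.0f\n", 10000 * fair + 1000 * age + 100 * size + 5000 * qos + 1000 * part
  }'
}

# Under-served user: fairshare 0.8, half of max age, small job, modest QOS and partition factors
job_priority 0.8 0.5 0.1 0.2 0.25   # prints 9760

# Same job from a heavy user whose fairshare factor has dropped to 0.1
job_priority 0.1 0.5 0.1 0.2 0.25   # prints 2760
```

With these weights, the fairshare difference alone accounts for a gap of 7000 priority points, far more than the age factor could ever contribute (at most 1000).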

# Show current priority breakdown for pending jobs
sprio -l

# Show fairshare status for all accounts
sshare -a -l

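# Key columns (Fair Tree, the default algorithm in current releases):
# LevelFS: ratio of an association's normalized shares to its usage
#   Value < 1: over-served (lower priority)
#   Value > 1: under-served (higher priority)
# FairShare: resulting factor between 0.0 and 1.0; higher means higher priority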

Fairshare Mechanism

Fairshare ensures that no single user or group monopolizes cluster resources indefinitely. The mechanism works through the accounting database:

# Set up account hierarchy with fairshare weights
# (weights determine proportional share, not absolute allocation)
sacctmgr add account physics_dept Description="Physics Dept" Fairshare=40
sacctmgr add account chemistry_dept Description="Chemistry Dept" Fairshare=35
sacctmgr add account engineering_dept Description="Engineering Dept" Fairshare=25

# Add users to accounts
sacctmgr add user alice Account=physics_dept DefaultAccount=physics_dept
sacctmgr add user bob Account=chemistry_dept DefaultAccount=chemistry_dept

# Check effective fairshare values (computed from usage history)
sshare -A physics_dept,chemistry_dept,engineering_dept -l

A department with Fairshare=40 that has used only 20% of its entitled share is under-served and has a high fairshare score. Its pending jobs gain priority until its usage catches up. A department that consistently over-uses its share sees its priority drop, which lets the others catch up.
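
To put numbers on this, the sketch below computes a simplified Level FS value, the ratio of normalized shares to normalized usage that the Fair Tree algorithm reports in sshare. The usage fractions are illustrative assumptions, level_fs is a hypothetical helper name, and the calculation ignores usage decay and the user level of the hierarchy. It also assumes usage is nonzero.

```shell
# Hypothetical helper: Level FS = (shares / total shares) / fraction of recent usage.
# Arguments: raw_shares total_shares usage_fraction
level_fs() {
  awk -v shares="$1" -v total="$2" -v usage="$3" 'BEGIN {
    printf "%.2f\n", (shares / total) / usage
  }'
}

# Three departments with 40/35/25 shares and assumed usage of 8%, 50% and 42%
level_fs 40 100 0.08   # physics:     5.00 (under-served)
level_fs 35 100 0.50   # chemistry:   0.70 (over-served)
level_fs 25 100 0.42   # engineering: 0.60 (over-served)
```

Here physics has 40% of the shares but only 8% of recent usage, which is 20% of its entitled share, so its jobs are ranked ahead of the other two departments.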

User Job Submission Examples

From the user’s perspective, SLURM interaction is through job scripts:

#!/bin/bash
#SBATCH --job-name=protein_fold
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=12:00:00
#SBATCH --qos=standard
#SBATCH --output=protein_%j.out
#SBATCH --error=protein_%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=alice@example.edu

module load cuda/12.2 openmpi/4.1.6

# Run AlphaFold2 as a single task on one node with 4 GPUs
# (stays within the 4-GPU per-user limit of the standard QOS).
# run_alphafold.py is not MPI-aware, so srun must launch exactly one copy.
# Other required flags (--data_dir, --max_template_date, database paths) are omitted here.
srun python /opt/alphafold/run_alphafold.py \
  --fasta_paths=/scratch/alice/sequences.fasta \
  --output_dir=/project/alice/results/ \
  --model_preset=multimer

# Useful job management commands
sbatch job.sh              # submit a job
squeue -u $USER            # list my jobs
squeue -p gpu              # list all GPU jobs
scancel 12345              # cancel job 12345
scontrol hold 12345        # hold job (prevent scheduling)
scontrol release 12345     # release held job
scontrol show job 12345    # detailed job information
seff 12345                 # efficiency report (after completion)
sacct -j 12345             # accounting record

Common Queue Problems

Jobs pending with reason “QOSMaxCpuPerUserLimit”:

# User has hit the per-user CPU limit of their QOS
sacctmgr show qos where name=standard format=Name,MaxTRESPerUser
# Raise the limit or ask the user to wait for running jobs to complete
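
Before raising a limit, it helps to see how many CPUs the user's running jobs already hold. A minimal sketch follows, assuming squeue's %C field (CPU count per job) is summed. The helper name sum_cpus is hypothetical, and the numbers piped in are made up for illustration.

```shell
# Hypothetical helper: add up CPU counts, one per line on stdin.
sum_cpus() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Real usage (requires a SLURM cluster):
#   squeue -u alice -t RUNNING -h -o "%C" | sum_cpus
# Illustration with made-up CPU counts for three running jobs:
printf '128\n256\n64\n' | sum_cpus   # prints 448
```

Against a per-user limit of 512 CPUs, 448 in use leaves room for 64 more, so a new 128-CPU job would stay pending until one of the running jobs finishes.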

Priority inversion: large important jobs stuck behind small unimportant ones:

# Check priority of pending jobs
# (the PRIORITY column position in sprio -l output varies by release, so let sprio do the sorting)
sprio -l --sort=-Y | head -20

# If high-priority job is waiting for resources held by low-priority jobs:
# Option 1: preemption (if configured)
# Option 2: create a reservation for the important job
scontrol create reservation \
  ReservationName=priority-run \
  StartTime=now \
  Duration=04:00:00 \
  NodeCnt=32 \
  Users=alice
# Jobs run in the reservation only if submitted with --reservation=priority-run

Node drain causing queue backup:

# Check why nodes are drained
sinfo --states=drain --noheader --format="%N %E"

# Common reasons:
# "Kill task failed": job cleanup problem
# "Not responding": network/hardware issue
# "manual": admin explicitly drained

# Resume after fix
scontrol update NodeName=cn05 State=resume
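
When many nodes are drained at once, grouping them by reason shows whether there is one underlying fault or several. The sketch below counts nodes per drain reason from "node|reason" lines. The helper name count_drain_reasons is hypothetical, and the sample input is made up rather than taken from a real cluster.

```shell
# Hypothetical helper: count drained nodes per reason from "node|reason" lines on stdin.
count_drain_reasons() {
  awk -F'|' 'NF >= 2 { count[$2]++ } END { for (r in count) print count[r], r }' | sort -rn
}

# Real usage (requires a SLURM cluster); sort -u removes the duplicate lines
# that appear when a node belongs to more than one partition:
#   sinfo -N -h --states=drain -o "%n|%E" | sort -u | count_drain_reasons
# Illustration with made-up input:
printf 'cn05|Kill task failed\ncn09|Not responding\ncn11|Kill task failed\n' | count_drain_reasons
```

For the sample input this prints "2 Kill task failed" followed by "1 Not responding", which would point at a job cleanup problem as the more widespread cause.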

Effective SLURM job queue management is both a configuration task and an ongoing operations responsibility. The right partition structure, QOS policy, and fairshare calibration change as the user community and workload mix evolve. Contact Mevasis for SLURM deployment, queue architecture design, and ongoing operations support.
