Job Queue Design: Partition, QOS, and Priority Architecture
Designing SLURM job queue architecture: partition structure, QOS hierarchy, preemption and backfill policies with slurm.conf examples and sacctmgr commands.
In modern HPC clusters, efficient use of compute resources depends on both hardware capacity and job queue architecture. When hundreds of users submit jobs simultaneously, the ruleset that determines who gets which resources, and when, directly affects team productivity, perceived fairness, and cluster efficiency. This post examines partition structure, QOS hierarchy, preemption mechanisms, and backfill policies from a design perspective, using SLURM as the reference implementation.
Partition Design: Logically Partitioning Resources
Partitions are one of SLURM’s fundamental building blocks. A partition defines which nodes it encompasses, what limits apply within those nodes, and what the default resource limits will be. A well-designed partition structure helps users reach the right resources easily while giving administrators granular control.
Partition Types and Use Cases
In practice, most clusters use multiple partitions, each targeting a different job profile:
| Partition Name | Target Job Type | Typical Wall Time | Priority |
|---|---|---|---|
| interactive | Development, testing, debugging | 4 hours | High |
| short | Short batch jobs | 24 hours | Normal |
| long | Long-running simulations | 7 days | Low |
| gpu | GPU-required jobs | 48 hours | Normal |
| highmem | Memory-intensive analysis | 48 hours | Normal |
| preemptable | Low-priority flexible jobs | Unlimited | Lowest |
In this structure, the interactive partition has a small, fast resource pool while preemptable can use all idle nodes but is subject to interruption when priority jobs arrive.
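As a rough illustration of how the wall-time column steers jobs, here is a minimal Python sketch that suggests a general CPU partition for a batch job. The helper name `pick_cpu_partition` and its `tolerate_preemption` argument are hypothetical and not part of SLURM; only the hour limits come from the table above.

```python
# Wall-time limits in hours for the general CPU batch partitions,
# copied from the table above.
PARTITION_MAX_HOURS = {
    "short": 24,
    "long": 7 * 24,
}


def pick_cpu_partition(hours_needed, tolerate_preemption=False):
    """Hypothetical helper: suggest a partition for a CPU batch job."""
    if tolerate_preemption:
        # No wall-time limit, but the job may be interrupted.
        return "preemptable"
    for name in ("short", "long"):
        if hours_needed <= PARTITION_MAX_HOURS[name]:
            return name
    return None  # nothing fits: split the job or talk to an administrator


print(pick_cpu_partition(6))    # short
print(pick_cpu_partition(72))   # long
print(pick_cpu_partition(400))  # None
```

In practice users make this choice themselves with the `--partition` option at submission; the sketch only makes the decision rule explicit.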
Partition Definition in slurm.conf
# /etc/slurm/slurm.conf
PartitionName=interactive Nodes=node[01-04] Default=NO MaxTime=04:00:00 \
State=UP DefMemPerCPU=4096 Priority=100
PartitionName=short Nodes=node[01-32] Default=YES MaxTime=1-00:00:00 \
State=UP DefMemPerCPU=4096 Priority=50
PartitionName=long Nodes=node[01-32] Default=NO MaxTime=7-00:00:00 \
State=UP DefMemPerCPU=4096 Priority=20
PartitionName=gpu Nodes=gpu[01-08] Default=NO MaxTime=2-00:00:00 \
State=UP DefMemPerCPU=8192 Priority=50
PartitionName=highmem Nodes=bigmem[01-04] Default=NO MaxTime=2-00:00:00 \
State=UP DefMemPerCPU=32768 Priority=50
PartitionName=preemptable Nodes=node[01-32],gpu[01-08] Default=NO \
MaxTime=UNLIMITED State=UP Priority=1 PreemptMode=REQUEUE
The PreemptMode=REQUEUE setting here means that a preempted job is returned to the queue and restarted automatically when a suitable node becomes available. The job restarts from the beginning, so long-running analyses keep their progress only if the application writes checkpoints and resumes from them.
QOS Hierarchy: Fine-Grained Resource Control
While partition structure provides a coarse-grained, cluster-wide separation, the QOS (Quality of Service) layer lets you define limits and privileges at the individual user or group level. QOS works alongside partitions: a job generally has to satisfy both sets of limits, so the more restrictive of the two is the one that effectively applies.
QOS Levels and Limits
A three-tier QOS hierarchy produces functional results in typical academic or enterprise HPC environments:
Basic User QOS (normal)
- Maximum concurrent running jobs: 10
- Maximum CPU cores: 256
- Maximum memory: 1 TB
- Priority multiplier: 1.0
Research Group QOS (group_standard)
- Maximum concurrent running jobs: 30
- Maximum CPU cores: 1024
- Priority multiplier: 1.5
Priority User QOS (priority_user)
- Reservation rights for quick starts
- Preemption protection
- Priority multiplier: 3.0
# QOS definitions created with sacctmgr
sacctmgr add qos normal \
MaxJobsPerUser=10 MaxTRESPerUser=cpu=256,mem=1000G \
Priority=10
sacctmgr add qos group_standard \
MaxJobsPerUser=30 MaxTRESPerUser=cpu=1024 \
Priority=15 Flags=DenyOnLimit
sacctmgr add qos priority_user \
MaxJobsPerUser=50 MaxTRESPerUser=cpu=2048 \
Priority=30
# Assign QOS to a user
sacctmgr modify user where name=alice account=researchgroup \
set qos=normal,group_standard defaultqos=group_standard
The DenyOnLimit flag is important: without it, a job that exceeds a QOS limit is still accepted and simply stays pending until it conforms to the limit, which may be never. With DenyOnLimit, a job that could not satisfy the limits on its own is rejected at submission time and the user gets a clear error message. This is preferred for systems with billing models.
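The effect of the flag can be sketched with a simplified model of a single per-user CPU limit. This is an approximation of the documented behavior, not SLURM's actual code path; the function name `submit_outcome` and the outcome strings are hypothetical.

```python
def submit_outcome(requested_cpus, cpus_in_use, max_cpus_per_user, deny_on_limit):
    """Simplified model of one per-user CPU limit at submission time."""
    if requested_cpus > max_cpus_per_user:
        # The job can never fit under the limit, even on an idle cluster.
        return "rejected" if deny_on_limit else "pending, never eligible"
    if cpus_in_use + requested_cpus > max_cpus_per_user:
        # The job fits on its own but must wait for the user's other jobs.
        return "pending"
    return "eligible"


# Using the group_standard limit of 1024 cores from the definitions above.
print(submit_outcome(2048, 0, 1024, deny_on_limit=True))    # rejected
print(submit_outcome(2048, 0, 1024, deny_on_limit=False))   # pending, never eligible
print(submit_outcome(512, 768, 1024, deny_on_limit=True))   # pending
print(submit_outcome(256, 512, 1024, deny_on_limit=True))   # eligible
```

The second case is the one that confuses users most: the job sits in the queue indefinitely with no error, which is why an explicit rejection is often the friendlier choice.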
Fairshare and Priority Calculation
SLURM’s multi-factor priority system evaluates several components together:
Job Priority = (PriorityWeightAge × Age Factor)
+ (PriorityWeightFairshare × Fairshare Factor)
+ (PriorityWeightJobSize × Job Size Factor)
+ (PriorityWeightPartition × Partition Factor)
+ (PriorityWeightQOS × QOS Factor)
The fairshare factor reflects how much of its allocated share a group has actually consumed in the recent past. Groups that have used less than their share get a higher fairshare factor, which raises the priority of their queued jobs. Recorded usage decays over time with a configurable half-life, so old usage gradually stops counting against a group and the system stays balanced.
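The decay can be illustrated with plain half-life arithmetic. This is a minimal sketch assuming simple exponential decay and a 7-day half-life; the function name `decayed_usage` is hypothetical, and SLURM's internal accounting is more involved than this.

```python
def decayed_usage(raw_usage, age_days, half_life_days=7.0):
    """Weight past usage by exponential decay with the given half-life."""
    return raw_usage * 0.5 ** (age_days / half_life_days)


# 1000 CPU-hours of usage, seen at different ages.
print(decayed_usage(1000, 0))    # 1000.0
print(decayed_usage(1000, 7))    # 500.0
print(decayed_usage(1000, 14))   # 250.0
```

After two half-lives, a burst of heavy usage counts for only a quarter of its original size, which is what lets a group recover its priority after a busy period.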
# Priority weights in slurm.conf
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=100
PriorityWeightPartition=1000
PriorityWeightQOS=5000
PriorityDecayHalfLife=7-0 # 7-day half-life
PriorityMaxAge=7-0
In this configuration, fairshare carries the largest weight, so it dominates the other factors. A researcher who hasn’t submitted any jobs for a long time gains high priority and can use the resource entitlement they have accumulated.
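To see how the weights interact, here is a minimal sketch of the weighted sum, using the weights from the configuration above. The factor values for the two example users are made up for illustration, and each factor is assumed to be normalized to the range 0.0 to 1.0; the function name `job_priority` is hypothetical.

```python
# Weights copied from the slurm.conf excerpt above.
WEIGHTS = {
    "age": 1000,
    "fairshare": 10000,
    "job_size": 100,
    "partition": 1000,
    "qos": 5000,
}


def job_priority(factors, weights=WEIGHTS):
    """Weighted sum of factors, each expected in the range 0.0 to 1.0."""
    return sum(weights[name] * factors.get(name, 0.0) for name in weights)


# Illustrative factor values, not measurements from a real cluster.
idle_researcher = {"age": 0.1, "fairshare": 0.9, "job_size": 0.2,
                   "partition": 0.5, "qos": 0.3}
heavy_user = {"age": 1.0, "fairshare": 0.1, "job_size": 0.2,
              "partition": 0.5, "qos": 0.3}

print(round(job_priority(idle_researcher)))  # 11120
print(round(job_priority(heavy_user)))       # 4020
```

Even though the heavy user's job has waited the maximum age, the freshly submitted job from the under-served researcher ranks well ahead of it, because a fairshare weight of 10000 outweighs an age weight of 1000.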
Preemption: Fast-Tracking Urgent Jobs
Preemption is the mechanism by which a high-priority job can interrupt running lower-priority jobs. Configured carelessly, it produces chaotic results; designed well, it significantly improves cluster utilization.
Preemption Modes
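SLURM's main preemption modes are:
- CANCEL: The lower-priority job is terminated and its progress is lost.
- REQUEUE: The job is stopped and returned to the queue. SLURM needs no checkpoint support for this, but the job restarts from the beginning unless the application saves and restores its own state.
- SUSPEND: The job is paused and resumed later, which typically requires gang scheduling and enough memory to hold both jobs.
Older SLURM releases also offered a CHECKPOINT mode, which relied on checkpoint plugins that have since been removed.
For most environments, REQUEUE is the most balanced option. The preempted job is not lost; it is rescheduled automatically once resources are available, and applications that checkpoint can pick up from their last saved state.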
# Preemption hierarchy definition
# Which QOS can preempt which QOS?
sacctmgr modify qos priority_user set Preempt=normal,group_standard
sacctmgr modify qos group_standard set Preempt=normal
# Enable QOS-based preemption cluster-wide
# slurm.conf
PreemptType=preempt/qos
PreemptMode=REQUEUE
PreemptExemptTime=00:05:00 # Cannot be preempted for first 5 minutes
The PreemptExemptTime parameter is a small but critical setting. If a newly started job is immediately preempted, the startup cost is wasted. With this value, a job cannot be interrupted until it has run for at least 5 minutes.
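The two rules combine as follows: the preemptor's QOS must be allowed to preempt the victim's QOS, and the victim must have run past the exempt window. The sketch below models that decision in Python; the function `may_preempt` and the `PREEMPTS` table are hypothetical names, with the table mirroring the `Preempt=` settings shown earlier.

```python
# Which QOS may preempt which, mirroring the sacctmgr Preempt settings.
PREEMPTS = {
    "priority_user": {"normal", "group_standard"},
    "group_standard": {"normal"},
    "normal": set(),
}
EXEMPT_SECONDS = 5 * 60  # the 5-minute exempt window


def may_preempt(preemptor_qos, victim_qos, victim_runtime_seconds):
    """Return True if the running victim job may be preempted now."""
    if victim_qos not in PREEMPTS.get(preemptor_qos, set()):
        return False  # the hierarchy does not allow this pairing
    return victim_runtime_seconds >= EXEMPT_SECONDS


print(may_preempt("priority_user", "normal", 600))   # True
print(may_preempt("priority_user", "normal", 120))   # False, still exempt
print(may_preempt("normal", "group_standard", 600))  # False, wrong direction
```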
Backfill Scheduler: Filling the Gaps
SLURM’s main scheduler runs jobs in priority order. The backfill scheduler improves cluster utilization by starting lower-priority jobs in the gaps that would otherwise sit idle, provided that doing so does not delay the expected start time of any higher-priority job.
For example, a high-priority job that needs 2 hours may be waiting for a node whose current job is expected to finish in 1 hour. The backfill scheduler can fit a small job with a 30-minute time limit into this window on resources that are already free; the large job still starts on time and no resources sit idle.
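The core test in that example fits in a few lines. This is a deliberately simplified sketch that looks at a single reserved start time; the function name `can_backfill` is hypothetical, and the real scheduler tracks many reservations across all nodes.

```python
def can_backfill(candidate_limit_min, minutes_until_reserved_start):
    """True if the candidate's time limit ends before the reserved start."""
    return candidate_limit_min <= minutes_until_reserved_start


# The large job is expected to start in 60 minutes.
print(can_backfill(30, 60))  # True
print(can_backfill(90, 60))  # False
```

The check uses the job's requested time limit, not its actual runtime, because the limit is all the scheduler knows in advance. This is why users who request accurate, tight time limits see their jobs backfilled more often.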
# slurm.conf backfill settings
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_test=1000,bf_resolution=600,\
bf_max_time=60,bf_continue,bf_window=2880
# bf_max_job_test: jobs evaluated per backfill cycle
# bf_resolution: time resolution (seconds)
# bf_max_time: maximum time for backfill calculation (seconds)
# bf_window: lookahead window (minutes, 48 hours here)
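The bf_continue parameter is especially important for large clusters. The backfill scheduler periodically releases its locks so that other operations on the management node are not blocked; with bf_continue set, it then carries on through its original job list instead of starting over whenever job or node state has changed in the meantime. This lets it work deeper into a long queue within each cycle.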
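A quick arithmetic check helps when choosing these values. The sketch below converts the settings above into hours and resolution-sized steps, and compares the window with the longest time limit defined for the partitions in this post (7 days). The variable names are hypothetical, and treating the window as a simple count of steps is a simplification of what the scheduler actually does.

```python
BF_WINDOW_MIN = 2880            # bf_window, in minutes
BF_RESOLUTION_SEC = 600         # bf_resolution, in seconds
LONGEST_LIMIT_MIN = 7 * 24 * 60  # MaxTime of the long partition, in minutes

window_hours = BF_WINDOW_MIN / 60
steps_in_window = (BF_WINDOW_MIN * 60) // BF_RESOLUTION_SEC
window_covers_longest_limit = BF_WINDOW_MIN >= LONGEST_LIMIT_MIN

print(window_hours)                  # 48.0
print(steps_in_window)               # 288
print(window_covers_longest_limit)   # False
```

The last line is worth noting: a lookahead window at least as long as the longest allowed time limit is generally advisable to avoid starving long jobs, so a site running 7-day jobs may want to weigh a larger bf_window against the extra scheduling overhead.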
Design Principles and Common Mistakes
Understand User Profiles First
Before beginning partition and QOS design, analyzing the cluster’s job profiles is essential. Do most users submit short jobs? Are GPU jobs a priority? Are there a few large users or hundreds of small users? Answers to these questions directly shape the architecture.
Avoid Overly Complex Structures
Dozens of partition and QOS combinations make administration harder and make it difficult for users to make the right selection. Starting with a simple structure and expanding incrementally as needs arise produces much healthier results than starting the other way around.
Keep Limits Realistic
Limits that are too low constrain users and push them to seek workarounds; limits that are too high allow resources to be monopolized by a single user. Ideal limits should be grounded in historical usage data.
Don’t Neglect Monitoring and Reporting
# Useful commands for analyzing queue status
squeue --format="%.18i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R" --sort=-p
sshare -a -l # Fairshare status
sprio -l # Show priority components
sacct -S 2026-06-01 --format=JobID,Partition,QOS,State,CPUTime
When data from these commands is reviewed regularly, the correctness of design decisions can be tested and necessary improvements made.
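As one way to review this data regularly, the sketch below totals CPU hours per partition from pipe-delimited `sacct` output. It assumes the `sacct` command above is run with `--parsable2` (and ideally `-X` so only job allocations are listed) and that CPUTime appears as `[D-]HH:MM:SS`. The sample text is made up for illustration, and the function names are hypothetical.

```python
def cputime_to_hours(text):
    """Convert a '[D-]HH:MM:SS' string to hours."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return days * 24 + hours + minutes / 60 + seconds / 3600


def cpu_hours_by_partition(sacct_text):
    """Sum CPU hours per partition from pipe-delimited sacct output."""
    lines = sacct_text.strip().splitlines()
    header = lines[0].split("|")
    totals = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split("|")))
        hours = cputime_to_hours(row["CPUTime"])
        totals[row["Partition"]] = totals.get(row["Partition"], 0.0) + hours
    return totals


# Made-up sample in the shape of `sacct --parsable2` output.
SAMPLE = """JobID|Partition|QOS|State|CPUTime
1001|short|normal|COMPLETED|02:00:00
1002|gpu|group_standard|COMPLETED|1-00:00:00
1003|short|normal|FAILED|00:30:00"""

print(cpu_hours_by_partition(SAMPLE))  # {'short': 2.5, 'gpu': 24.0}
```

Comparing these totals against each partition's node count over the same period gives a first indication of whether the partition boundaries match real demand.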
Conclusion
Effective SLURM job queue architecture requires thinking about partition definitions, QOS hierarchy, preemption policies, and backfill settings together. Every cluster has its own job profile and user community, so there is no single universal configuration. What matters is supporting design decisions with data, applying changes in small steps, and continuously monitoring system behavior.
Mevasis is happy to support you on SLURM job queue design and HPC cluster architecture. Contact us.