What Is a Job Scheduler? How HPC Cluster Queues Work

What is a job scheduler, why is it necessary, and how does it work? SLURM, PBS, LSF, and Grid Engine systems, key concepts (job, queue, wall time, backfill), job script examples, and the scheduling decision process.

An HPC cluster may house hundreds of servers and thousands of CPU cores. On any given day, dozens to hundreds of researchers want to use this shared resource simultaneously, each convinced that “my job is the most important.” Without organized management, users would all log in at once and compete for the same nodes: jobs would crash or stall, and access would be unfair. This is exactly the problem a job scheduler solves.

What Is a Job Scheduler?

A job scheduler (also called a workload manager or batch system) is software that takes a set of compute resources (CPU cores, memory, GPUs, network bandwidth) and distributes them fairly and efficiently among multiple user jobs.

In the simplest form: users submit jobs to a queue. The scheduler evaluates each job, determines which nodes have sufficient resources, and when the right time comes, it starts the job. When the job finishes, those resources return to the pool and the next eligible job runs.
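
The loop described above can be sketched as a toy model in shell. This is only an illustration: the function name `schedule_pass`, the job names, and the core counts are made up, and a real scheduler also weighs priority, memory, and time limits before starting anything.

```shell
#!/bin/bash
# Toy model of one scheduling pass: walk the queue in order and start
# every job whose core request fits into what is still free.
# Jobs are given as "name:cores"; all values are illustrative.
schedule_pass() {
    local free_cores=$1
    shift
    local job name cores
    for job in "$@"; do
        name=${job%%:*}     # text before the colon
        cores=${job##*:}    # text after the colon
        if [ "$cores" -le "$free_cores" ]; then
            free_cores=$(( free_cores - cores ))
            echo "START   $name ($cores cores, $free_cores left)"
        else
            echo "PENDING $name (needs $cores, only $free_cores free)"
        fi
    done
}

# 64 free cores, three queued jobs
schedule_pass 64 "jobA:32" "jobB:48" "jobC:16"
```

In this run jobA and jobC start while jobB stays pending until cores return to the pool.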

Key Concepts

| Concept | Definition |
|---|---|
| Job | A unit of work submitted for execution (a script, binary, or command) |
| Queue / Partition | A logical grouping of resources with defined policies |
| Node | A physical server in the cluster |
| Core | An individual CPU processing unit, the basic unit of CPU allocation (with simultaneous multithreading, one core can expose several hardware threads) |
| Wall time | Elapsed real (wall-clock) time; the wall-time limit is the maximum a job is allowed to run |
| Priority | A numerical score determining which job runs next when resources are limited |
| Backfill | Running smaller jobs in gaps left by large pending jobs, without delaying the start of higher-priority jobs |

Why Not Just SSH and Run Directly?

On small clusters, some teams do exactly this: log in over SSH and run jobs interactively. This approach breaks down quickly:

  • Two users try to run memory-intensive jobs on the same node → OOM kill, one or both jobs fail.
  • A job runs indefinitely and starves other researchers for days.
  • No record of who used which resources when → impossible to do capacity planning.
  • GPUs sit idle overnight because no one scheduled overnight jobs.

A scheduler solves all of these problems: it enforces resource limits, tracks usage, fills idle gaps automatically, and provides fair access across all users.

Major HPC Schedulers

| Scheduler | License | Typical Environment |
|---|---|---|
| SLURM | Open source (GPLv2) | Universities, research centers, national labs, most new HPC clusters |
| PBS / OpenPBS | Open source (AGPLv3) for OpenPBS; PBS Professional is commercial | Legacy clusters, some national facilities |
| LSF (IBM Spectrum LSF) | Commercial | Financial services, pharmaceutical industry |
| Grid Engine (UGE/OGE) | Open source / commercial | Legacy academic environments |

SLURM (originally an acronym for Simple Linux Utility for Resource Management) has become the dominant choice for new deployments. Its active development, large community, and strong integration with NVIDIA GPU management make it the de facto standard.

SLURM Job Script Example

A SLURM job is submitted as a shell script with #SBATCH directives at the top. Here is a realistic example for a CFD simulation with OpenFOAM:

#!/bin/bash
#SBATCH --job-name=cfd_cavity
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --mem-per-cpu=4G
#SBATCH --time=08:00:00
#SBATCH --partition=compute
#SBATCH --output=cfd_%j.log
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=researcher@university.edu

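# Load required modules (module names and versions are site-specific;
# foamRun requires OpenFOAM 11 or later)
module load openfoam/11 openmpi/4.1.6

# Move to the case directory; stop if it does not exist
cd "$SLURM_SUBMIT_DIR/cavity_case" || exit 1

# Split the mesh into one subdomain per MPI rank
# (numberOfSubdomains in system/decomposeParDict must equal 128)
decomposePar -force

# Run the solver in parallel, one MPI rank per SLURM task
mpirun -np "$SLURM_NTASKS" foamRun -parallel

# Merge the per-process results for the latest time step
reconstructPar -latestTime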

Key directives:

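  • --nodes=4 --ntasks-per-node=32: Request 4 nodes with 32 tasks each, so 4 × 32 = 128 MPI processes in total (one core per task by default).
  • --mem-per-cpu=4G: Request 4G of memory per allocated core, which is 128G per node here.
  • --time=08:00:00: Wall-time limit. If the job is still running after 8 hours, it is automatically terminated.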
  • --partition=compute: Target the compute partition specifically.
  • $SLURM_NTASKS: Environment variable SLURM sets automatically to the total number of tasks (128 here).
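
To see what the request in the example script adds up to, the arithmetic can be sketched in a few lines of shell. The helper `request_totals` is hypothetical (it is not a SLURM command) and assumes one core per task.

```shell
#!/bin/bash
# Hypothetical helper: total up a request of the form
#   --nodes=N --ntasks-per-node=T --mem-per-cpu=M   (M in G, one core per task)
request_totals() {
    local nodes=$1 tasks_per_node=$2 mem_per_cpu=$3
    local total_tasks=$(( nodes * tasks_per_node ))
    local mem_per_node=$(( tasks_per_node * mem_per_cpu ))
    local total_mem=$(( total_tasks * mem_per_cpu ))
    echo "tasks=${total_tasks} mem_per_node=${mem_per_node}G total_mem=${total_mem}G"
}

# The example script: 4 nodes, 32 tasks per node, 4G per core
request_totals 4 32 4
# prints: tasks=128 mem_per_node=128G total_mem=512G
```

Checking these totals against the node hardware before submitting helps avoid requests that no node in the partition can satisfy.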

Submit with sbatch job.sh. Check status with squeue -u $USER. Cancel with scancel <jobid>.
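
When submission is scripted, it helps to capture the job ID so that later commands can refer to it. The following is a minimal sketch: `submit_job` is a hypothetical wrapper name, while `--parsable` is an existing sbatch option that prints only the job ID (optionally followed by `;` and the cluster name).

```shell
#!/bin/bash
# Hypothetical wrapper: submit a script and print only the numeric job ID.
submit_job() {
    local out
    out=$(sbatch --parsable "$@") || return 1
    echo "${out%%;*}"    # drop the optional ";cluster" suffix
}

# Usage on a cluster (commented out so the sketch has no side effects):
#   jobid=$(submit_job job.sh)
#   squeue -j "$jobid"     # check status
#   scancel "$jobid"       # cancel if needed
```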

The Scheduling Decision Process

When a job is submitted, the scheduler works through three phases:

1. Resource eligibility check

  • Does this user’s account exist? Are they within their QOS CPU/memory/GPU limits?
  • Are there enough free nodes that match the job’s resource request?
  • Does the partition allow this wall time?
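
The wall-time part of this check can be sketched as follows. The helper names `to_seconds` and `walltime_allowed` are hypothetical, and the sketch only handles the HH:MM:SS and D-HH:MM:SS time formats; the 1-day partition limit is an assumed value.

```shell
#!/bin/bash
# Convert a time limit (HH:MM:SS or D-HH:MM:SS) to seconds.
# awk is used because it reads "08" as decimal 8, not as an octal literal.
to_seconds() {
    echo "$1" | awk -F'[-:]' '
        NF == 4 { print $1*86400 + $2*3600 + $3*60 + $4; next }
        NF == 3 { print $1*3600 + $2*60 + $3; next }
        { exit 1 }'
}

# Succeeds (exit status 0) if the requested wall time fits the partition limit.
walltime_allowed() {
    local requested max
    requested=$(to_seconds "$1") || return 2
    max=$(to_seconds "$2") || return 2
    [ "$requested" -le "$max" ]
}

# Assumed partition limit: 1 day
walltime_allowed 08:00:00 1-00:00:00 && echo "eligible"
walltime_allowed 2-00:00:00 1-00:00:00 || echo "rejected: exceeds partition limit"
```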

2. Priority calculation

If multiple jobs are eligible, the scheduler computes a priority score for each. SLURM’s multi-factor priority considers:

  • Fairshare: Has this group been using more or less than its allocation lately? Groups that have used less than their share get higher priority.
  • Age: How long has the job been waiting? Older jobs gain priority over time.
  • QOS: Does the job’s QOS class carry a priority multiplier?
  • Job size: Depending on configuration, either larger jobs (SLURM’s default) or smaller jobs (with PriorityFavorSmall enabled) get a bonus.
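
In simplified form, the multi-factor score is a weighted sum in which each factor is a number between 0.0 and 1.0. The sketch below covers only the four factors listed above; the weights are illustrative values rather than SLURM defaults, and `job_priority` is a hypothetical helper.

```shell
#!/bin/bash
# Illustrative weights (assumptions, not SLURM defaults)
W_FAIRSHARE=10000
W_AGE=1000
W_QOS=2000
W_JOBSIZE=500

# job_priority FAIRSHARE AGE QOS JOBSIZE  -> integer score
# Each argument is a factor between 0.0 and 1.0.
job_priority() {
    awk -v fs="$1" -v age="$2" -v qos="$3" -v size="$4" \
        -v wfs="$W_FAIRSHARE" -v wage="$W_AGE" \
        -v wqos="$W_QOS" -v wsize="$W_JOBSIZE" \
        'BEGIN {
            total = wfs*fs + wage*age + wqos*qos + wsize*size
            printf "%d\n", int(total + 0.5)   # round to nearest integer
        }'
}

# Group under its share: 8000 + 500 + 0 + 50
job_priority 0.8 0.5 0.0 0.1    # prints 8550
# Group over its share, same age and size: 2000 + 500 + 0 + 50
job_priority 0.2 0.5 0.0 0.1    # prints 2550
```

With these weights fairshare dominates: the two jobs differ only in their fairshare factor, yet the first scores more than three times higher.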

3. Backfill analysis

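After selecting the highest-priority job, the scheduler looks at the rest of the queue. Suppose a large job cannot start for 2 hours because some of the nodes it needs are still busy, while others are already idle. If a smaller job fits on those idle nodes and has a wall-time limit of 1 hour, the backfill scheduler starts it immediately. The large job still starts at its reserved time, because backfill only admits jobs whose wall-time limit ends before that reservation. This is also why realistic wall-time requests matter: a shorter limit gives a job more chances to be backfilled.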
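
The core test behind backfill can be sketched in a few lines. `can_backfill` is a hypothetical helper, times are in minutes from now, and the sketch assumes the candidate job needs nodes that are reserved for the higher-priority job.

```shell
#!/bin/bash
# Hypothetical helper for the backfill decision.
#   $1 = wall-time limit of the candidate (smaller) job, in minutes
#   $2 = minutes until the reserved start of the higher-priority job
can_backfill() {
    local candidate_limit=$1 reserved_start=$2
    # The scheduler has to assume the candidate runs to its full limit.
    [ "$candidate_limit" -le "$reserved_start" ]
}

# The large job is expected to start in 120 minutes.
can_backfill 60 120  && echo "60-minute job: start now"
can_backfill 180 120 || echo "180-minute job: wait, it would delay the large job"
```

The decision uses the requested limit, not the time the job will actually take, since the scheduler cannot know the real runtime in advance.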

From the Administrator’s Perspective

A cluster administrator using SLURM has direct control over all of these knobs:

  • Partition definitions: Which nodes belong to which queue, what are the default/maximum wall times, which users can access which partition.
  • QOS policies: Per-user or per-group limits on concurrent jobs, CPU cores, memory, GPUs.
  • Priority weights: How much fairshare, age, and QOS contribute to the overall score.
  • Node states: Mark nodes as drain (don’t start new jobs, finish current ones) for maintenance; down for hardware failures; resume to bring nodes back.

Typical commands for these tasks:

# Show all jobs in the queue
squeue --format="%.10i %.9P %.20j %.8u %.8T %.10M %R"

# Drain a node for maintenance
scontrol update nodename=node07 state=drain reason="hardware check"

# Display a job's resource usage after completion
seff 12345

# Show priority components for pending jobs
sprio -l
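
The partition and priority settings listed above live in slurm.conf. The fragment below is a sketch rather than a recommended configuration: the parameter names are SLURM's own, but the node names, time limits, and weights are placeholder values.

```conf
# Scheduling and priority plugins
SchedulerType=sched/backfill
PriorityType=priority/multifactor

# Relative influence of each priority factor (placeholder values)
PriorityWeightFairshare=10000
PriorityWeightAge=1000
PriorityWeightQOS=2000
PriorityWeightJobSize=500

# One partition: member nodes, default and maximum wall time (placeholder values)
PartitionName=compute Nodes=node[01-16] Default=YES DefaultTime=01:00:00 MaxTime=1-00:00:00 State=UP
```

QOS limits are not set in this file; they are stored in the accounting database and managed with sacctmgr.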

Conclusion

A job scheduler is the invisible engine that makes shared HPC resources usable in practice. Without it, an HPC cluster is just an expensive pile of servers. With a well-configured scheduler, those servers can keep CPU utilization above 90% on a busy cluster, provide fair access to dozens of research teams simultaneously, and respond to node failures by taking the failed node out of service and requeueing the affected jobs without administrator intervention.
