SLURM (Simple Linux Utility for Resource Management) is an open-source, fault-tolerant and highly scalable cluster management and job scheduling system for Linux clusters of every size. It requires no kernel modifications, so it doesn’t introduce software compatibility problems — and it powers everything from small on-prem R&D clusters to the world’s largest supercomputers.
Core Functions
As a workload scheduler, SLURM has three core functions:
- Grants users exclusive or shared access to compute resources (compute nodes) for a specified duration so they can run their work.
- Provides a framework for starting, executing and monitoring jobs on the set of allocated nodes.
- Arbitrates contention for resources by managing a queue of pending jobs.
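As a sketch of the first two functions, a user can request an allocation and then launch work inside it. The partition name and resource sizes below are illustrative placeholders, not recommendations:

```shell
# Request an interactive allocation: 2 nodes for 30 minutes
# ("debug" is a placeholder for a site-specific partition name)
salloc --nodes=2 --time=00:30:00 --partition=debug

# Within the allocation, launch 8 tasks across the allocated nodes
srun --ntasks=8 hostname
```

When `srun` is invoked outside an existing allocation, it requests one implicitly, so simple jobs can be launched with a single command.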
Optional Plugins
- Accounting — usage records and billing per user/project
- Advanced resource reservation — for maintenance windows and special workloads
- Gang scheduling (time-sharing for parallel jobs)
- Topology-optimized resource selection — InfiniBand fat-tree, dragonfly, etc.
- Backfill scheduling
- Resource limits per user or bank account
- Multi-factor job-priority algorithms
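Most of these plugins are selected through slurm.conf. The excerpt below is an illustration of how backfill scheduling, multi-factor priority and accounting might be wired together; the weight values are examples, not recommendations:

```
SchedulerType=sched/backfill              # backfill scheduling
PriorityType=priority/multifactor         # multi-factor job-priority algorithm
PriorityWeightFairshare=10000             # fair-share component weight (example)
PriorityWeightAge=1000                    # queue-wait-time component weight (example)
AccountingStorageType=accounting_storage/slurmdbd   # accounting via slurmdbd
```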
Architecture
SLURM has a centralized manager, slurmctld, that monitors resources and work. You can also configure a backup manager to take over in case of failure, so jobs keep running uninterrupted on the cluster.
Each compute server runs a slurmd daemon — comparable to a remote shell — that waits for, executes and reports on work. The slurmd clients provide fault-tolerant hierarchical communication.
An optional slurmdbd (SLURM Database Daemon) component records usage and user statistics for one or more SLURM-managed clusters in a single database.
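A minimal slurm.conf sketch of this architecture, with hypothetical hostnames, might look like:

```
SlurmctldHost=ctl-primary             # central manager (slurmctld)
SlurmctldHost=ctl-backup              # optional backup controller for failover
StateSaveLocation=/shared/slurmctld   # must be reachable by both controllers
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd-host        # host running the optional slurmdbd
```

Listing a second SlurmctldHost is what enables failover: the backup takes over by reading the primary's saved state from the shared StateSaveLocation.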
Capabilities
User Commands
SLURM gives end users a small, powerful command set:
| Command | Description |
|---|---|
| `srun` | Launches jobs (interactive or batch) |
| `sbatch` | Submits batch job scripts to the queue |
| `scancel` | Terminates queued or running jobs |
| `sinfo` | Reports system status (partitions/nodes) |
| `squeue` | Reports the status of jobs |
| `sacct` | Reports info on running or completed jobs and job steps |
| `smap` / `sview` | Graphical reporting of system and job state, including network topology |
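A typical batch submission ties several of these commands together. The script below is a minimal sketch; `./my_app` and the resource sizes are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=01:00:00
#SBATCH --output=%x-%j.out   # log file named from job name and job ID

srun ./my_app                # srun launches the tasks on the allocated nodes
```

Submit it with `sbatch job.sh`, watch it with `squeue -u $USER`, cancel it with `scancel <jobid>`, and review resource usage afterwards with `sacct -j <jobid>`.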
Administrator Tools
For cluster administrators:
- `scontrol` — view and modify configuration and state
- `sacctmgr` — manage the accounting database (users, accounts, limits)
Together they give full control over runtime policies, queue configurations and resource-allocation rules.
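For illustration, a pair of common administrative tasks (node, user, account names and the limit value are hypothetical):

```shell
# Drain a node for maintenance, recording the reason in SLURM's state
scontrol update NodeName=node042 State=DRAIN Reason="disk replacement"

# Create an account, add a user to it, and set a CPU-time limit
sacctmgr add account research Description="Research group"
sacctmgr add user alice Account=research
sacctmgr modify account research set GrpTRESMins=cpu=1000000
```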
Scalability & Fault Tolerance
- A single SLURM controller can manage tens of thousands of compute nodes
- Failover support via a backup controller
- Hierarchical communication keeps any single node from becoming a bottleneck
- Plugin-based architecture — queue algorithms, accounting back-ends and topology models are all modular and replaceable
A significant fraction of the Top500 supercomputers run SLURM.
Power Management & Resource Policies
- Fair-share prioritization by user, group, project and account
- Advanced resource modeling with GPUs / GRES (Generic Resources)
- Energy savings at the node level (power-down idle nodes)
- QoS (Quality of Service) classes for different SLA tiers
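Node-level power saving, for example, is driven by slurm.conf parameters plus site-supplied scripts; the script paths below are placeholders:

```
SuspendTime=600                             # power down nodes idle for 10 minutes
SuspendProgram=/usr/local/sbin/node_off.sh  # site script that powers nodes down
ResumeProgram=/usr/local/sbin/node_on.sh    # site script that powers nodes back up
ResumeTimeout=300                           # seconds allowed for a node to come back
```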
Licensing & Support
SLURM is open source (GPL-2). Mevasis offers enterprise SLURM installation, configuration, version migration, plugin development and 24/7 technical support. Get in touch via the contact page.