SLURM Workload Manager

SLURM — open-source, fault-tolerant, highly scalable cluster management and job scheduling system for Linux HPC clusters of any size.

SLURM (Simple Linux Utility for Resource Management) is an open-source, fault-tolerant and highly scalable cluster management and job scheduling system for Linux clusters of any size. It requires no kernel modifications, so it introduces no software compatibility problems, and it powers everything from small on-prem R&D clusters to the world's largest supercomputers.

Core Functions

As a workload scheduler, SLURM has three core functions:

  1. Grants users exclusive or shared access to compute resources (compute nodes) for a specified duration so they can run their work.
  2. Provides a framework for starting, executing and monitoring jobs on the set of allocated nodes.
  3. Arbitrates contention for resources by managing a queue of pending jobs.
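In practice, all three functions show up in a single batch script: the #SBATCH directives request the allocation, srun launches the work on it, and the job waits in the queue until resources are free. A minimal sketch (the partition name, resource counts and application binary are illustrative assumptions, not defaults):

```bash
#!/bin/bash
#SBATCH --job-name=example        # name shown in squeue
#SBATCH --partition=compute       # assumed partition name; site-specific
#SBATCH --nodes=2                 # request two compute nodes (function 1)
#SBATCH --ntasks-per-node=4       # four tasks per node
#SBATCH --time=00:30:00           # requested duration of the allocation
#SBATCH --output=%x-%j.out        # stdout file: jobname-jobid.out

# srun starts and monitors the tasks on the allocated nodes (function 2)
srun ./my_app
```

Submitted with sbatch, the script sits in the pending queue (function 3) until the scheduler can satisfy the request.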

Optional Plugins

  • Accounting — usage records and billing per user/project
  • Advanced resource reservation — for maintenance windows and special workloads
  • Time-sharing for parallel jobs
  • Topology-optimized resource selection — InfiniBand fat-tree, dragonfly, etc.
  • Backfill scheduling
  • Resource limits per user or bank account
  • Multi-factor job-priority algorithms
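Several of these plugins are switched on with one-line settings in slurm.conf. An illustrative excerpt (values are examples of real parameters, not tuning recommendations):

```ini
SchedulerType=sched/backfill           # backfill scheduling
PriorityType=priority/multifactor      # multi-factor job priority
TopologyPlugin=topology/tree           # topology-aware resource selection
AccountingStorageType=accounting_storage/slurmdbd   # accounting via slurmdbd
```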

Architecture

SLURM has a central manager, the slurmctld daemon, that monitors resources and work. You can also configure a backup manager to take over in case of failure, so jobs keep running uninterrupted on the cluster.

Each compute server runs a slurmd daemon — comparable to a remote shell — that waits for, executes and reports on work. The slurmd daemons provide fault-tolerant hierarchical communication.

An optional slurmdbd (SLURM Database Daemon) component records usage and user statistics for one or more SLURM-managed clusters in a single database.
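Failover is configured by naming more than one controller host in slurm.conf: the first entry is the primary and later entries are backups. A sketch (hostnames and path are assumptions):

```ini
SlurmctldHost=ctl-primary                 # primary controller
SlurmctldHost=ctl-backup                  # takes over if the primary fails
StateSaveLocation=/var/spool/slurmctld    # state directory both controllers must reach
```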

Capabilities

User Commands

SLURM gives end users a small, powerful command set:

Command       Description
srun          Launches jobs (interactive or batch)
sbatch        Submits batch job scripts to the queue
scancel       Terminates queued or running jobs
sinfo         Reports system status (partitions/nodes)
squeue        Reports the status of jobs
sacct         Reports info on running or completed jobs and job steps
smap / sview  Graphical reporting of system and job state, including network topology
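A typical end-user session strings these commands together. A hedged sketch of such a session on a live cluster (the job ID below is illustrative):

```shell
sbatch job.sh        # submit a batch script to the queue
squeue -u $USER      # list my pending and running jobs
sinfo                # show partition and node status
scancel 12345        # cancel a job by its ID (illustrative ID)
sacct -j 12345       # accounting details for that job and its steps
```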

Administrator Tools

For cluster administrators:

  • scontrol — view and modify configuration and state
  • sacctmgr — manage the accounting database (users, accounts, limits)

Together they give full control over runtime policies, queue configurations and resource-allocation rules.
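For example, draining a node for maintenance and setting a per-user limit look roughly like this (the hostname, account and user names are assumptions):

```shell
scontrol show node node001                             # inspect a node's state
scontrol update NodeName=node001 State=DRAIN Reason="maintenance"
sacctmgr add account research Description="R&D"        # create an accounting bank account
sacctmgr modify user where name=alice set MaxJobs=50   # per-user job limit
```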

Scalability & Fault Tolerance

  • A single SLURM controller can manage tens of thousands of compute nodes
  • Failover support via a backup controller
  • Hierarchical communication keeps any single node from becoming a bottleneck
  • Plugin-based architecture — queue algorithms, accounting back-ends and topology models are all modular and replaceable

A significant fraction of the Top500 supercomputers run SLURM.

Power Management & Resource Policies

  • Fair-share prioritization by user, group, project and account
  • Advanced resource modeling with GPUs / GRES (Generic Resources)
  • Energy savings at the node level (power-down idle nodes)
  • QoS (Quality of Service) classes for different SLA tiers
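Node-level power saving is driven by a handful of slurm.conf parameters plus two site-provided scripts that actually power nodes down and up; a sketch (timings and script paths are assumptions):

```ini
SuspendTime=600                               # power down nodes idle for 10 minutes
SuspendProgram=/usr/local/sbin/node_suspend   # site script that powers a node down
ResumeProgram=/usr/local/sbin/node_resume     # site script that powers it back up
ResumeTimeout=300                             # seconds allowed for a node to return
```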

Licensing & Support

SLURM is open source, released under the GNU GPL v2. Mevasis offers enterprise SLURM installation, configuration, version migration, plugin development and 24/7 technical support. Get in touch via the contact page.