SLURM Workload Manager

SLURM — open-source, fault-tolerant, highly scalable cluster management and job scheduling system for Linux HPC clusters of any size.

SLURM (Simple Linux Utility for Resource Management) is an open-source, fault-tolerant and highly scalable cluster management and job scheduling system for Linux clusters of any size. It requires no kernel modifications, so it introduces no software compatibility problems, and it powers everything from small on-prem R&D clusters to the world's largest supercomputers.

Core Functions

As a workload scheduler, SLURM has three core functions:

  1. Grants users exclusive or shared access to compute resources (compute nodes) for a specified duration so they can run their work.
  2. Provides a framework for starting, executing and monitoring jobs on the set of allocated nodes.
  3. Arbitrates contention for resources by managing a queue of pending jobs.
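In practice, all three functions show up in a single batch script: the #SBATCH directives request the allocation, srun launches the work on it, and the job waits in the queue until resources are free. A minimal sketch (the partition name, resource counts and application binary are illustrative assumptions, not defaults):

```bash
#!/bin/bash
#SBATCH --job-name=example        # name shown in squeue
#SBATCH --partition=compute       # assumed partition name; site-specific
#SBATCH --nodes=2                 # request two compute nodes (function 1)
#SBATCH --ntasks-per-node=4       # four tasks per node
#SBATCH --time=00:30:00           # requested duration of the allocation
#SBATCH --output=%x-%j.out        # stdout file: jobname-jobid.out

# srun starts and monitors the tasks on the allocated nodes (function 2)
srun ./my_app
```

Submitted with sbatch, the script sits in the pending queue (function 3) until the scheduler can satisfy the request.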

Optional Plugins

  • Accounting — usage records and billing per user/project
  • Advanced resource reservation — for maintenance windows and special workloads
  • Time-sharing for parallel jobs
  • Topology-optimized resource selection — InfiniBand fat-tree, dragonfly, etc.
  • Backfill scheduling
  • Resource limits per user or bank account
  • Multi-factor job-priority algorithms
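Several of these plugins are switched on with one-line settings in slurm.conf. An illustrative excerpt (values are examples of real parameters, not tuning recommendations):

```ini
SchedulerType=sched/backfill           # backfill scheduling
PriorityType=priority/multifactor      # multi-factor job priority
TopologyPlugin=topology/tree           # topology-aware resource selection
AccountingStorageType=accounting_storage/slurmdbd   # accounting via slurmdbd
```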

Architecture

SLURM has a central manager, the slurmctld daemon, that monitors resources and work. You can also configure a backup manager to take over in case of failure, so jobs keep running uninterrupted on the cluster.

Each compute server runs a slurmd daemon — comparable to a remote shell — that waits for, executes and reports on work. The slurmd daemons provide fault-tolerant hierarchical communication.

An optional slurmdbd (SLURM Database Daemon) component records usage and user statistics for one or more SLURM-managed clusters in a single database.
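Failover is configured by naming more than one controller host in slurm.conf: the first entry is the primary and later entries are backups. A sketch (hostnames and path are assumptions):

```ini
SlurmctldHost=ctl-primary                 # primary controller
SlurmctldHost=ctl-backup                  # takes over if the primary fails
StateSaveLocation=/var/spool/slurmctld    # state directory both controllers must reach
```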

Capabilities

User Commands

SLURM gives end users a small, powerful command set:

Command       Description
srun          Launches jobs (interactive or batch)
sbatch        Submits batch job scripts to the queue
scancel       Terminates queued or running jobs
sinfo         Reports system status (partitions/nodes)
squeue        Reports the status of jobs
sacct         Reports info on running or completed jobs and job steps
smap / sview  Graphical reporting of system and job state, including network topology
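A typical end-user session strings these commands together. A hedged sketch of such a session on a live cluster (the job ID below is illustrative):

```shell
sbatch job.sh        # submit a batch script to the queue
squeue -u $USER      # list my pending and running jobs
sinfo                # show partition and node status
scancel 12345        # cancel a job by its ID (illustrative ID)
sacct -j 12345       # accounting details for that job and its steps
```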

Administrator Tools

For cluster administrators:

  • scontrol — view and modify configuration and state
  • sacctmgr — manage the accounting database (users, accounts, limits)

Together they give full control over runtime policies, queue configurations and resource-allocation rules.
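For example, draining a node for maintenance and setting a per-user limit look roughly like this (the hostname, account and user names are assumptions):

```shell
scontrol show node node001                             # inspect a node's state
scontrol update NodeName=node001 State=DRAIN Reason="maintenance"
sacctmgr add account research Description="R&D"        # create an accounting bank account
sacctmgr modify user where name=alice set MaxJobs=50   # per-user job limit
```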

Scalability & Fault Tolerance

  • A single SLURM controller can manage tens of thousands of compute nodes
  • Failover support via a backup controller
  • Hierarchical communication keeps any single node from becoming a bottleneck
  • Plugin-based architecture — queue algorithms, accounting back-ends and topology models are all modular and replaceable

A significant fraction of the Top500 supercomputers run SLURM.

Power Management & Resource Policies

  • Fair-share prioritization by user, group, project and account
  • Advanced resource modeling with GPUs / GRES (Generic Resources)
  • Energy savings at the node level (power-down idle nodes)
  • QoS (Quality of Service) classes for different SLA tiers
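Node-level power saving is driven by a handful of slurm.conf parameters plus two site-provided scripts that actually power nodes down and up; a sketch (timings and script paths are assumptions):

```ini
SuspendTime=600                               # power down nodes idle for 10 minutes
SuspendProgram=/usr/local/sbin/node_suspend   # site script that powers a node down
ResumeProgram=/usr/local/sbin/node_resume     # site script that powers it back up
ResumeTimeout=300                             # seconds allowed for a node to return
```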

Licensing & Support

SLURM is open source, released under the GNU GPL v2. Mevasis offers enterprise SLURM installation, configuration, version migration, plugin development and 24/7 technical support. Get in touch via the contact page.