Multi-Tenant HPC: Resource Isolation and Fair Sharing

Imagine a university HPC cluster: the bioinformatics group runs RNA-seq analysis overnight, the machine learning team fills GPU nodes by morning, and the engineering department wants CFD resources for weeks. All these groups share the same hardware — yet no one wants their work interrupted by someone else’s jobs. This scenario is the essence of multi-tenant HPC.

This post covers SLURM’s accounting (fairshare), QOS, and cgroup-based isolation mechanisms with practical configuration examples.

Why Multi-Tenancy Is Complex

On a single-user system, scheduling decisions are simple: if resources are free, run; otherwise wait. When multiple tenants (groups, projects, or departments) enter the picture, several tensions emerge:

Fairness: How does a group’s short-term over-use affect others over time?
Isolation: Can a memory leak in one job crash a neighboring job?
Priority: Should an urgent analysis or a week-old batch job run first?
Visibility: Should each tenant see only their own jobs or the entire queue?

SLURM has tools to address all of these questions — but they must be configured together, consistently.

Accounting with Hierarchical Account Structure

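The foundation of everything is SLURM’s accounting layer. The slurmdbd service stores accounts, users, their associations, and usage data in a database; fairshare and QOS policies depend on this data. Limits and QOS settings are only enforced when slurm.conf enables them through AccountingStorageEnforce (for example, AccountingStorageEnforce=associations,limits,qos).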

Creating an Account Hierarchy

A typical university structure can be modeled like this:

# Create root account
sacctmgr add account root_cluster Description="Cluster Root"

# Department accounts
sacctmgr add account bioinformatics \
    Parent=root_cluster \
    Description="Bioinformatics Department" \
    Fairshare=40

sacctmgr add account ml_lab \
    Parent=root_cluster \
    Description="Machine Learning Lab" \
    Fairshare=35

sacctmgr add account engineering \
    Parent=root_cluster \
    Description="Engineering Department" \
    Fairshare=25

# Add users (linked to accounts)
sacctmgr add user alice Account=bioinformatics DefaultAccount=bioinformatics
sacctmgr add user bob   Account=ml_lab       DefaultAccount=ml_lab
sacctmgr add user carol Account=engineering  DefaultAccount=engineering

# Optional: resource limits at the account level
# (TRES syntax; mem is given in megabytes, so 2097152 = 2048 GB)
sacctmgr modify account bioinformatics \
    set GrpTRES=cpu=512,mem=2097152,gres/gpu=8

Fairshare values work as weighted shares: out of 100 total units, bioinformatics gets 40, ml_lab 35, and engineering 25. These are not absolute allocations — they are relative weights used in priority calculation.
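To make the "relative weights" point concrete, here is a minimal sketch that normalizes sibling Fairshare values into fractions. The helper name normalize_shares is hypothetical, and the input simply repeats the three example accounts; it assumes these accounts have no other siblings under the same parent.

```shell
#!/bin/sh
# Normalize sibling Fairshare values into fractions of the parent's share.
# Input on stdin: "account raw_shares" pairs, one per line.
normalize_shares() {
    awk '{ name[NR] = $1; share[NR] = $2; total += $2 }
         END { for (i = 1; i <= NR; i++) printf "%s %.2f\n", name[i], share[i] / total }'
}

# The three example accounts from above.
printf '%s\n' "bioinformatics 40" "ml_lab 35" "engineering 25" | normalize_shares
```

Because only the ratios matter, shares of 8, 7, and 5 would produce the same fractions as 40, 35, and 25.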

Querying Usage Status

# Account-level usage summary
sreport cluster AccountUtilizationByUser Start=2026-06-01 End=2026-06-17

# Display fairshare values
sshare -A bioinformatics,ml_lab,engineering -l

Fairshare: Past Usage and Future Priority

The fairshare mechanism adjusts future job priority based on whether a group has used more or fewer resources than their entitled share. Heavy users see decreased priority; light users gain higher priority. Over time, each group ends up using approximately their allocated share.

Priority Calculation

SLURM’s multi-factor priority plugin (PriorityType=priority/multifactor) computes a job’s priority as a weighted sum of several factors. The main ones are:

Component	Description	Typical Weight
Fairshare	Past usage imbalance	30
Age	Time the job has been waiting in queue	20
Job Size	Favors large jobs by default; small jobs if PriorityFavorSmall=YES	5
QOS	Service class priority	30
Partition	Partition-level priority	15
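As a rough sketch of how these components combine, the snippet below computes a weighted sum using the example weights from the table. The function name job_priority and the factor values passed to it are made up for illustration; each factor is assumed to lie between 0.0 and 1.0, and SLURM’s real calculation contains further terms (such as TRES and nice) that are left out here.

```shell
#!/bin/sh
# job_priority FAIRSHARE AGE JOBSIZE QOS PARTITION
# Each argument is a factor between 0.0 and 1.0. The weights are the example
# values from the table; the result is truncated to an integer.
job_priority() {
    awk -v fs="$1" -v age="$2" -v size="$3" -v qos="$4" -v part="$5" 'BEGIN {
        printf "%d\n", 30 * fs + 20 * age + 5 * size + 30 * qos + 15 * part
    }'
}

# Hypothetical jobs: a lightly used account with a low QOS factor,
# then a heavily used account with the highest QOS factor.
job_priority 0.8 0.5 0.1 0.2 1.0
job_priority 0.2 0.5 0.1 1.0 1.0
```

With equal Fairshare and QOS weights, the second job ends up ahead: its QOS advantage outweighs its fairshare penalty.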

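These weights are defined in slurm.conf. The values below illustrate the ratios; because each factor is a number between 0.0 and 1.0 and the resulting priority is an integer, production sites usually scale all weights up by a common constant (for example 1000) so that small differences between jobs are not rounded away: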

PriorityType=priority/multifactor
PriorityWeightFairshare=30
PriorityWeightAge=20
PriorityWeightJobSize=5
PriorityWeightQOS=30
PriorityWeightPartition=15

# Fairshare decay: older usage fades over time (format is days-hours, so 14-0 = 14 days)
PriorityDecayHalfLife=14-0
PriorityUsageResetPeriod=NONE

With PriorityDecayHalfLife=14-0, usage from 14 days ago has half the weight of recent usage. A shorter half-life rewards short-term fairness; a longer one promotes long-term balance.
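The effect of the half-life can be sketched with a few lines of arithmetic. This is a simplified model rather than SLURM’s internal code: it assumes usage decays as 0.5 raised to the power (age / half-life), and usage_weight is a hypothetical helper name.

```shell
#!/bin/sh
# usage_weight AGE_DAYS HALF_LIFE_DAYS
# Fraction of a past usage record that still counts: 0.5 ^ (age / half_life).
usage_weight() {
    awk -v age="$1" -v hl="$2" 'BEGIN { printf "%.3f\n", exp(log(0.5) * age / hl) }'
}

# How much usage of various ages still counts with a 14-day half-life.
for age in 0 7 14 28 42; do
    printf '%2d days ago: ' "$age"
    usage_weight "$age" 14
done
```

Re-running the loop with a half-life of 7 instead of 14 shows how much faster a burst of heavy usage is forgiven.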

QOS: Fine-Grained Policy Layer

Quality of Service (QOS) is an additional constraint and privilege mechanism layered on top of the account hierarchy. A user or account can be assigned multiple QOS classes; the user selects which QOS to use at submission time.

QOS Definitions

# Standard QOS for normal usage
# (PreemptMode here controls what happens to normal jobs when they are preempted)
sacctmgr add qos normal \
    Priority=10 \
    MaxWall=7-00:00:00 \
    MaxTRESPerUser=cpu=256,gres/gpu=4 \
    PreemptMode=requeue

# "burst" QOS for short high-priority jobs
sacctmgr add qos burst \
    Priority=50 \
    MaxWall=04:00:00 \
    MaxTRESPerUser=cpu=512 \
    MaxJobsPerUser=2 \
    Preempt=normal

# Long-running low-priority background jobs
# (GraceTime is in seconds; it only matters if another QOS may preempt background)
sacctmgr add qos background \
    Priority=1 \
    MaxWall=30-00:00:00 \
    MaxTRESPerUser=cpu=1024 \
    GraceTime=1800

# Assign QOS to accounts
sacctmgr modify account bioinformatics set QOS=normal,background
sacctmgr modify account ml_lab        set QOS=normal,burst,background

Specifying QOS at Job Submission

sbatch --qos=burst --time=02:00:00 --ntasks=64 my_analysis.sh

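The burst QOS can preempt normal jobs, which are then requeued, but a burst job can run for at most 4 hours and each user can have at most 2 burst jobs running at once. QOS-based preemption only takes effect when slurm.conf sets PreemptType=preempt/qos, and the action applied to a preempted job comes from the PreemptMode of that job’s own QOS (or the cluster-wide PreemptMode if none is set). This structure allows urgent analyses to get fast responses while the system also accommodates long-running background jobs.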

Resource Isolation: cgroup-Based Memory and CPU Limits

Fairshare and QOS policies operate at the scheduling layer — they do not limit how many resources a running job actually consumes. Real isolation requires Linux cgroup integration.

slurm.conf and cgroup.conf Configuration

# In slurm.conf
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

# cgroup.conf file
CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes

# Disallow swap, so a job that exceeds its memory limit is OOM-killed rather than swapped out
MemorySwappiness=0
AllowedRAMSpace=100
AllowedSwapSpace=0

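With this configuration, SLURM enforces the CPU cores and memory allocated to a job via the cgroup hierarchy. If a job is submitted with --mem=64G, it cannot use more than 64 GB: the kernel enforces the limit directly, and a process that exceeds it is terminated by the OOM killer. This assumes memory is scheduled as a consumable resource (for example, SelectTypeParameters=CR_Core_Memory), so that every job has a memory allocation to enforce.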
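When checking that enforcement is active, it helps to know the byte value to expect in the job’s cgroup memory limit. The sketch below converts a --mem style size into bytes; mem_to_bytes is a hypothetical helper, it assumes binary multiples (1G = 1024^3 bytes) and AllowedRAMSpace=100, and it treats a bare number as megabytes, as sbatch does.

```shell
#!/bin/sh
# mem_to_bytes SIZE   (for example 64G, 512M, 2T; a bare number means megabytes)
mem_to_bytes() {
    awk -v s="$1" 'BEGIN {
        unit = toupper(substr(s, length(s), 1))
        n = s + 0                          # numeric prefix of the size string
        if (unit == "K") mult = 1024
        else if (unit == "G") mult = 1024 ^ 3
        else if (unit == "T") mult = 1024 ^ 4
        else mult = 1024 ^ 2               # "M" or no suffix
        printf "%.0f\n", n * mult
    }'
}

# Expected limit in bytes for a job submitted with --mem=64G.
mem_to_bytes 64G
```

Inside a running job, this number can be compared with the memory limit file of the job’s cgroup; the exact path depends on the cgroup version and the site’s setup.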

GPU Isolation

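For GPUs, gres.conf describes the devices on each node; combined with ConstrainDevices=yes, a job can only open the GPUs it was allocated. NVIDIA MIG (Multi-Instance GPU) slices can be exposed in the same way: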

# GRES configuration (gres.conf)
AutoDetect=nvml
Name=gpu Type=a100 File=/dev/nvidia[0-7]

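# Requesting a MIG slice in a job script. The constraint assumes the administrator
# has defined a node feature named mig_3g.40gb on MIG-enabled nodes.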
#SBATCH --gres=gpu:a100:1
#SBATCH --constraint=mig_3g.40gb

Visibility and Data Access Control

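In multi-tenant systems, one group’s jobs should not be visible to another group, and groups should not be able to read each other’s output files. This is handled at two levels:

SLURM level: With the PrivateData parameter, regular users see only their own jobs, usage, and user records, while administrators and operators still see everything. Note that this filter is per user, not per group, so members of the same group also stop seeing each other’s jobs: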

# slurm.conf
PrivateData=jobs,usage,users

Filesystem level: Each group’s working directory should be protected with Unix group permissions or ACLs:

# Create group working directory
mkdir -p /scratch/bioinformatics
chown root:bioinformatics /scratch/bioinformatics
chmod 2770 /scratch/bioinformatics    # setgid bit: files inherit group

On parallel filesystems (Lustre, GPFS), quota management adds another layer:

# Lustre project quota
lfs setquota -p 1001 --block-softlimit 10T --block-hardlimit 12T /scratch

Practical Monitoring: System Administrator Perspective

Keeping a multi-tenant system healthy requires real-time monitoring. Some key commands:

# Fairshare status for all accounts and users, lowest FairShare factor first
sshare -a -n -P -o Account,User,RawShares,NormUsage,FairShare | sort -t'|' -k5 -n

# Queue listing with the account column (%a)
squeue -o "%.18i %.9P %.8j %.8u %.8a %.2t %.10M %.6D %R" | column -t

# CPU and GPU usage for the last 24 hours (by account)
sreport -T cpu,gres/gpu cluster AccountUtilizationByUser \
    Start=$(date -d '24 hours ago' +%Y-%m-%dT%H:%M) \
    End=$(date +%Y-%m-%dT%H:%M) \
    Format=Accounts,Login,TresName,Used

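# Queue wait time in seconds (Start minus Submit) for each completed job
sacct -a -X -n -P --starttime=2026-06-01 --state=COMPLETED \
    --format=Account,User,JobID,Submit,Start |
while IFS='|' read -r acct user jobid submit start; do
    echo "$acct $user $jobid $(( $(date -d "$start" +%s) - $(date -d "$submit" +%s) ))"
done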
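For an actual queue depth figure, the job list can be reduced to counts per account and state. The following is a small sketch: queue_depth is a hypothetical helper, and the sample lines stand in for real squeue output so the snippet runs without a cluster.

```shell
#!/bin/sh
# Reads "account state" pairs on stdin, such as the output of
#   squeue -h -o "%a %t"
# and prints one "account state count" line per combination.
queue_depth() {
    awk '{ count[$1 " " $2]++ }
         END { for (k in count) print k, count[k] }' | LC_ALL=C sort
}

# Sample input standing in for real squeue output (hypothetical jobs).
printf '%s\n' "ml_lab PD" "ml_lab PD" "ml_lab R" "bioinformatics R" | queue_depth
```

On a live system the sample printf would be replaced by the squeue command shown in the comment; PD and R are the compact codes for pending and running jobs.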

Feeding these outputs into a monitoring dashboard (Grafana + Prometheus + SLURM exporter) lets you spot problems, such as a starved account or a runaway job, while the affected jobs are still running.

Common Pitfalls and Solutions

Never updating fairshare values: Over years, group resource needs change. Review fairshare weights every six months.

QOS without limits: A QOS without a per-user resource cap (MaxTRESPerUser, formerly MaxCPUsPerUser) or MaxWall allows a single user to fill the entire cluster. Every QOS must have an upper bound.

Memory limits without cgroups: Without cgroup enforcement, --mem is only a scheduling request. Overruns are at best detected by periodic accounting polls (for example with JobAcctGatherParams=OverMemoryKill), not at the moment of violation, so a leaking job can still exhaust node memory in between. For real isolation, ConstrainRAMSpace=yes is mandatory.

Undocumented preemption policy: Users who don’t know their jobs can be preempted write code without checkpointing and lose hours of work on restart. Document clearly which QOS classes can preempt which others.
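Part of that documentation can be a template for a job that tolerates requeueing. The sketch below is one possible shape, not a definitive pattern: CKPT, TOTAL, and run_work are hypothetical placeholders, the "work unit" is left empty, and it assumes the batch shell receives SIGTERM before the job is killed.

```shell
#!/bin/sh
#SBATCH --requeue
# Sketch of a requeue-tolerant job. CKPT, TOTAL and the "work unit" are
# hypothetical placeholders, not part of any SLURM interface.
CKPT="${CKPT:-checkpoint.txt}"
TOTAL="${TOTAL:-5}"

run_work() {
    step=0
    # Resume from the last saved step if a checkpoint exists.
    if [ -f "$CKPT" ]; then step=$(cat "$CKPT"); fi
    while [ "$step" -lt "$TOTAL" ]; do
        # ... one unit of real work goes here ...
        step=$((step + 1))
        echo "$step" > "$CKPT"      # record progress after every unit
    done
    echo "finished at step $step"
}

# On SIGTERM, exit cleanly; the checkpoint written after the last
# completed unit is already on disk.
trap 'echo "terminated at step $step"; exit 0' TERM

run_work
```

If each work unit is a long-running command, running it in the background and calling wait lets the shell handle the signal promptly instead of after the command returns.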

Successfully operating a multi-tenant HPC cluster requires more than procuring hardware. Configuring SLURM’s accounting, fairshare, QOS, and cgroup layers together and consistently provides both system administrators and researchers with a predictable, fair, and isolated working environment.
