HPC Operations Guide: Daily Checks, Queue Management, Maintenance Windows

Running an HPC cluster is an ongoing operational discipline, not a one-time installation task. The daily, weekly, and monthly rhythms of operations work determine whether the cluster delivers reliable service to researchers or becomes a source of frustration for everyone involved.

Morning Health Check

Every HPC operations shift should start with a systematic health check. Automate it:

#!/bin/bash
# /usr/local/sbin/hpc-morning-check.sh
# Run daily at 07:00 via cron

echo "=== HPC Cluster Morning Health Check $(date) ==="
echo ""

echo "--- SLURM Node Status ---"
sinfo --format="%20N %8T %C %G %R" --sort=T
echo ""

echo "--- Nodes Not IDLE or ALLOCATED ---"
sinfo --state=down,drain,fail --noheader --format="%N %T %E" | grep -v "^$"
echo ""

echo "--- Job Queue Summary ---"
squeue --format="%T" --noheader | sort | uniq -c
echo ""

echo "--- Long-Pending Jobs (> 4h) ---"
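# squeue's %M (time used) stays at 0:00 for pending jobs, so filter on submit time (%V)
cutoff=$(date -d '4 hours ago' +%Y-%m-%dT%H:%M:%S)
squeue --state=PD --noheader \
  --format="%V %.8i %.9P %.8j %.8u %R" | \
  awk -v cutoff="$cutoff" '$1 < cutoff' | \
  sort | head -20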
echo ""

echo "--- Storage Health ---"
df -h /mnt/scratch /mnt/project /home | column -t
echo ""

echo "--- GPU Node Summary ---"
if command -v nvidia-smi &>/dev/null; then
  nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total \
    --format=csv,noheader 2>/dev/null | head -20
fi
echo ""

echo "--- Recent System Errors ---"
journalctl --since "yesterday" -p err --no-pager | tail -10
echo ""

echo "=== Check complete ==="

# Cron: send morning check to ops team at 7:00 AM
# (a crontab entry must be a single line, and % must be escaped as \%)
0 7 * * * /usr/local/sbin/hpc-morning-check.sh | mail -s "HPC Morning Check $(date +\%Y-\%m-\%d)" hpc-ops@example.com

Node Health Metrics

Track these metrics per compute node continuously:

Metric	Warning Threshold	Critical Threshold	Action
CPU temperature	> 75°C	> 85°C	Check cooling, throttle if critical
Memory ECC errors	> 0 correctable/day	Any uncorrectable	Drain node, schedule replacement
GPU temperature	> 80°C	> 88°C	Check thermal solution, reduce power limit
GPU ECC double-bit	n/a	> 0 (any)	Drain node immediately, GPU replacement
Disk usage (/)	> 80%	> 90%	Clean logs/tmp, expand if needed
Network port errors	> 10/day	> 100/day	Check cable/transceiver, replace if needed
SLURM node state	drain	down	Investigate, fix, resume
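
The warning and critical columns map onto a simple three-way classification. The following is a minimal sketch, assuming a hypothetical helper named classify_metric that takes an integer reading and the two thresholds; the function name and argument order are illustrative and not part of any monitoring tool.

```shell
#!/bin/bash
# classify_metric VALUE WARN CRIT   (hypothetical helper, integers only)
# Prints OK, WARNING, or CRITICAL using strict "greater than" comparisons,
# matching the "> threshold" wording in the table.
classify_metric() {
  local value=$1 warn=$2 crit=$3
  if [ "$value" -gt "$crit" ]; then
    echo "CRITICAL"
  elif [ "$value" -gt "$warn" ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Example readings checked against the table's thresholds
classify_metric 72 75 85   # CPU temperature of 72°C
classify_metric 83 80 88   # GPU temperature of 83°C
classify_metric 93 80 90   # root disk usage of 93%
```

A reading exactly at a threshold counts as the lower severity, because the table states its limits as "greater than".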

# Check GPU ECC errors on all GPU nodes
for node in $(sinfo -p gpu -N --noheader -o "%N" | sort -u); do
  ssh "$node" "nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total \
    --format=csv,noheader" | \
  while IFS=', ' read -r idx errors; do
    # Non-numeric values such as [N/A] (ECC disabled) fail the test and are skipped
    if [ "$errors" -gt 0 ] 2>/dev/null; then
      echo "ALERT: GPU $idx on $node has $errors uncorrected ECC errors"
    fi
  done
done

SLURM Queue Management

Partition Design

Partition design determines which jobs compete with each other and what resource limits apply:

# /etc/slurm/slurm.conf — partition definitions

# Fast turnaround for small, short jobs
PartitionName=debug \
  Nodes=cn[01-02] \
  MaxTime=00:30:00 \
  MaxNodes=2 \
  MaxCPUsPerNode=128 \
  State=UP \
  Priority=200

# Default partition for regular batch jobs
PartitionName=compute \
  Nodes=cn[01-64] \
  Default=YES \
  MaxTime=7-00:00:00 \
  State=UP \
  Priority=50

# GPU jobs
PartitionName=gpu \
  Nodes=gpu[01-08] \
  MaxTime=2-00:00:00 \
  State=UP \
  Priority=50

# Low-priority, preemptable (uses all idle capacity)
PartitionName=preemptable \
  Nodes=cn[01-64] \
  MaxTime=UNLIMITED \
  PreemptMode=REQUEUE \
  State=UP \
  Priority=1

QOS Management

Quality of Service controls per-user and per-account resource limits:

# Create standard QOS
sacctmgr add qos standard \
  Priority=10 \
  MaxWall=7-00:00:00 \
  MaxTRESPerUser=cpu=512,gres/gpu=4

# Create burst QOS (high priority, short duration)
sacctmgr add qos burst \
  Priority=50 \
  MaxWall=04:00:00 \
  MaxJobsPerUser=2 \
  Preempt=standard \
  PreemptMode=requeue

# Assign QOS to an account
sacctmgr modify account research set QOS=standard,burst

# Check current QOS assignments
sacctmgr show qos format=Name,Priority,MaxWall,MaxTRESPU

Diagnosing Pending Jobs

# Why is job 12345 pending?
squeue -j 12345 --start
scontrol show job 12345 | grep -E "Reason|Priority|NodeList"

# Show pending jobs with reasons grouped
squeue --state=PD --format="%R" --noheader | sort | uniq -c | sort -rn

# Typical PENDING reasons and remedies:
# Resources       → Not enough free nodes; wait or reduce job size
# Priority        → Higher-priority jobs ahead; wait
# QOSMaxCpuPerUserLimit → User hit their QOS CPU limit
# PartitionNodeLimit    → Job requests more nodes than partition allows
# ReqNodeNotAvail → Specific nodes requested are down/drained
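
The reason-to-remedy list above can be attached directly to the grouped squeue output. This is a sketch under stated assumptions: explain_reason and annotate_reasons are hypothetical helper names, and the remedy text is copied from the list above, not taken from SLURM itself.

```shell
#!/bin/bash
# explain_reason REASON   (hypothetical helper)
# Maps a SLURM pending reason to the remedy listed above. Accepts the
# reason with or without parentheses, since squeue's %R field prints
# pending reasons as "(Reason)".
explain_reason() {
  local reason=${1#\(}
  reason=${reason%\)}
  case "$reason" in
    Resources)          echo "Not enough free nodes; wait or reduce job size" ;;
    Priority)           echo "Higher-priority jobs ahead; wait" ;;
    QOSMaxCpuPerUserLimit|QOSMaxCPUPerUserLimit)
                        echo "User hit their QOS CPU limit" ;;
    PartitionNodeLimit) echo "Job requests more nodes than partition allows" ;;
    ReqNodeNotAvail*)   echo "Specific nodes requested are down/drained" ;;
    *)                  echo "No remedy listed; inspect with scontrol show job" ;;
  esac
}

# annotate_reasons   (hypothetical helper)
# Reads "COUNT REASON" lines (the uniq -c output) on stdin and appends
# the matching remedy to each line.
annotate_reasons() {
  local count reason
  while read -r count reason; do
    printf '%s %s: %s\n' "$count" "$reason" "$(explain_reason "$reason")"
  done
}
```

To use it, append `| annotate_reasons` to the grouped squeue pipeline shown above. Reasons outside the list fall through to the default branch instead of being guessed at.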

Backfill Tuning

Backfill scheduling fills gaps in the schedule by starting lower-priority jobs early, provided they will finish before the resources reserved for higher-priority jobs are needed:

# slurm.conf backfill settings
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_test=2000,bf_resolution=300,bf_max_time=120,bf_continue,bf_window=2880,bf_yield_interval=2000000

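bf_max_job_test=2000 means each backfill cycle evaluates up to 2000 pending jobs. bf_window=2880 makes the scheduler plan 2880 minutes (48 hours) ahead, and bf_resolution=300 tracks job start and end times at 300-second granularity. Larger bf_max_job_test and bf_window values improve backfill utilization at the cost of higher controller CPU usage. The SLURM documentation generally advises a window at least as long as the longest allowed time limit, so a cluster that permits 7-day jobs may need a larger bf_window, paired with a coarser bf_resolution to keep the scheduler's workload manageable.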
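
The unit conversion is easy to get wrong because bf_window is in minutes and bf_resolution is in seconds. As a rough illustration, the hypothetical helper bf_summary below converts a window to hours and counts how many resolution-sized steps it spans. Treating window divided by resolution as a proxy for scheduler workload is an assumption made here for intuition, not a formula from the SLURM documentation.

```shell
#!/bin/bash
# bf_summary WINDOW_MINUTES RESOLUTION_SECONDS   (hypothetical helper)
# Converts bf_window (minutes) to hours and counts how many
# bf_resolution-sized time steps the planning window spans.
bf_summary() {
  local window_min=$1 resolution_sec=$2
  local hours=$(( window_min / 60 ))
  local slots=$(( window_min * 60 / resolution_sec ))
  echo "window=${hours}h slots=${slots}"
}

bf_summary 2880 300     # prints: window=48h slots=576
bf_summary 10080 600    # prints: window=168h slots=1008
```

The second call shows that stretching the window to 7 days while doubling the resolution to 10 minutes less than doubles the number of steps.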

User Support: Common Issues

Node Failures During Jobs

# A node went down during a job — rescue the user's data
scontrol show job <jobid> | grep NodeList
# SSH to the remaining nodes and check for output files

# Drain the failed node for investigation
scontrol update NodeName=cn07 State=drain Reason="Network errors"

# After hardware fix, resume the node
scontrol update NodeName=cn07 State=resume

Debugging MPI Job Startup Failures

# Run with verbose MPI output
mpirun -np 128 -v --report-bindings ./simulation

# Check if all MPI processes started
# In the SLURM output file, count "Hello from rank X" lines
grep -c "Hello from rank" job_output.txt
# Should equal --ntasks

# Check for SLURM step errors
scontrol show steps <jobid>

Storage Full — Emergency Response

# Find largest files/directories consuming space
du -sh /mnt/scratch/* | sort -rh | head -20

# Rank users by scratch usage (assumes one top-level directory per user)
for dir in /mnt/scratch/*/; do
  du -sh "$dir" 2>/dev/null
done | sort -rh | head -10

# Gracefully notify affected user and extend time to clean up
# or enforce quota immediately via BeeGFS quota
# ($username must hold the affected user's login name)
beegfs-ctl --setquota --uid "$(id -u "$username")" \
  --sizelimit=5T --inodelimit=1000000 --mount=/mnt/scratch

Maintenance Windows

Plan maintenance windows during low-utilization periods (typically weekends):

# Step 1: Create the maintenance reservation well in advance, so the
# scheduler will not start jobs that would run into the window
scontrol create reservation \
  ReservationName=maintenance-2026-06-20 \
  StartTime=2026-06-20T22:00:00 \
  Duration=06:00:00 \
  Nodes=ALL \
  Flags=MAINT,IGNORE_JOBS \
  Users=root

# Step 2: At the start of the window, stop new job starts on the partition
# (repeat for the debug, gpu, and preemptable partitions)
scontrol update PartitionName=compute State=DRAIN

# Step 3: Wait for running jobs to complete (or set a time limit)
squeue --state=R --noheader | wc -l
# Wait until this shows 0

# Step 4: Perform maintenance (OS updates, hardware, firmware)
# The reboot module waits for each node to return; a bare "reboot" in a
# shell task drops the SSH connection and is reported as a failure
ansible compute_nodes -m shell -a "yum update -y"
ansible compute_nodes -m reboot

# Step 5: Return the partition to service
scontrol update PartitionName=compute State=UP

# Step 6: Verify all nodes came back (sinfo prints states in lowercase)
sinfo -N --noheader -o "%N %T" | grep -viE "idle|alloc|mix"
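
The final verification step can be made stricter than a grep by comparing each node's state exactly. This is a sketch assuming a hypothetical helper named unhealthy_nodes, fed with "NODE STATE" pairs such as those printed by `sinfo -N --noheader -o "%N %T"`:

```shell
#!/bin/bash
# unhealthy_nodes   (hypothetical helper)
# Reads "NODE STATE" lines on stdin and prints every node whose state is
# not idle, allocated, or mixed. Matching is case-insensitive, and state
# suffix flags such as "*" or "~" are ignored for the comparison.
# Duplicate lines (nodes listed once per partition) are collapsed.
unhealthy_nodes() {
  awk '{
    state = tolower($2)
    gsub(/[^a-z]/, "", state)
    if (state != "idle" && state != "allocated" && state != "mixed")
      print $1, $2
  }' | sort -u
}
```

Empty output means every node is back in a schedulable state. Any line printed names a node that still needs attention before the partition is reopened.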

Scratch Cleanup Policy

Automatic scratch cleanup prevents filesystem capacity issues:

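#!/bin/bash
# /usr/local/sbin/scratch-cleanup.sh
# Remove files in /mnt/scratch not accessed for more than 30 days

SCRATCH_DIR=/mnt/scratch
MAX_AGE_DAYS=30
LOG_FILE=/var/log/scratch-cleanup.log
# Files modified after this marker's mtime are never deleted
PROTECT_MARKER=/tmp/scratch-protect

echo "$(date): Starting scratch cleanup" >> "$LOG_FILE"

# find aborts if the -newer reference file is missing, so create it if needed
[ -e "$PROTECT_MARKER" ] || touch "$PROTECT_MARKER"

# Delete files not accessed for MAX_AGE_DAYS and not modified since the marker.
# -atime only works if the filesystem records access times (no noatime mount).
find "$SCRATCH_DIR" -xdev -type f -atime +"$MAX_AGE_DAYS" \
  -not -newer "$PROTECT_MARKER" -delete 2>> "$LOG_FILE"

# Remove empty directories, but keep the top-level per-user directories
find "$SCRATCH_DIR" -mindepth 2 -type d -empty -delete 2>> "$LOG_FILE"

echo "$(date): Scratch cleanup complete. Space: $(df -h "$SCRATCH_DIR" | tail -1)" >> "$LOG_FILE"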
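
Before enabling automatic deletion, it can help to preview what a run would remove. The sketch below assumes a hypothetical helper named list_stale_files that applies the same access-time test but only prints the matches:

```shell
#!/bin/bash
# list_stale_files DIR DAYS   (hypothetical helper)
# Prints files under DIR whose last access is more than DAYS days old,
# without deleting anything. Depends on the filesystem recording access
# times, so it will under-report on noatime mounts.
list_stale_files() {
  local dir=$1 days=$2
  find "$dir" -xdev -type f -atime +"$days" -print
}
```

For example, `list_stale_files /mnt/scratch 30 | wc -l` gives the number of files a cleanup pass would target, which is a useful sanity check before the first production run.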

Monthly KPI Review

Track these metrics monthly:

KPI	Source	Target
CPU utilization %	SLURM `sreport`	70–85%
GPU utilization %	DCGM Exporter / Grafana	> 60%
Median job wait time	`sacct` analysis	< 30 min (short jobs)
Jobs completed per week	`sacct --state=COMPLETED`	Trending up
Node availability	SLURM state history	> 98%
Storage fill rate	`df` trend	< 80% at month end
User-reported incidents	Helpdesk tickets	Trending down

Present these monthly to HPC stakeholders. Trends over time reveal scheduling policy problems, hardware degradation, and capacity planning needs before they become crises.
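
As one example of the "sacct analysis" behind the wait-time KPI, the median can be computed with standard tools once wait times are available in seconds. This sketch assumes they have already been extracted, for instance by subtracting each job's Submit time from its Start time in sacct output; the helper name median is hypothetical.

```shell
#!/bin/bash
# median   (hypothetical helper)
# Reads one integer per line on stdin and prints the median. For an even
# count it prints the mean of the two middle values, rounded down.
# Returns non-zero if the input is empty.
median() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else        print int((v[NR / 2] + v[NR / 2 + 1]) / 2)
    }'
}

# Example: wait times in seconds for five jobs
printf '%s\n' 120 45 3600 600 90 | median   # prints: 120
```

The median is used here, not the mean, because a single job that waited hours (3600 seconds in the example) would otherwise dominate the figure.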

Operational excellence in HPC is built from consistent processes and visibility. Contact Mevasis for HPC operations consulting, managed monitoring, and on-call support services.
