HPC Cluster Operations: Daily Health Checks, SLURM Queue Management, and Maintenance
Running an HPC cluster is an ongoing operational discipline, not a one-time installation task. The daily, weekly, and monthly rhythms of operations work determine whether the cluster delivers reliable service to researchers or becomes a source of frustration for everyone involved.
Morning Health Check
Every HPC operations shift should start with a systematic health check. Automate it:
#!/bin/bash
# /usr/local/sbin/hpc-morning-check.sh
# Run daily at 07:00 via cron
echo "=== HPC Cluster Morning Health Check $(date) ==="
echo ""
echo "--- SLURM Node Status ---"
sinfo --format="%20N %8T %C %G %R" --sort=T
echo ""
echo "--- Nodes Not IDLE or ALLOCATED ---"
sinfo --state=down,drain,fail --noheader --format="%N %T %R" | grep -v "^$"
echo ""
echo "--- Job Queue Summary ---"
squeue --noheader --format="%T" | sort | uniq -c
echo ""
echo "--- Long-Pending Jobs (> 4h) ---"
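# %V is the submit time in ISO 8601 form, so a string comparison against the cutoff works
cutoff=$(date -d '4 hours ago' +%Y-%m-%dT%H:%M:%S)
squeue --state=PD --noheader \
    --format="%i|%P|%j|%u|%V|%r" | \
    awk -F'|' -v cutoff="$cutoff" '$5 < cutoff' | \
    column -t -s'|' | head -20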
echo ""
echo "--- Storage Health ---"
df -h /mnt/scratch /mnt/project /home | column -t
echo ""
echo "--- GPU Node Summary ---"
if command -v nvidia-smi &>/dev/null; then
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total \
--format=csv,noheader 2>/dev/null | head -20
fi
echo ""
echo "--- Recent System Errors ---"
journalctl --since "yesterday" -p err --no-pager | tail -10
echo ""
echo "=== Check complete ==="
# Cron entry: mail the morning check to the ops team at 07:00.
# A crontab entry must sit on a single line, and % must be escaped as \%.
0 7 * * * /usr/local/sbin/hpc-morning-check.sh 2>&1 | mail -s "HPC Morning Check $(date +\%Y-\%m-\%d)" hpc-ops@example.com
Node Health Metrics
Track these metrics per compute node continuously:
| Metric | Warning Threshold | Critical Threshold | Action |
|---|---|---|---|
| CPU temperature | > 75°C | > 85°C | Check cooling, throttle if critical |
| Memory ECC errors | > 0 correctable/day | Any uncorrectable | Drain node, schedule replacement |
| GPU temperature | > 80°C | > 88°C | Check thermal solution, reduce power limit |
| GPU ECC double-bit | n/a | > 0 (any) | Drain node immediately, GPU replacement |
| Disk usage (/) | > 80% | > 90% | Clean logs/tmp, expand if needed |
| Network port errors | > 10/day | > 100/day | Check cable/transceiver, replace if needed |
| SLURM node state | drain | down | Investigate, fix, resume |
# Check uncorrected (double-bit) GPU ECC errors on all GPU nodes
for node in $(sinfo -p gpu -N --noheader -o "%N"); do
    ssh -o BatchMode=yes "$node" "nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total \
        --format=csv,noheader" </dev/null | \
    while IFS=', ' read -r idx errors; do
        # Skip GPUs that report [N/A] (ECC disabled or unsupported)
        case "$errors" in ''|*[!0-9]*) continue ;; esac
        if [ "$errors" -gt 0 ]; then
            echo "ALERT: GPU $idx on $node has $errors uncorrected ECC errors"
        fi
    done
done
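The temperature thresholds in the table can be applied mechanically in the same way. The following is a minimal sketch, not a finished monitoring check: `classify_gpu_temps` is a hypothetical helper name, and it assumes its input is the CSV produced by `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits`. The default limits are the 80°C and 88°C values from the GPU temperature row.

```shell
#!/bin/bash
# Classify GPU temperatures read from stdin as OK / WARNING / CRITICAL.
# Input: CSV lines "index, temperature".
# classify_gpu_temps is a hypothetical helper; $1 and $2 override the thresholds.
classify_gpu_temps() {
    warn="${1:-80}"
    crit="${2:-88}"
    awk -F',' -v warn="$warn" -v crit="$crit" '
        NF < 2 { next }                       # ignore blank or malformed lines
        {
            gsub(/[[:space:]]/, "", $1)
            gsub(/[[:space:]]/, "", $2)
            # Non-numeric readings (for example [N/A]) are reported, not compared
            if ($2 !~ /^[0-9]+$/) { print "GPU " $1 ": UNKNOWN (" $2 ")"; next }
            status = "OK"
            if ($2 + 0 > crit) status = "CRITICAL"
            else if ($2 + 0 > warn) status = "WARNING"
            print "GPU " $1 ": " status " (" $2 "C)"
        }'
}
```

On a GPU node this could be fed directly, for example `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | classify_gpu_temps`, or wrapped in the same per-node `ssh` loop used for the ECC check.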
SLURM Queue Management
Partition Design
Partition design determines which jobs compete with each other and what resource limits apply:
# /etc/slurm/slurm.conf — partition definitions
# Fast turnaround for small, short jobs
PartitionName=debug \
Nodes=cn[01-02] \
MaxTime=00:30:00 \
MaxNodes=2 \
MaxCPUsPerNode=128 \
State=UP \
Priority=200
# Default partition for regular batch jobs
PartitionName=compute \
Nodes=cn[01-64] \
Default=YES \
MaxTime=7-00:00:00 \
State=UP \
Priority=50
# GPU jobs
PartitionName=gpu \
Nodes=gpu[01-08] \
MaxTime=2-00:00:00 \
State=UP \
Priority=50
# Low-priority, preemptable (uses all idle capacity)
PartitionName=preemptable \
Nodes=cn[01-64] \
MaxTime=UNLIMITED \
PreemptMode=REQUEUE \
State=UP \
Priority=1
QOS Management
Quality of Service controls per-user and per-account resource limits:
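# Create standard QOS (per-user limits are expressed as TRES;
# the GPU limit requires gres/gpu to be listed in AccountingStorageTRES)
sacctmgr add qos standard \
    Priority=10 \
    MaxWall=7-00:00:00 \
    MaxTRESPerUser=cpu=512,gres/gpu=4
# Create burst QOS (high priority, short duration)
# Note: QOS preemption only takes effect with PreemptType=preempt/qos in slurm.conf
sacctmgr add qos burst \
    Priority=50 \
    MaxWall=04:00:00 \
    MaxJobsPerUser=2 \
    Preempt=standard \
    PreemptMode=requeue
# Allow an account to use both QOS levels
sacctmgr modify account research set QOS=standard,burst
# Review QOS definitions and limits
sacctmgr show qos format=Name,Priority,MaxWall,MaxTRESPU,MaxJobsPU,Preempt
# Review which QOS each association may use
sacctmgr show assoc format=Account,User,QOS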
Diagnosing Pending Jobs
# Why is job 12345 pending?
squeue -j 12345 --start
scontrol show job 12345 | grep -E "Reason|Priority|NodeList"
# Show pending jobs with reasons grouped
squeue --state=PD --format="%R" --noheader | sort | uniq -c | sort -rn
# Typical PENDING reasons and remedies:
# Resources → Not enough free nodes; wait or reduce job size
# Priority → Higher-priority jobs ahead; wait
# QOSMaxCPUPerUserLimit → User hit their QOS CPU limit
# PartitionNodeLimit → Job requests more nodes than partition allows
# ReqNodeNotAvail → Specific nodes requested are down/drained
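The grouped reason counts become more useful to a helpdesk when each reason carries its remedy. The sketch below pairs the two. `summarize_pending_reasons` is a hypothetical helper name, and it assumes its input is the output of `squeue --state=PD --noheader --format="%R"`, where each reason is printed in parentheses and may carry extra detail after a comma.

```shell
#!/bin/bash
# Group pending reasons and attach the remedy hints listed above.
# Input on stdin: one reason per line, e.g. "(Resources)" or
# "(ReqNodeNotAvail, UnavailableNodes:cn07)".
# summarize_pending_reasons is a hypothetical helper name.
summarize_pending_reasons() {
    # Drop blank lines, strip the parentheses and any detail after the reason name
    sed -e '/^[[:space:]]*$/d' -e 's/^[[:space:]]*(//' -e 's/[,)].*$//' |
    sort | uniq -c | sort -k1,1nr -k2,2 |
    while read -r count reason; do
        case "$reason" in
            Resources)          hint="not enough free nodes; wait or reduce job size" ;;
            Priority)           hint="higher-priority jobs ahead; wait" ;;
            QOSMaxCpuPerUserLimit|QOSMaxCPUPerUserLimit)
                                hint="user hit their QOS CPU limit" ;;
            PartitionNodeLimit) hint="job requests more nodes than the partition allows" ;;
            ReqNodeNotAvail)    hint="requested nodes are down or drained" ;;
            *)                  hint="see the reason codes in the squeue man page" ;;
        esac
        printf '%5d  %-24s %s\n' "$count" "$reason" "$hint"
    done
}
```

A typical call would be `squeue --state=PD --noheader --format="%R" | summarize_pending_reasons`. Reasons not covered by the list above fall through to the generic hint rather than being guessed at.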
Backfill Tuning
Backfill scheduling starts lower-priority jobs early when doing so will not delay the expected start of any higher-priority job, so small, short jobs fill the gaps left while large jobs wait for their resources:
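# slurm.conf backfill settings
# Units: bf_resolution and bf_max_time in seconds, bf_window in minutes,
# bf_yield_interval in microseconds
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_test=2000,bf_resolution=300,bf_max_time=120,bf_continue,bf_window=2880,bf_yield_interval=2000000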
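bf_max_job_test=2000 lets each backfill cycle evaluate up to 2000 pending jobs, and bf_window=2880 (minutes) makes the scheduler plan 48 hours ahead. Larger values improve backfill utilization at the cost of higher controller CPU usage. Slurm's documentation advises a bf_window at least as long as the longest allowed time limit to avoid starving long jobs, so with the 7-day MaxTime on the compute partition a window of 10080 minutes is worth considering, together with a coarser bf_resolution to keep the planning data manageable.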
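Because bf_window is given in minutes while partition limits use Slurm's `D-HH:MM:SS` notation, comparing the two is easy to get wrong by hand. The sketch below does the conversion. `slurm_time_to_minutes` is a hypothetical helper name, and the accepted formats are those documented for `sbatch --time`.

```shell
#!/bin/bash
# Convert a Slurm time limit to whole minutes so it can be compared with bf_window.
# Accepted forms: MM, MM:SS, HH:MM:SS, D-HH, D-HH:MM, D-HH:MM:SS.
# UNLIMITED/INFINITE are passed through unchanged.
# slurm_time_to_minutes is a hypothetical helper name.
slurm_time_to_minutes() {
    case "$1" in
        UNLIMITED|INFINITE) echo "$1"; return 0 ;;
    esac
    echo "$1" | awk '
    {
        days = 0; rest = $0
        dash = index(rest, "-")
        if (dash) { days = substr(rest, 1, dash - 1); rest = substr(rest, dash + 1) }
        n = split(rest, p, ":")
        h = 0; m = 0; s = 0
        if (dash) {                # D-HH[:MM[:SS]]
            h = p[1]; if (n >= 2) m = p[2]; if (n >= 3) s = p[3]
        } else if (n == 1) {       # MM
            m = p[1]
        } else if (n == 2) {       # MM:SS
            m = p[1]; s = p[2]
        } else {                   # HH:MM:SS
            h = p[1]; m = p[2]; s = p[3]
        }
        # Any leftover seconds round the result up to the next minute
        print days * 1440 + h * 60 + m + (s + 0 > 0 ? 1 : 0)
    }'
}

# Example: compare the compute partition MaxTime with the configured bf_window
max_time=$(slurm_time_to_minutes "7-00:00:00")
bf_window=2880
if [ "$max_time" -gt "$bf_window" ]; then
    echo "MaxTime (${max_time} min) exceeds bf_window (${bf_window} min)"
fi
```

With the values used in this guide, the check reports that the 7-day limit is longer than the 48-hour backfill window.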
User Support: Common Issues
Node Failures During Jobs
# A node went down during a job — rescue the user's data
scontrol show job <jobid> | grep NodeList
# SSH to the remaining nodes and check for output files
# Drain the failed node for investigation
scontrol update NodeName=cn07 State=drain Reason="Network errors"
# After hardware fix, resume the node
scontrol update NodeName=cn07 State=resume
Debugging MPI Job Startup Failures
# Run with verbose MPI output
mpirun -np 128 -v --report-bindings ./simulation
# Check if all MPI processes started
# (assumes the application prints one "Hello from rank X" line per rank)
grep -c "Hello from rank" job_output.txt
# Should equal --ntasks
# Check for SLURM step errors
scontrol show steps <jobid>
Storage Full — Emergency Response
# Find the largest top-level directories consuming space
du -sh /mnt/scratch/* 2>/dev/null | sort -rh | head -20
# Rank per-user scratch usage (assumes one top-level directory per user)
for dir in /mnt/scratch/*/; do
    du -sh "$dir" 2>/dev/null | awk -v u="$(basename "$dir")" '{print $1, u}'
done | sort -rh | head -10
# Notify the affected user and give them time to clean up,
# or enforce a quota immediately via BeeGFS quota
# (beegfs-ctl 7.x syntax; set $username to the affected user first)
beegfs-ctl --setquota --uid "$username" \
    --sizelimit=5T --inodelimit=1000000 --mount=/mnt/scratch
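An emergency is easier to avoid if fill levels are flagged before the filesystem is actually full. The sketch below classifies `df -P` output. `fs_fill_status` is a hypothetical helper name, and reusing the 80% and 90% levels from the disk usage row of the node health table for shared storage is an assumption, not a recommendation from the filesystem vendor. Mount points containing spaces are not handled.

```shell
#!/bin/bash
# Flag filesystems above warning/critical fill levels.
# Input on stdin: output of `df -P <mountpoints>` (POSIX format, one line per filesystem).
# fs_fill_status is a hypothetical helper; $1 and $2 override the thresholds.
fs_fill_status() {
    warn="${1:-80}"
    crit="${2:-90}"
    awk -v warn="$warn" -v crit="$crit" '
        NR == 1 { next }                      # skip the df header line
        {
            use = $5; sub(/%$/, "", use)      # "83%" -> 83
            status = "OK"
            if (use + 0 > crit) status = "CRITICAL"
            else if (use + 0 > warn) status = "WARNING"
            print status, use "%", $6
        }'
}
```

For the mounts used in this guide, a call would look like `df -P /mnt/scratch /mnt/project /home | fs_fill_status`.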
Maintenance Windows
Plan maintenance windows during low-utilization periods (typically weekends):
# Step 1: Reserve the window well in advance. The scheduler will not start jobs whose
# time limit overlaps the reservation, and users can see it via "scontrol show reservation".
scontrol create reservation \
    ReservationName=maintenance-2026-06-20 \
    StartTime=2026-06-20T22:00:00 \
    Duration=06:00:00 \
    Nodes=ALL \
    Flags=MAINT,IGNORE_JOBS \
    Users=root
# Step 2: At the start of the window, drain the partition so no new jobs start
scontrol update PartitionName=compute State=DRAIN
# Step 3: Confirm nothing is still running (wait, or set a time limit, until this shows 0)
squeue --state=R --noheader | wc -l
# Step 4: Perform maintenance (OS updates, hardware, firmware)
ansible compute_nodes -b -m shell -a "yum update -y"
ansible compute_nodes -b -m reboot
# Step 5: Return the partition to service and release the reservation
scontrol update PartitionName=compute State=UP
scontrol delete ReservationName=maintenance-2026-06-20
# Step 6: List any nodes that did not come back healthy (sinfo prints states in lowercase)
sinfo -N --noheader -o "%N %T %E" | grep -Ev " (idle|allocated|mixed) "
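Waiting for running jobs to finish is the step most often done by hand, and it benefits from a deadline. The following sketch polls `squeue` until a partition has no running jobs or a timeout expires. `wait_for_idle_partition` is a hypothetical helper name, and the default timeout and poll interval are arbitrary assumptions to be adjusted per site.

```shell
#!/bin/bash
# Poll until no jobs are running in a partition, or give up after a deadline.
# wait_for_idle_partition is a hypothetical helper; arguments:
#   $1 partition name, $2 timeout in seconds, $3 poll interval in seconds
wait_for_idle_partition() {
    partition="$1"
    timeout="${2:-3600}"
    interval="${3:-60}"
    waited=0
    while :; do
        # One line per running job; tr strips padding some wc implementations add
        running=$(squeue --partition="$partition" --states=RUNNING --noheader | wc -l | tr -d ' ')
        if [ "$running" -eq 0 ]; then
            echo "Partition $partition is idle after ${waited}s"
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out: $running job(s) still running in $partition" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

In a maintenance script this could gate the update step, for example `wait_for_idle_partition compute 7200 60 && echo "safe to proceed"`. What to do on timeout (extend the wait, or cancel the remaining jobs) is a policy decision the sketch leaves open.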
Scratch Cleanup Policy
Automatic scratch cleanup prevents filesystem capacity issues:
#!/bin/bash
# /usr/local/sbin/scratch-cleanup.sh
# Remove files in /mnt/scratch that have been untouched for more than 30 days
SCRATCH_DIR=/mnt/scratch
MAX_AGE_DAYS=30
LOG_FILE=/var/log/scratch-cleanup.log
echo "$(date): Starting scratch cleanup" >> "$LOG_FILE"
# Delete files that were neither read nor modified within MAX_AGE_DAYS.
# -atime is only meaningful if the filesystem records access times (not mounted noatime).
find "$SCRATCH_DIR" -xdev -type f -atime +"$MAX_AGE_DAYS" -mtime +"$MAX_AGE_DAYS" \
    -delete 2>> "$LOG_FILE"
# Remove empty directories, but keep each user's top-level directory
find "$SCRATCH_DIR" -mindepth 2 -type d -empty -delete 2>> "$LOG_FILE"
echo "$(date): Scratch cleanup complete. Space: $(df -h "$SCRATCH_DIR" | tail -1)" >> "$LOG_FILE"
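Before a deletion policy goes live, it helps to see what it would remove. The sketch below is a dry run that only lists candidates. `list_stale_files` is a hypothetical helper name, and treating a file as stale only when both its access and modification times are older than the cutoff is an assumption about the intended policy.

```shell
#!/bin/bash
# Dry run: list what a cleanup pass would delete, without deleting anything.
# list_stale_files is a hypothetical helper; $1 is the directory, $2 the age in days.
list_stale_files() {
    dir="$1"
    days="${2:-30}"
    # A file counts as stale only if it was neither read nor modified within the window
    find "$dir" -xdev -type f -atime +"$days" -mtime +"$days" -print
}
```

Running `list_stale_files /mnt/scratch 30 | wc -l` gives a quick count of affected files, and the full list can be mailed to users ahead of the first real cleanup.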
Monthly KPI Review
Track these metrics monthly:
| KPI | Source | Target |
|---|---|---|
| CPU utilization % | SLURM sreport | 70–85% |
| GPU utilization % | DCGM Exporter / Grafana | > 60% |
| Median job wait time | sacct analysis | < 30 min (short jobs) |
| Jobs completed per week | sacct --state=COMPLETED | Trending up |
| Node availability | SLURM state history | > 98% |
| Storage fill rate | df trend | < 80% at month end |
| User-reported incidents | Helpdesk tickets | Trending down |
Present these monthly to HPC stakeholders. Trends over time reveal scheduling policy problems, hardware degradation, and capacity planning needs before they become crises.
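The median wait time KPI can be derived from accounting data with a few lines of shell. This is a minimal sketch rather than a reporting tool: `median_wait_minutes` is a hypothetical helper name, it assumes input lines of the form `Submit|Start` as printed by `sacct -a -X -n -P -o Submit,Start` for the period of interest, and it relies on GNU `date` to parse the timestamps.

```shell
#!/bin/bash
# Median queue wait in minutes from "Submit|Start" timestamp pairs on stdin.
# median_wait_minutes is a hypothetical helper; requires GNU date for `date -d`.
median_wait_minutes() {
    while IFS='|' read -r submit start; do
        # Skip jobs that never started (sacct prints Unknown or None instead of a timestamp)
        case "$start" in [0-9]*) ;; *) continue ;; esac
        s=$(date -d "$submit" +%s) || continue
        e=$(date -d "$start" +%s) || continue
        echo $(( (e - s) / 60 ))
    done | sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) { print "no data"; exit }
            if (NR % 2) print v[(NR + 1) / 2]
            else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}
```

To match the "short jobs" target in the table, the `sacct` query feeding this function would need to be restricted to short jobs, for example by partition or time limit. That filter is site-specific and not shown here.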
Operational excellence in HPC is built from consistent processes and visibility.