HPC KPI Guide: Utilization, Throughput, Queue Time, Prometheus, Grafana

You cannot improve what you do not measure. HPC cluster managers who rely on intuition rather than data consistently miss optimization opportunities and fail to detect degradation before users complain. A structured KPI framework gives HPC operations teams the visibility to make evidence-based decisions about scheduling policies, hardware expansion, and operational priorities.

The Five KPI Categories

1. Resource Utilization

CPU Utilization (%): Fraction of available CPU cores that are allocated to running jobs.

CPU Utilization = (Allocated Core-Hours) / (Total Available Core-Hours) × 100

Target: 70–85%. Below 70% suggests insufficient workload, over-provisioning, or scheduling inefficiency. Above 90% sustained means the cluster is saturated and queue wait times are rising.
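As a minimal sketch of the formula and target bands above, the following Python computes utilization from core-hour totals and labels the result. The function names, the "busy" label for the 85–90% gap, and the sample cluster figures are illustrative assumptions, not values from a real system.

```python
def cpu_utilization(allocated_core_hours, total_core_hours):
    """Allocated core-hours as a percentage of available core-hours."""
    if total_core_hours <= 0:
        raise ValueError("total_core_hours must be positive")
    return 100 * allocated_core_hours / total_core_hours


def classify_utilization(percent):
    """Map a utilization percentage onto the bands described in the text."""
    if percent < 70:
        return "under-utilized"
    if percent <= 85:
        return "on target"
    if percent <= 90:
        return "busy"          # assumed label for the gap between the two bands
    return "saturated"


# Hypothetical cluster: 100 nodes x 64 cores over a 7-day week.
total = 100 * 64 * 24 * 7      # core-hours available
allocated = 860_160            # core-hours allocated to jobs (made-up figure)
pct = cpu_utilization(allocated, total)
print(f"{pct:.1f}% -> {classify_utilization(pct)}")
```

In practice the two inputs would come from accounting data for the same time window, so that the numerator and denominator cover identical periods.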

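GPU Utilization (%): Fraction of the sample period during which at least one kernel was executing on the GPU, as reported by DCGM. It does not show how many CUDA cores were busy, so a high value alone does not prove the GPU is used efficiently:

# DCGM metric exposed by DCGM Exporter: DCGM_FI_DEV_GPU_UTIL
# Ad-hoc spot check on a node: 12 utilization samples, 5 seconds apart
nvidia-smi dmon -s u -d 5 -c 12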

GPU utilization below 60% for training jobs often indicates a data loading bottleneck (storage or preprocessing can’t feed the GPU fast enough).

Memory Utilization per Job: Fraction of requested memory actually used:

# Check memory efficiency of a completed job
seff 12345
# Output shows:
#   Memory Utilized: 124.3 GB
#   Memory Efficiency: 81.03% of 153.5 GB

Memory efficiency below 50% indicates over-requesting, which wastes capacity because the unused memory cannot be given to other jobs. To record memory and GPU usage in the accounting database, set AccountingStorageTRES=cpu,mem,node,gres/gpu in slurm.conf.

2. Job Throughput

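Jobs Completed per Day: Total number of jobs that reached the COMPLETED state in a 24-hour period:

# --allusers includes every user's jobs; --allocations prints one line per job instead of one per step
sacct --allusers --allocations \
      --starttime=$(date -d "24 hours ago" +%Y-%m-%dT%H:%M) \
      --endtime=now \
      --state=COMPLETED \
      --format=JobID \
      --noheader | wc -l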

Tracking this figure over time reveals changes in research productivity and workload composition.

Core-Hours Delivered per Week: Total compute work completed:

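# --allocations avoids counting the CPU time of job steps on top of their parent job
sacct --allusers --allocations \
      --starttime=2026-06-12 \
      --endtime=2026-06-19 \
      --state=COMPLETED \
      --format=CPUTimeRAW \
      --noheader | \
  awk '{sum += $1/3600} END {printf "%.0f core-hours\n", sum}'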

3. Queue Wait Time

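Median Job Wait Time: Time from submission to start for the median job:

sacct --allusers --allocations \
      --starttime=2026-06-01 \
      --state=COMPLETED \
      --format=Submit,Start \
      --noheader --parsable2 | \
  gawk -F'|' '
    # Convert an ISO timestamp (YYYY-MM-DDTHH:MM:SS) to epoch seconds
    function epoch(ts) { gsub(/[-T:]/, " ", ts); return mktime(ts) }
    { print epoch($2) - epoch($1) }   # wait time in seconds
  ' | sort -n | \
  awk '{ w[NR] = $1 } END { if (NR) printf "median wait: %.1f min\n", w[int((NR + 1) / 2)] / 60 }'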

In Prometheus/Grafana:

# Mean job wait time in seconds per partition
# (the metric name depends on the SLURM exporter in use; a mean is not the median)
avg(slurm_job_wait_time_seconds) by (partition)

Target values by job size:

Job Size	Acceptable Wait Time (80th percentile)
< 1 node, < 1h	< 15 minutes
1–8 nodes, < 4h	< 1 hour
> 8 nodes, > 4h	< 4 hours
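The targets above can be checked mechanically once wait times are grouped by job size. The sketch below is one way to do that, assuming wait times in minutes have already been bucketed into three classes; the class labels, the nearest-rank percentile method, and the sample data are all illustrative choices rather than part of any SLURM tool.

```python
import math

# 80th-percentile targets from the table, in minutes.
# The class labels "small", "medium", "large" are illustrative names.
TARGET_P80_MIN = {"small": 15, "medium": 60, "large": 240}


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_targets(waits_by_class):
    """Return {class: (p80_minutes, target_met)} for each job-size class."""
    results = {}
    for name, waits in waits_by_class.items():
        p80 = percentile(waits, 80)
        results[name] = (p80, p80 <= TARGET_P80_MIN[name])
    return results


# Made-up wait times in minutes, ten jobs per class.
sample = {
    "small": [2, 3, 5, 8, 12, 14, 20, 4, 6, 9],
    "medium": [10, 25, 30, 45, 50, 55, 70, 90, 20, 35],
    "large": [60, 120, 180, 200, 300, 400, 90, 150, 210, 500],
}
for name, (p80, ok) in check_targets(sample).items():
    print(f"{name}: p80 = {p80} min, target {'met' if ok else 'missed'}")
```

With this sample data the small and medium classes meet their targets and the large class misses its 4-hour target.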

4. System Efficiency

Job-Level Efficiency: The seff command shows how much of the CPU time and memory allocated to a job was actually used:

$ seff 9991234
Job ID: 9991234
Cluster: hpc-cluster
User/Group: alice/research
State: COMPLETED (exit code 0)
Cores: 64
CPU Utilized: 02-13:07:53
CPU Efficiency: 93.34% of 02-17:29:36 core-walltime
Job Wall-clock time: 01:01:24
Memory Utilized: 186.5 GB
Memory Efficiency: 72.86% of 256.0 GB

CPU efficiency below 80% for parallel jobs usually points to one of the following:

Excessive MPI communication overhead (for example, an InfiniBand problem)
Load imbalance across MPI ranks
I/O blocking (storage bottleneck)
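The CPU efficiency figure that seff reports is CPU time used divided by cores times wall-clock time. A minimal Python sketch of that arithmetic follows, assuming durations in the [D-]HH:MM:SS form; the function names and the sample job are hypothetical, and fractional seconds are not handled.

```python
def slurm_duration_to_seconds(text):
    """Convert '[D-]HH:MM:SS' (or 'MM:SS') into whole seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    while len(parts) < 3:          # pad 'MM:SS' to 'HH:MM:SS'
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_utilized, wall_clock, cores):
    """CPU time used as a percentage of cores x wall-clock time."""
    core_walltime = cores * slurm_duration_to_seconds(wall_clock)
    return 100 * slurm_duration_to_seconds(cpu_utilized) / core_walltime


# Hypothetical job: 16 cores, 2 hours of wall-clock time, 24 hours of CPU time.
print(f"{cpu_efficiency('1-00:00:00', '02:00:00', 16):.1f}% CPU efficiency")
```

A job at 75% like this one would fall below the 80% threshold and be a candidate for a closer look at communication, load balance, or I/O.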

Cluster Throughput Efficiency: How much of theoretical peak throughput is delivered as useful computation:

Efficiency = (CPU-Hours consumed by COMPLETED jobs) / (Total CPU-Hours available) × 100

This differs from utilization, which counts every allocated core-hour. A node running an I/O-bound job at 40% CPU has 100% of its hours “allocated” but only 40% efficiency, and core-hours spent on jobs that fail or are cancelled add nothing to the numerator.
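To make the distinction concrete, here is a small sketch that computes both ratios for the same set of jobs. It assumes efficiency counts only CPU time actually consumed by jobs that ended in the COMPLETED state, which is one reading of the formula above; the Job record and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    state: str                   # final job state, e.g. "COMPLETED" or "FAILED"
    allocated_core_hours: float  # cores x wall-clock hours reserved for the job
    used_cpu_hours: float        # CPU time the job actually consumed


def utilization(jobs, available_core_hours):
    """Allocation-based utilization: every allocated core-hour counts."""
    return 100 * sum(j.allocated_core_hours for j in jobs) / available_core_hours


def throughput_efficiency(jobs, available_core_hours):
    """Only CPU time consumed by COMPLETED jobs counts as useful work."""
    useful = sum(j.used_cpu_hours for j in jobs if j.state == "COMPLETED")
    return 100 * useful / available_core_hours


jobs = [
    Job("COMPLETED", 400, 380),  # well-behaved job
    Job("COMPLETED", 300, 120),  # I/O-bound job at 40% CPU
    Job("FAILED", 200, 190),     # busy, but produced no result
]
available = 1000
print(f"utilization: {utilization(jobs, available):.1f}%")
print(f"throughput efficiency: {throughput_efficiency(jobs, available):.1f}%")
```

The same cluster looks 90% utilized and 50% efficient, which is why both numbers belong on the dashboard.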

5. Availability and SLA

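Mean Time Between Failures (MTBF): Average operating time between node failures, calculated as total node operating hours divided by the number of failures. Drain events in the controller log give a rough failure count, although they may include nodes that administrators drained on purpose:

# Count drain events per day (log path is set by SlurmctldLogFile in slurm.conf;
# assumes the default [YYYY-MM-DDTHH:MM:SS.mmm] timestamp prefix)
grep "DRAIN" /var/log/slurmctld.log | \
  awk '{print substr($1, 2, 10)}' | \
  sort | uniq -c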

System Availability (%):

Availability = (Scheduled Uptime - Unplanned Downtime) / Scheduled Uptime × 100

Target: 99.0–99.9% for research clusters. For production HPC supporting business processes, 99.9% or higher.
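The availability formula and the MTBF definition reduce to a few lines of arithmetic. The sketch below shows them together with the downtime budget implied by a target; the function names and the sample month are assumptions made for illustration.

```python
def availability(scheduled_hours, unplanned_downtime_hours):
    """Availability in percent over a scheduled-uptime window."""
    return 100 * (scheduled_hours - unplanned_downtime_hours) / scheduled_hours


def downtime_budget_hours(scheduled_hours, target_percent):
    """Unplanned downtime allowed while still meeting the availability target."""
    return scheduled_hours * (100 - target_percent) / 100


def mtbf_hours(operating_node_hours, failures):
    """Mean time between failures in node-hours; infinite if nothing failed."""
    if failures == 0:
        return float("inf")
    return operating_node_hours / failures


# Hypothetical 30-day month (720 h) on a 200-node cluster.
print(f"availability: {availability(720, 3.6):.2f}%")
print(f"budget at 99.9%: {downtime_budget_hours(720, 99.9) * 60:.1f} min")
print(f"MTBF: {mtbf_hours(200 * 720, 4):.0f} node-hours")
```

Expressing the target as a downtime budget (about 43 minutes per month at 99.9%) tends to be easier to reason about during incident reviews than the percentage itself.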

SLURM Node State Distribution:

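# Quick cluster health snapshot: one line per node (deduplicated across partitions), then count by state
sinfo --noheader --Node --format="%N %T" | sort -u | awk '{print $2}' | sort | uniq -c | sort -rn
# Shows the number of nodes in each state: idle, allocated, mixed, drained, down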

Prometheus + Grafana + XDMoD Monitoring Stack

Prometheus collects metrics from three exporters:

# prometheus.yml
scrape_configs:
  - job_name: slurm
    static_configs:
      - targets: ['slurm-controller:9341']
    scrape_interval: 30s

  - job_name: dcgm
    static_configs:
      - targets: ['gpu-node-01:9400', 'gpu-node-02:9400']
    scrape_interval: 10s

  - job_name: node
    static_configs:
      - targets: ['cn01:9100', 'cn02:9100']
    scrape_interval: 15s

Key PromQL queries for dashboards:

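# Metric and label names below depend on the SLURM exporter in use; adjust them to match yours.

# Cluster-wide CPU utilization
sum(slurm_cpus_allocated) / sum(slurm_cpus_total) * 100

# Jobs pending for more than 1 hour
# ("and" only matches series with identical label sets; otherwise add on(<shared labels>))
count(slurm_job_state{state="pending"} and slurm_job_wait_time_seconds > 3600)

# Average GPU memory utilization
# (used + free avoids depending on DCGM_FI_DEV_FB_TOTAL, which DCGM Exporter may not export by default)
avg(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100

# Nodes in DRAIN state (problem indicator)
count(slurm_node_state{state="drain"})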

XDMoD (Open XDMoD) provides HPC-specific analytics that Grafana alone does not easily support:

Job efficiency reports by user, account, and application
Resource allocation vs. actual usage comparison
Wait time trends and SLA compliance reporting
User-level CPU efficiency histograms

# Import SLURM accounting data into XDMoD (the helper runs sacct and shreds its output)
xdmod-slurm-helper -r hpc-cluster \
    --start-time 2026-06-01T00:00:00 --end-time 2026-06-19T23:59:59
xdmod-ingestor

Common Measurement Mistakes

Mistake 1: Measuring allocation instead of utilization. An allocated 64-core node that runs a single-threaded job shows under 2% CPU efficiency but 100% “utilization” by allocation metrics. Track both.

Mistake 2: Ignoring queue wait time distribution. Reporting only the mean wait time hides bimodal distributions where small jobs wait < 5 minutes but large jobs wait > 24 hours. Report the 50th, 90th, and 99th percentiles.

Mistake 3: Not separating GPU utilization from GPU allocation. A GPU allocated to a job that is still loading or pre-processing data shows 0% utilization even though the job is “allocated” and “running”. Slow pre-processing code can keep wasting GPU cycles this way for the whole job.

Mistake 4: Capacity planning based on peak, not average. A 98th-percentile utilization of 95% does not mean you need double the capacity. It usually means a modest increase, on the order of 5–10%, is enough to reduce queue wait time during peak periods.
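Mistake 2 is easy to demonstrate with numbers. The sketch below uses Python's statistics module on a made-up bimodal sample (80 small jobs waiting 3 minutes, 20 large jobs waiting 25 hours) and compares the mean with the percentiles; the helper name and the data are illustrative assumptions.

```python
import statistics


def summarize_waits(waits_minutes):
    """Return the mean plus the 50th, 90th, and 99th percentiles of wait times."""
    cuts = statistics.quantiles(waits_minutes, n=100, method="inclusive")
    return {
        "mean": statistics.mean(waits_minutes),
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
    }


# Made-up bimodal workload, wait times in minutes.
waits = [3] * 80 + [1500] * 20
for name, value in summarize_waits(waits).items():
    print(f"{name}: {value:.1f} min")
```

The mean comes out at roughly five hours, a wait that no job in the sample actually experienced, while the percentiles show both populations clearly.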

A KPI framework is only useful if reviewed regularly and acted upon. Establish a monthly operations review that includes cluster utilization, queue performance, and top-5 inefficiency findings. Contact Mevasis for HPC monitoring setup, XDMoD deployment, and operations metrics consulting.
