HPC Cluster Observability: End-to-End Monitoring with Prometheus, Grafana, and DCGM
End-to-end HPC cluster observability using DCGM Exporter, SLURM Exporter, Prometheus, and Grafana: four-layer monitoring architecture, Prometheus configuration, critical GPU and SLURM alert rules, Grafana dashboard strategy, common problems, and deployment sequence.
In HPC environments, the question must go beyond “did something break?” Hundreds of nodes and thousands of concurrent jobs demand more than failure detection — you need to see immediately which job was affected, which hardware component triggered the failure, and what the impact on the entire system looks like. This capability is what we call observability.
Architecture: Four-Layer Stack
A widely used HPC observability stack consists of four layers:
Data Collection Layer consists of three core exporters:
- DCGM Exporter collects GPU utilization, memory, temperature, power consumption, and ECC error counters from NVIDIA GPUs. Recommended scrape interval: 10 seconds.
- SLURM Exporter pulls queue depth, job state, node state, and resource allocation data from the job scheduler. 30-second intervals are sufficient.
- Node Exporter gathers CPU, RAM, disk, and network metrics from every node at 15-second intervals.
Storage and Query Layer is served by Prometheus. Prometheus scrapes all exporters, writes to its time-series database, and makes data queryable via PromQL. Retention period and storage size must be calculated based on cluster scale: a 100-node installation generates approximately 10–15 GB of data daily, which at Prometheus’s default 15-day retention comes to roughly 150–225 GB on disk.
Visualization Layer is Grafana. Connected to Prometheus as a data source, Grafana provides interactive dashboards and alert visualization.
Notification Layer is Alertmanager. It consolidates alerts produced by Prometheus, applies silence and routing rules, and forwards them to channels like email, Slack, or PagerDuty.
Core Prometheus Configuration
Configure global scrape intervals and separate job definitions for each exporter in prometheus.yml:
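global:
  scrape_interval: 15s        # default for jobs that do not set their own interval
  evaluation_interval: 15s    # how often alert rules are evaluated
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
rule_files:
  - "rules/gpu_alerts.yml"
  - "rules/slurm_alerts.yml"
scrape_configs:
  - job_name: "dcgm"
    scrape_interval: 10s      # GPU metrics: 10-second interval recommended above
    static_configs:
      - targets: ["gpu-node-01:9400", "gpu-node-02:9400"]
  - job_name: "slurm"
    scrape_interval: 30s      # scheduler state: 30 seconds is sufficient
    static_configs:
      - targets: ["slurm-master:8080"]
  - job_name: "node"          # inherits the global 15s interval
    static_configs:
      - targets: ["gpu-node-01:9100", "gpu-node-02:9100"]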
For large clusters, using Prometheus’s service discovery mechanisms (file_sd or DNS-SD) instead of static target lists significantly reduces maintenance burden.
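As a minimal sketch of the file-based approach, the fragment below replaces the static dcgm target list with file_sd_configs. The directory /etc/prometheus/targets/dcgm/, the file name rack-a.yml, and the rack label are hypothetical and chosen only for illustration; the two YAML documents represent two separate files.

```yaml
# prometheus.yml (fragment): the "dcgm" job reads its targets from files
scrape_configs:
  - job_name: "dcgm"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/dcgm/*.yml"   # hypothetical directory
        refresh_interval: 5m                       # periodic re-read of the files
---
# /etc/prometheus/targets/dcgm/rack-a.yml (hypothetical target file)
- targets: ["gpu-node-01:9400", "gpu-node-02:9400"]
  labels:
    rack: "a"                                      # hypothetical label
```

With this layout, adding or removing a node means editing a target file (for example from a provisioning tool), and Prometheus picks up the change without a restart.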
Critical Alert Rules
An effective alerting system monitors not just instantaneous thresholds but trends. The three most important GPU rules:
- GPUTemperatureHigh: If DCGM_FI_DEV_GPU_TEMP > 85 is true for 5 minutes, a warning alert is generated.
- GPUMemoryNearFull: If DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95 persists for 2 minutes, a critical alert fires.
- ECCErrorDetected: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[10m]) > 0 immediately generates a critical alert; uncorrectable memory errors may require hardware replacement.
# rules/gpu_alerts.yml
groups:
  - name: gpu_health
    rules:
      - alert: GPUTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} temperature: {{ $value }}°C"
      - alert: GPUMemoryNearFull
        expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU {{ $labels.gpu }} memory at {{ $value | humanizePercentage }}"
      - alert: ECCErrorDetected
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[10m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Uncorrected ECC error on GPU {{ $labels.gpu }} at {{ $labels.instance }}"
On the SLURM side, alert rules for queue depth and idle nodes form the foundation of capacity planning.
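A possible starting point for rules/slurm_alerts.yml is sketched below. The metric names slurm_queue_pending and slurm_nodes_idle are assumptions based on a commonly used community SLURM exporter; other exporters name these metrics differently, so check the exporter's /metrics output first. The thresholds and durations are hypothetical placeholders to be tuned per cluster.

```yaml
# rules/slurm_alerts.yml (sketch; metric names and thresholds are assumptions)
groups:
  - name: slurm_capacity
    rules:
      # Queue depth: a sustained backlog suggests the cluster is undersized
      - alert: SlurmQueueBacklog
        expr: slurm_queue_pending > 500
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} jobs have been pending for over 30 minutes"
      # Idle nodes while jobs wait: points to partition or constraint mismatches
      - alert: SlurmNodesIdleWhileQueued
        expr: slurm_nodes_idle > 0 and on() slurm_queue_pending > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} nodes idle while jobs are still queued"
```

The long for: durations are deliberate: capacity signals matter as trends, and short spikes in queue depth are normal on a busy scheduler.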
Grafana Dashboard Strategy
Four dashboard groups are recommended for every installation:
Cluster Overview Dashboard presents a GPU/CPU utilization heatmap for all nodes, total resource allocation rate, and real-time power consumption on a single screen. This dashboard lets administrators grasp operational status in seconds.
GPU Detail Dashboard shows per-node temperature trends, memory bandwidth utilization, NVLink/PCIe traffic ratios, and ECC error history. When a job is progressing slower than expected, this dashboard immediately reveals whether the problem is at the node level or GPU level.
SLURM Job Analytics Dashboard provides resource consumption reports by user and project, along with job completion time distributions; both are critical for proving SLA compliance.
Network and Storage Dashboard monitors InfiniBand or Ethernet bandwidth alongside BeeGFS/Lustre read/write performance.
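All four dashboard groups depend on Grafana having Prometheus registered as a data source. One way to make that reproducible is Grafana's file-based provisioning, sketched below. The file path and the prometheus:9090 hostname are assumptions about the deployment, and timeInterval is set to match the 15-second global scrape interval used earlier.

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml (path is an assumption)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend proxies queries to Prometheus
    url: http://prometheus:9090    # assumed hostname and port
    isDefault: true
    jsonData:
      timeInterval: 15s            # lowest step Grafana should use in queries
```

Provisioning the data source from a file keeps the dashboards portable between test and production Grafana instances.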
Common Problems and Solutions
DCGM Exporter not producing data: Ensure the NVIDIA driver version and DCGM Exporter version are compatible. The dcgmi discovery -l command verifies that GPUs are visible to DCGM.
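Prometheus disk exhaustion: The default 15-day retention can cause disk problems on large clusters. Use --storage.tsdb.retention.size to cap TSDB disk usage; when both a time limit and a size limit are set, whichever is reached first applies. WAL compression (--storage.tsdb.wal-compression) reduces the size of the write-ahead log; recent Prometheus releases enable it by default, so the flag only needs to be set explicitly on older versions.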
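As an illustration of how these flags might be passed, the fragment below assumes Prometheus runs under Docker Compose with the prom/prometheus image; on a systemd or Kubernetes deployment the same flags go into the unit file or container args instead. The 400GB limit and the file paths are hypothetical values.

```yaml
# docker-compose.yml (fragment; deployment method and sizes are assumptions)
services:
  prometheus:
    image: prom/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d     # time-based limit
      - --storage.tsdb.retention.size=400GB   # size-based limit; first one reached wins
```

Setting the size limit somewhat below the actual volume size leaves headroom for the write-ahead log and for compaction.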
Alertmanager alert storm: Dozens of alerts from the same underlying issue cause operator fatigue. Configure group_by and group_wait parameters to consolidate similar alerts. Define silence rules for known maintenance windows.
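A grouping configuration along these lines could look like the sketch below. The receiver names and webhook URLs are hypothetical, the timing values are starting points rather than recommendations, and the matchers syntax assumes a reasonably recent Alertmanager release.

```yaml
# alertmanager.yml (sketch; receivers and URLs are hypothetical)
route:
  receiver: "hpc-ops"
  group_by: ["alertname", "instance"]   # one notification per alert type per node
  group_wait: 30s                       # wait for related alerts before the first send
  group_interval: 5m                    # minimum gap between updates to the same group
  repeat_interval: 4h                   # re-notify while the alert is still firing
  routes:
    - matchers:
        - 'severity = "critical"'
      receiver: "hpc-oncall"
      group_wait: 10s                   # page faster for critical alerts
receivers:
  - name: "hpc-ops"
    webhook_configs:
      - url: "http://alert-relay.example.internal/ops"
  - name: "hpc-oncall"
    webhook_configs:
      - url: "http://alert-relay.example.internal/oncall"
```

Grouping by instance means that eight overheating GPUs on one node produce a single notification instead of eight.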
SLURM Exporter permission error: The exporter must have access to the SLURM accounting database. Check the output of sacctmgr show user to verify that the slurm_exporter user has the necessary roles assigned.
Deployment Sequence
When building an observability stack from scratch:
- Deploy Node Exporter and Prometheus first; verify the basic metric stream.
- Add DCGM Exporter to GPU nodes; test that GPU metrics reach Prometheus with a PromQL query.
- Add SLURM Exporter last — it has the most dependency on SLURM configuration.
- Once all data sources are stable, create Grafana dashboards and Alertmanager rules.
Conclusion
Comprehensive observability for your HPC cluster shortens outage duration, increases GPU resource efficiency, and lets you prove SLA compliance with data. For detailed information about Mevasis’s end-to-end HPC observability solution, visit /solutions/observability/. To request a customized architecture and pricing proposal for your existing infrastructure, contact us.