HPC Cluster Observability: End-to-End Monitoring with Prometheus, Grafana, and DCGM

End-to-end HPC cluster observability using DCGM Exporter, SLURM Exporter, Prometheus, and Grafana: four-layer monitoring architecture, Prometheus configuration, critical GPU and SLURM alert rules, Grafana dashboard strategy, common problems, and deployment sequence.

In HPC environments, the question must go beyond “did something break?” Hundreds of nodes and thousands of concurrent jobs demand more than failure detection — you need to see immediately which job was affected, which hardware component triggered the failure, and what the impact on the entire system looks like. This capability is what we call observability.

Architecture: Four-Layer Stack

A widely used HPC observability stack consists of four layers:

Data Collection Layer consists of three core exporters:

  • DCGM Exporter collects GPU utilization, memory, temperature, power consumption, and ECC error counters from NVIDIA GPUs. Recommended scrape interval: 10 seconds.
  • SLURM Exporter pulls queue depth, job state, node state, and resource allocation data from the job scheduler. 30-second intervals are sufficient.
  • Node Exporter gathers CPU, RAM, disk, and network metrics from every node at 15-second intervals.

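Storage and Query Layer is served by Prometheus. Prometheus scrapes all exporters, writes the samples to its time-series database, and makes the data queryable via PromQL. Retention period and storage size must be calculated from cluster scale: required disk space is roughly retention time in seconds × ingested samples per second × bytes per sample, with Prometheus typically storing 1 to 2 bytes per sample. As a rough figure, a 100-node installation generates approximately 10–15 GB of data daily, depending on GPU count, enabled collectors, and scrape intervals.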

Visualization Layer is Grafana. Connected to Prometheus as a data source, Grafana provides interactive dashboards and alert visualization.
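
The data source connection can be provisioned from a file rather than configured by hand in the UI. The following is a minimal sketch using Grafana's data source provisioning format; the file path and the prometheus:9090 address are assumptions chosen to match the host naming used elsewhere in this article.

```yaml
# provisioning/datasources/prometheus.yml (hypothetical path)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend proxies the queries
    url: http://prometheus:9090   # assumed address of the Prometheus server
    isDefault: true
```

Keeping this file in version control alongside prometheus.yml makes the Grafana setup reproducible when the stack is rebuilt.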

Notification Layer is Alertmanager. It consolidates alerts produced by Prometheus, applies silence and routing rules, and forwards them to channels like email, Slack, or PagerDuty.

Core Prometheus Configuration

Configure global scrape intervals and separate job definitions for each exporter in prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "rules/gpu_alerts.yml"
  - "rules/slurm_alerts.yml"

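scrape_configs:
  - job_name: "dcgm"
    scrape_interval: 10s  # GPU metrics change quickly; overrides the global value
    static_configs:
      - targets: ["gpu-node-01:9400", "gpu-node-02:9400"]
  - job_name: "slurm"
    scrape_interval: 30s  # scheduler state changes more slowly
    static_configs:
      - targets: ["slurm-master:8080"]
  - job_name: "node"      # inherits the global 15s interval
    static_configs:
      - targets: ["gpu-node-01:9100", "gpu-node-02:9100"]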

For large clusters, using Prometheus’s service discovery mechanisms (file_sd or DNS-SD) instead of static target lists significantly reduces maintenance burden.
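
As an illustration of the file-based variant, the sketch below replaces the static DCGM target list with file_sd_configs. The targets/dcgm/ directory, the rack-a.yml file name, and the rack label are hypothetical; only the file_sd_configs keys themselves come from Prometheus.

```yaml
# prometheus.yml (fragment): file-based service discovery for the DCGM job
scrape_configs:
  - job_name: "dcgm"
    scrape_interval: 10s
    file_sd_configs:
      - files:
          - "targets/dcgm/*.yml"   # hypothetical directory of target files
        refresh_interval: 1m       # periodic re-read in addition to file watching
---
# targets/dcgm/rack-a.yml (hypothetical file, e.g. generated from the node inventory)
- targets: ["gpu-node-01:9400", "gpu-node-02:9400"]
  labels:
    rack: "a"
```

With this layout, adding or removing nodes means editing or regenerating a target file; Prometheus picks up the change without a restart.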

Critical Alert Rules

An effective alerting system monitors not just instantaneous thresholds but trends. The three most important GPU rules:

  • GPUTemperatureHigh: If DCGM_FI_DEV_GPU_TEMP > 85 is true for 5 minutes, a warning alert is generated.
  • GPUMemoryNearFull: If DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95 persists for 2 minutes, a critical alert fires.
  • ECCErrorDetected: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[10m]) > 0 immediately generates a critical alert; uncorrectable memory errors may require hardware replacement.
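
# rules/gpu_alerts.yml
# Note: depending on the DCGM Exporter version, DCGM_FI_DEV_FB_TOTAL and the
# ECC counters may need to be enabled in the exporter's metrics CSV file.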
groups:
  - name: gpu_health
    rules:
      - alert: GPUTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} temperature: {{ $value }}°C"

      - alert: GPUMemoryNearFull
        expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU {{ $labels.gpu }} memory at {{ $value | humanizePercentage }}"

      - alert: ECCErrorDetected
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[10m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Uncorrected ECC error on GPU {{ $labels.gpu }} at {{ $labels.instance }}"
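
The rules above are threshold-based. To cover the trend side as well, one option is PromQL's predict_linear function, which extrapolates a gauge from its recent history. The following is a sketch only: the alert name, the 15-minute window, the 10-minute horizon, and the 90 °C limit are illustrative assumptions, not recommended values.

```yaml
# rules/gpu_alerts.yml (additional group, illustrative)
groups:
  - name: gpu_trends
    rules:
      - alert: GPUTemperatureRising
        # Fit a line through the last 15 minutes and project it 600 seconds ahead.
        expr: predict_linear(DCGM_FI_DEV_GPU_TEMP[15m], 600) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} is projected to exceed 90°C within 10 minutes"
```

A rule like this can fire before GPUTemperatureHigh does, which leaves time to drain the node or inspect its cooling.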

On the SLURM side, alert rules for queue depth and idle nodes form the foundation of capacity planning.
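
A possible shape for rules/slurm_alerts.yml is sketched below. The metric names slurm_queue_pending and slurm_nodes_idle are assumed from one commonly used Prometheus SLURM exporter and may differ in your exporter; the thresholds and durations are placeholders to be tuned per cluster.

```yaml
# rules/slurm_alerts.yml (sketch; metric names depend on the exporter in use)
groups:
  - name: slurm_capacity
    rules:
      - alert: SlurmQueueBacklog
        expr: slurm_queue_pending > 500      # placeholder threshold
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} jobs have been pending for more than 30 minutes"

      - alert: SlurmIdleNodesWithPendingJobs
        # Idle nodes while jobs wait may point to partition or constraint mismatches.
        expr: slurm_nodes_idle > 0 and slurm_queue_pending > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} nodes idle while jobs are pending"
```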

Grafana Dashboard Strategy

Four dashboard groups are recommended for every installation:

Cluster Overview Dashboard presents a GPU/CPU utilization heatmap for all nodes, total resource allocation rate, and real-time power consumption on a single screen. This dashboard lets administrators grasp operational status in seconds.

GPU Detail Dashboard shows per-node temperature trends, memory bandwidth utilization, NVLink/PCIe traffic ratios, and ECC error history. When a job is progressing slower than expected, this dashboard immediately reveals whether the problem is at the node level or GPU level.

SLURM Job Analytics Dashboard provides resource consumption reports by user and project, along with job completion time distributions. Both are critical for proving SLA compliance.

Network and Storage Dashboard monitors InfiniBand or Ethernet bandwidth alongside BeeGFS/Lustre read/write performance.
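
Overview panels that aggregate across every node can become slow to render on large clusters. One way to keep them responsive is to precompute the aggregates with Prometheus recording rules, sketched below. The file name and the recorded series names are hypothetical; the source metrics are standard DCGM Exporter and Node Exporter metrics.

```yaml
# rules/dashboard_recording.yml (hypothetical file; must also be listed under rule_files)
groups:
  - name: cluster_overview
    interval: 30s
    rules:
      # Average GPU utilization per node, for the utilization heatmap.
      - record: instance:dcgm_gpu_util:avg
        expr: avg by (instance) (DCGM_FI_DEV_GPU_UTIL)
      # Total GPU power draw across the cluster, in watts.
      - record: cluster:dcgm_power_usage_watts:sum
        expr: sum(DCGM_FI_DEV_POWER_USAGE)
      # CPU utilization per node as a ratio between 0 and 1.
      - record: instance:node_cpu_utilization:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

Grafana panels can then query the recorded series directly instead of re-evaluating the aggregation on every refresh.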

Common Problems and Solutions

DCGM Exporter not producing data: Ensure the NVIDIA driver version and DCGM Exporter version are compatible. The dcgmi discovery -l command verifies that GPUs are visible to DCGM.

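Prometheus disk exhaustion: The default 15-day retention can exhaust disk space on large clusters. Use --storage.tsdb.retention.size to cap the TSDB size; once the limit is reached, the oldest blocks are removed first. WAL compression (--storage.tsdb.wal-compression) is enabled by default since Prometheus 2.20; on older versions, enable it explicitly.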
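
How these flags are passed depends on how Prometheus is deployed. As one example, assuming a Docker Compose deployment (an assumption, not something this article prescribes), they could be set as follows; the 400GB limit and the volume layout are placeholders.

```yaml
# docker-compose.yml (fragment, assuming a Docker Compose deployment)
services:
  prometheus:
    image: prom/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=15d"
      - "--storage.tsdb.retention.size=400GB"   # placeholder; size to the data volume
      - "--storage.tsdb.wal-compression"
    volumes:
      - ./prometheus:/etc/prometheus:ro          # prometheus.yml and rules/
      - prometheus-data:/prometheus

volumes:
  prometheus-data:
```

When both a time and a size limit are set, whichever is reached first triggers deletion of old data.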

Alertmanager alert storm: Dozens of alerts from the same underlying issue cause operator fatigue. Configure the group_by and group_wait parameters to consolidate similar alerts into a single notification, and create silences ahead of known maintenance windows.
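
A minimal routing sketch along these lines is shown below. The receiver names and webhook URLs are hypothetical placeholders, and the timing values are starting points rather than recommendations.

```yaml
# alertmanager.yml (sketch)
route:
  receiver: "ops-default"
  group_by: ["alertname", "instance"]   # one notification per alert type and node
  group_wait: 30s                       # wait for related alerts before the first send
  group_interval: 5m                    # minimum gap between updates for a group
  repeat_interval: 4h                   # re-notify while the alert keeps firing
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: "ops-oncall"

receivers:
  - name: "ops-default"
    webhook_configs:
      - url: "http://alert-relay.example.internal:8080/default"   # hypothetical endpoint
  - name: "ops-oncall"
    webhook_configs:
      - url: "http://alert-relay.example.internal:8080/oncall"    # hypothetical endpoint
```

Grouping by instance means a node with eight overheating GPUs produces one notification instead of eight.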

SLURM Exporter permission error: The exporter must have access to the SLURM accounting database. Check the output of sacctmgr show user to confirm that the account the exporter runs as (slurm_exporter in this example) exists and has the necessary privileges.

Deployment Sequence

When building an observability stack from scratch:

  1. Deploy Node Exporter and Prometheus first; verify the basic metric stream.
  2. Add DCGM Exporter to GPU nodes; test that GPU metrics reach Prometheus with a PromQL query such as DCGM_FI_DEV_GPU_TEMP, which should return one series per GPU.
  3. Add SLURM Exporter last — it has the most dependency on SLURM configuration.
  4. Once all data sources are stable, create Grafana dashboards, Prometheus alert rules, and Alertmanager routing.

Conclusion

Comprehensive observability for your HPC cluster shortens outage duration, increases GPU resource efficiency, and lets you prove SLA compliance with data. For detailed information about Mevasis’s end-to-end HPC observability solution, visit /solutions/observability/. To request a customized architecture and pricing proposal for your existing infrastructure, contact us.