HPC Energy Efficiency Guide: DVFS, nvidia-smi, PUE, Liquid Cooling

A 512-node HPC cluster with 64-core nodes drawing 500W each consumes 256 kW at full load. At $0.10/kWh, 24/7 operation, that is $224,256 per year in electricity costs — before cooling overhead. Energy efficiency is not just an environmental consideration; it directly affects the economic viability of HPC infrastructure and the amount of computing that can be run within a given power budget.
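The arithmetic behind that figure can be reproduced with a small shell helper. This is a minimal sketch: the function name `annual_cost_usd` and the optional PUE argument are illustrative assumptions, not part of any tool.

```shell
#!/bin/bash
# annual_cost_usd NODES WATTS_PER_NODE USD_PER_KWH [PUE]
# Prints the yearly electricity cost of a cluster running 24/7.
# PUE defaults to 1.0, i.e. IT load only with no cooling overhead.
annual_cost_usd() {
    awk -v n="$1" -v w="$2" -v price="$3" -v pue="${4:-1.0}" 'BEGIN {
        kw  = n * w / 1000        # IT load in kW
        kwh = kw * 24 * 365       # energy per year in kWh
        printf "%.2f\n", kwh * price * pue
    }'
}

annual_cost_usd 512 500 0.10        # IT load only: 224256.00
annual_cost_usd 512 500 0.10 1.5    # same cluster at an assumed PUE of 1.5: 336384.00
```

Changing the node count, per-node draw, or tariff shows quickly how sensitive the yearly bill is to each input.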

Energy Cost Analysis

Understanding where power goes is the first step:

Component	Typical Power per Node	64-Node Cluster Total
CPU (dual socket, full load)	250–700 W	16–45 kW
GPU (H100, per card × 4)	700 W × 4 = 2,800 W	179 kW
Memory (DDR5, 512 GB)	~40 W	2.5 kW
NIC/HCA (2× HDR IB)	~25 W	1.6 kW
Storage (NVMe × 4)	~25 W	1.6 kW
Idle overhead	~30%	varies

For GPU clusters, the GPU is the dominant power consumer (60–80% of node power at full load). For CPU-only clusters, the CPUs account for 50–70%.

DVFS: Dynamic Voltage and Frequency Scaling

DVFS reduces CPU power by operating at lower voltage and frequency when full performance is not needed. On Linux, the cpupower tool manages CPU frequency governors:

# Check available governors
cpupower frequency-info | grep -E "governor|frequency"

# Set to performance mode for compute jobs (run as root; favors maximum frequency)
cpupower frequency-set -g performance

# Set to powersave mode for idle nodes (favors low frequency; under intel_pstate
# the frequency still scales with load)
cpupower frequency-set -g powersave

# Check current frequency across all cores
grep "cpu MHz" /proc/cpuinfo | awk '{sum += $4; n++} END {print "avg:", sum/n, "MHz"}'

The energy savings from powersave governor vs performance governor can be 20–40% at idle, with 0–5% performance impact on workloads that are memory-bandwidth or I/O bound (which are already not using peak CPU frequency).
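To confirm which governor is actually in effect on a node, the per-CPU `scaling_governor` files in sysfs can be summarized. The helper below is a sketch; the function name `governor_summary` and its optional directory argument (useful for testing against a copy of the tree) are assumptions for illustration.

```shell
#!/bin/bash
# governor_summary [SYSFS_CPU_DIR]
# Counts logical CPUs per active cpufreq governor.
# Prints nothing if the kernel exposes no cpufreq information.
governor_summary() {
    local base="${1:-/sys/devices/system/cpu}"
    cat "$base"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null |
        sort | uniq -c | awk '{ print $2 ": " $1 " CPUs" }'
}

governor_summary
```

A node that reports a mix of governors usually means a `cpupower frequency-set` call did not run as root or was applied to a subset of CPUs.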

GPU Power Management with nvidia-smi

NVIDIA GPUs support power limits via nvidia-smi:

# Check current power limit and consumption
nvidia-smi --query-gpu=name,power.limit,power.draw --format=csv

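# Set power limit to 80% of TDP for thermal-limited environments (run as root)
# H100 TDP: 700W -> 80% = 560W
# Without -i, the limit is applied to every GPU in the system
nvidia-smi -pl 560

# To limit selected GPUs only, pass their indices with -i
for i in 0 1 2 3; do
    nvidia-smi -i "$i" -pl 560
done

# The limit is reset on reboot or driver reload; reapply it at boot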

Setting GPU power limits to 80% typically reduces throughput by 5–10% for memory-bandwidth-bound workloads while reducing power by 20% and temperature by 8–10°C. For thermally constrained installations, this trade-off is often worth it.
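Hardcoding 560 W only suits a 700 W part. A sketch that derives the cap from each GPU's own default limit is shown here; the function names `capped_limit_watts` and `apply_cap` are hypothetical, while `power.default_limit` is a standard nvidia-smi query field. It assumes every GPU reports a numeric default limit.

```shell
#!/bin/bash
# capped_limit_watts DEFAULT_LIMIT_W PERCENT
# Prints the integer wattage for a cap of PERCENT of the default limit.
capped_limit_watts() {
    awk -v w="$1" -v pct="$2" 'BEGIN { printf "%d\n", w * pct / 100 }'
}

# apply_cap PERCENT
# Caps every GPU at PERCENT of its own default limit. Needs root.
# Does nothing on hosts without nvidia-smi.
apply_cap() {
    command -v nvidia-smi > /dev/null 2>&1 || return 0
    nvidia-smi --query-gpu=index,power.default_limit --format=csv,noheader,nounits |
    while IFS=', ' read -r idx default_w; do
        nvidia-smi -i "$idx" -pl "$(capped_limit_watts "$default_w" "$1")"
    done
}

capped_limit_watts 700 80   # prints 560
```

Calling `apply_cap 80` from a boot-time unit or the SLURM prolog would reapply the cap after each restart.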

SLURM Prolog/Epilog for Automatic Power Management

Automate CPU frequency management with SLURM job hooks:

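#!/bin/bash
# /usr/local/sbin/slurm-job-prolog.sh
# Runs as root on the compute node when a job starts

# Switch to the performance governor for the duration of the job
cpupower frequency-set -g performance > /dev/null 2>&1

# Enable turbo/boost: intel_pstate exposes no_turbo (0 = turbo on);
# acpi-cpufreq and some other drivers expose cpufreq/boost (1 = boost on)
if [ -w /sys/devices/system/cpu/intel_pstate/no_turbo ]; then
    echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo
elif [ -w /sys/devices/system/cpu/cpufreq/boost ]; then
    echo 1 > /sys/devices/system/cpu/cpufreq/boost
fi

# Always exit 0: a non-zero Prolog exit drains the node and requeues the job
exit 0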

#!/bin/bash
# /usr/local/sbin/slurm-job-epilog.sh
# Runs as root on the compute node after a job ends

# Return to powersave when the node becomes idle
# (on nodes shared by several jobs, first check that no other job is still running)
cpupower frequency-set -g powersave > /dev/null 2>&1

exit 0

# slurm.conf
Prolog=/usr/local/sbin/slurm-job-prolog.sh
Epilog=/usr/local/sbin/slurm-job-epilog.sh
PrologFlags=Alloc

This ensures nodes run at full speed during jobs and conserve power between jobs without any user involvement.

Energy-Aware Scheduling

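SLURM's --hint=nomultithread option tells the scheduler to place one task per physical core and leave the sibling hardware threads unused. It does not switch Hyper-Threading off in firmware, but for codes that do not benefit from HT it can reduce power consumption by 10–15%: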

# Submit job without hyperthreading
sbatch --hint=nomultithread my_simulation.sh

# For jobs that do not need a whole node, allocate by core and memory so
# several jobs can share a node and other nodes can stay fully idle
# slurm.conf
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

Node consolidation and power-down: Rather than spreading jobs across many partially loaded nodes, configure SLURM to pack jobs onto fewer nodes and power down the nodes that stay idle. This requires hardware support for remote power control, for example via IPMI:

# SLURM power management with IPMI (slurm.conf)
# SuspendTime must be set, otherwise idle nodes are never suspended
SuspendTime=1800        # idle seconds before suspend; example value
SuspendProgram=/usr/local/sbin/suspend-node-ipmi.sh
ResumeProgram=/usr/local/sbin/resume-node-ipmi.sh
SuspendTimeout=90
ResumeTimeout=300
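The suspend and resume programs themselves are site-specific. A minimal sketch of what the suspend script might contain follows. It assumes each BMC is reachable as `<node>-ipmi`, that the IPMI user is `admin`, and that the password is exported in `IPMI_PASSWORD`; all three, along with the function name `power_nodes`, are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical /usr/local/sbin/suspend-node-ipmi.sh
# Slurm calls SuspendProgram with one argument: a hostlist such as "node[01-04]".
# Assumptions: BMC hostname is <node>-ipmi, IPMI user is "admin",
# password is read from the IPMI_PASSWORD environment variable (ipmitool -E).

# power_nodes ACTION HOSTLIST
# ACTION is an ipmitool chassis power action: "soft" (ACPI shutdown) or "on".
power_nodes() {
    local action="$1" hostlist="$2" node
    # scontrol expands the hostlist expression into one hostname per line
    for node in $(scontrol show hostnames "$hostlist"); do
        ipmitool -I lanplus -H "${node}-ipmi" -U admin -E chassis power "$action"
    done
}

# Suspend script body; the resume script would call: power_nodes on "$1"
if [ -n "$1" ]; then
    power_nodes soft "$1"
fi
```

Using the soft action asks the operating system to shut down cleanly instead of cutting power. ResumeTimeout then has to cover a full boot.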

Dark-Hours Strategy

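If the workload profile allows, schedule energy-intensive batch workloads during off-peak electricity hours (often around 10 PM – 6 AM, depending on the tariff). A recurring SLURM reservation can hold the nodes for this window; jobs opt in with sbatch --reservation=off-peak-batch:

# Create a recurring daily reservation for the off-peak batch window
# Give either EndTime or Duration, not both; Duration is used here
# Users (or Accounts) must list real names; alice,bob are placeholders
scontrol create \
  ReservationName=off-peak-batch \
  StartTime=22:00:00 \
  Duration=08:00:00 \
  Nodes=ALL \
  Flags=DAILY \
  Users=alice,bob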

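In markets with time-of-use electricity pricing, shifting 40% of compute to off-peak hours can reduce electricity cost by 25–30%, provided the off-peak rate is roughly 25–40% of the peak rate. The saving is the shifted fraction times the relative price discount: with off-peak power at one third of the peak price, 0.40 × (1 − 1/3) ≈ 27%.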
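The saving depends strongly on the gap between peak and off-peak rates, so it is worth checking against the local tariff. The helper below is a rough sketch: the function name `tou_saving_pct` and the example prices are assumptions, and the baseline is a cluster that runs everything at the peak price.

```shell
#!/bin/bash
# tou_saving_pct SHIFTED_FRACTION OFFPEAK_PRICE PEAK_PRICE
# Percent saved compared with running all work at the peak price.
tou_saving_pct() {
    awk -v f="$1" -v off="$2" -v peak="$3" 'BEGIN {
        printf "%.1f\n", 100 * f * (1 - off / peak)
    }'
}

tou_saving_pct 0.40 0.05 0.15   # off-peak at one third of peak: 26.7
tou_saving_pct 0.40 0.10 0.15   # off-peak at two thirds of peak: 13.3
```

With a shallow tariff the same 40% shift saves only about half as much, which is why the scheduling effort pays off mainly where the price spread is wide.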

PUE: Power Usage Effectiveness

PUE measures data center efficiency:

PUE = Total Facility Power / IT Equipment Power

PUE = 1.0: Perfect (all power goes to IT)
PUE = 1.5: Typical datacenter (50% overhead for cooling, power)
PUE = 2.0: Inefficient (equal power for IT and overhead)

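Reducing PUE from 2.0 to 1.3 for a 1 MW IT load cuts facility power from 2.0 MW to 1.3 MW, a saving of 700 kW. Put the other way, the original 2.0 MW budget at PUE 1.3 supports about 1.54 MW of IT load, roughly 54% more compute for the same electricity bill.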
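The PUE relationship is easy to work through for other loads. The following sketch uses two hypothetical helpers, `facility_kw` and `it_kw_for_budget`, that apply the formula in each direction.

```shell
#!/bin/bash
# facility_kw IT_KW PUE
# Total facility power for a given IT load.
facility_kw() {
    awk -v it="$1" -v pue="$2" 'BEGIN { printf "%.0f\n", it * pue }'
}

# it_kw_for_budget FACILITY_KW PUE
# IT load that fits within a fixed facility power budget.
it_kw_for_budget() {
    awk -v total="$1" -v pue="$2" 'BEGIN { printf "%.0f\n", total / pue }'
}

before=$(facility_kw 1000 2.0)   # 2000 kW
after=$(facility_kw 1000 1.3)    # 1300 kW
echo "saved: $((before - after)) kW"
echo "IT load in the old budget at PUE 1.3: $(it_kw_for_budget "$before" 1.3) kW"
```

For the 1 MW example this prints a saving of 700 kW and an IT load of 1538 kW within the original 2 MW budget.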

Cooling Efficiency

Hot/cold aisle containment: Separate hot exhaust air from cold supply air with physical containment barriers. Prevents hot air recirculation, which forces CRAC units to over-cool. A properly contained aisle can reduce cooling energy by 20–30%.

[Cold aisle] [server fronts] → hot air → [Hot aisle] → CRAC return
           cold air ← CRAC supply ←

Raised inlet temperature: ASHRAE recommends server inlet temperatures of 18–27°C and allows up to 32°C for class A1 and 35°C for class A2 equipment. Many sites cool to 18°C by default, which wastes cooling energy. Raising the cooling setpoint from 18°C to 25°C can reduce cooling energy by 30–40%.

Liquid cooling for GPU nodes: DGX H100 and HGX H100 configurations generate 1,000–4,000 W per rack unit. Air cooling at this density requires extremely high airflow, which is noisy and energy-intensive. Liquid cooling (rear-door heat exchangers or direct-to-chip cooling loops) removes heat at or near the source with 40–60% less energy than equivalent air cooling.

Monitoring Power with IPMI

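# Read node power consumption via IPMI
# Uses the DCMI power reading (BMC support assumed); sdr sensor names vary by vendor
# The password is taken from the IPMI_PASSWORD environment variable (-E)
ipmitool -I lanplus -H node01 -U admin -E dcmi power reading

# Log cluster-wide power to a Prometheus Pushgateway for the time series database
for node in $(sinfo -N --noheader -o "%N" | sort -u); do
    power=$(ipmitool -I lanplus -H "${node}-ipmi" -U admin -E dcmi power reading 2>/dev/null \
            | awk '/Instantaneous power reading/ {print $4}')
    [ -n "$power" ] || continue   # skip nodes whose BMC did not answer
    # One grouping key per node, so a push does not overwrite other nodes
    echo "hpc_node_power_watts $power" | \
      curl -s --data-binary @- "http://pushgateway:9091/metrics/job/ipmi_power/node/${node}"
done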

Integrating power metrics into Grafana alongside SLURM job metrics enables correlation between workload type and power consumption — invaluable for TCO modeling and energy budget planning.

Energy efficiency in HPC is achievable with little loss of performance. DVFS, GPU power limits, SLURM automation, and cooling optimization together can reduce total energy consumption by 30–50% compared to an untuned cluster. Contact Mevasis for HPC energy efficiency assessment and implementation.
