HPC Cluster Deployment Checklist: 7-Phase Pre-Production Verification

Complete HPC cluster deployment checklist across 7 phases: pre-installation planning, hardware installation, software configuration, storage verification, security and access control, monitoring and alerting, and operational readiness.

A checklist is not a sign of distrust in expertise. It is how aviation, surgery, and nuclear power guard against catastrophic failures. HPC cluster deployment involves enough independent steps across hardware, network, storage, and software that skipping even one can cost hours of debugging or force a complete rebuild. Use this checklist for every new cluster deployment and every major upgrade.

Phase 1: Pre-Installation Planning

Architecture and Design

  • Workload analysis completed: application types, job sizes, I/O patterns documented
  • Node count, memory, and storage sized to workload requirements
  • Network topology designed (fat-tree port count, InfiniBand vs. RoCE decision)
  • Storage tier design complete (scratch, project, archive capacities)
  • Management and HA strategy defined (slurmctld HA, storage mirroring)
  • IP address plan and VLAN assignment documented
  • Security requirements reviewed (compliance, data classification)

Physical Infrastructure

  • Rack space reserved and labeled
  • Power circuits provisioned and PDUs ordered
  • Network cabling plan reviewed with datacenter team
  • Cooling capacity verified for peak GPU/CPU thermal load
  • Physical access and delivery scheduling confirmed
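Cooling verification usually starts from simple arithmetic: nearly all electrical power drawn by the nodes leaves the rack as heat, and 1 W corresponds to roughly 3.412 BTU/h. The sketch below performs that conversion. The node count, per-node peak draw, and overhead fraction are illustrative assumptions and should come from vendor power figures for the actual hardware.

```python
WATTS_TO_BTU_PER_HOUR = 3.412  # 1 W of load is about 3.412 BTU/h of heat


def heat_load(node_count, peak_watts_per_node, overhead_fraction=0.0):
    """Return (total watts, total BTU/h) for a group of nodes.

    overhead_fraction is a hypothetical allowance for switches, PDUs, and fans.
    """
    total_watts = node_count * peak_watts_per_node * (1 + overhead_fraction)
    return total_watts, total_watts * WATTS_TO_BTU_PER_HOUR


# Hypothetical rack: 8 GPU nodes at 3000 W peak each, no extra overhead.
watts, btu_per_hour = heat_load(8, 3000)
print(f"{watts:.0f} W is about {btu_per_hour:.0f} BTU/h")
```

Compare the result against the per-rack cooling figure the datacenter team provides, leaving headroom for future expansion.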

Phase 2: Hardware Installation

Server Installation

  • Servers racked and mounted per rack diagram
  • Power cables connected and redundant PSU verified
  • Network cables connected per cabling plan and labeled
  • InfiniBand cables connected and transceivers verified
  • BIOS/UEFI updated to latest vendor-recommended version
  • BIOS settings optimized: NUMA, IOMMU, SR-IOV, C-states, Turbo as appropriate
  • IPMI/BMC configured with static IP and admin credentials
  • Remote console access verified on all nodes

Network Hardware

  • InfiniBand switches powered and cabled
  • Switch firmware updated to recommended version
  • Ethernet switches powered and uplinks verified
  • Management VLAN, compute VLAN, storage VLAN configured on switches
  • MTU 9000 (jumbo frames) enabled on compute and storage switch ports
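Enabling jumbo frames on the switch does not prove that they work end to end. A common check is a ping with the don't-fragment bit set and a payload sized to fill the MTU exactly. The sketch below builds that command for Linux iputils ping, assuming IPv4 without IP options. The host name is a placeholder.

```python
IPV4_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8


def ping_command_for_mtu(host, mtu=9000):
    """Build a ping command that only succeeds if `mtu` is usable on the whole path."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    # -M do : set the don't-fragment bit so oversized packets are rejected, not split
    # -c 3  : send three probes
    # -s N  : ICMP payload size in bytes
    return ["ping", "-M", "do", "-c", "3", "-s", str(payload), host]


# "node01" is a hypothetical host name.
print(" ".join(ping_command_for_mtu("node01")))
```

For MTU 9000 the payload is 8972 bytes. If any port along the path is still at MTU 1500, the ping fails instead of silently fragmenting.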

Phase 3: Software Configuration

Base OS

  • Operating system installed on all nodes via PXE/network boot
  • OS version identical across all compute nodes
  • NTP/Chrony configured and synchronized (< 100 ms offset from source)
  • SSH configured: key authentication only, password disabled, SSH host keys distributed
  • /etc/hosts or DNS entries for all cluster nodes verified
  • Firewall rules configured: only required ports open between network segments
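The time synchronization threshold can be checked mechanically. The sketch below parses the "System time" line from chronyc tracking output and compares the offset against the 100 ms limit in the checklist. The sample text is illustrative, and collecting the output from each node (for example over SSH) is left out.

```python
import re

_SYSTEM_TIME = re.compile(
    r"^System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time",
    re.MULTILINE,
)


def system_time_offset(tracking_output):
    """Return the signed clock offset in seconds (positive means the clock is fast)."""
    match = _SYSTEM_TIME.search(tracking_output)
    if match is None:
        raise ValueError("no 'System time' line found in chronyc output")
    seconds = float(match.group(1))
    return seconds if match.group(2) == "fast" else -seconds


def within_limit(tracking_output, limit_seconds=0.100):
    """True if the absolute offset is below the limit (100 ms by default)."""
    return abs(system_time_offset(tracking_output)) < limit_seconds


# Illustrative excerpt of `chronyc tracking` output.
SAMPLE = """Reference ID    : CB00710F (ntp1.example.net)
Stratum         : 3
System time     : 0.000006523 seconds slow of NTP time
Last offset     : -0.000006747 seconds
"""

print(within_limit(SAMPLE))
```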

Authentication

  • LDAP or FreeIPA server deployed and functional
  • All cluster nodes joined to LDAP/FreeIPA domain
  • User creation, home directory provisioning tested
  • MUNGE installed and identical key distributed to all nodes
  • MUNGE service running on all nodes: munge -n | unmunge passes
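Running munge -n | unmunge locally only proves that a node can decode its own credential. To confirm that every node holds the same key, one option is to compare key digests gathered from each node. The sketch below assumes the digests have already been collected by some means (for example a checksum run on each node). The node names and key bytes are hypothetical.

```python
import hashlib


def key_digest(key_bytes):
    """SHA-256 digest of a key file's contents, safe to compare and log."""
    return hashlib.sha256(key_bytes).hexdigest()


def nodes_with_mismatched_key(digests, reference_node):
    """Return the sorted names of nodes whose digest differs from the reference node's."""
    expected = digests[reference_node]
    return sorted(node for node, digest in digests.items() if digest != expected)


# Hypothetical digests; in practice these come from the key file on each node.
DIGESTS = {
    "controller": key_digest(b"example-key-A"),
    "node01": key_digest(b"example-key-A"),
    "node02": key_digest(b"example-key-B"),  # stale key left over from a reinstall
}

print(nodes_with_mismatched_key(DIGESTS, "controller"))
```

Comparing digests rather than the key itself avoids copying secret material to the machine that runs the check.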

SLURM

  • SLURM packages installed: slurmctld, slurmd, slurmdbd, munge
  • /etc/slurm/slurm.conf reviewed, then checked by starting slurmctld in the foreground (slurmctld -D -v) and confirming it reports no configuration errors
  • Partition definitions match intended workload profiles
  • slurmctld running on controller node
  • slurmd running on all compute nodes: sinfo shows all nodes IDLE
  • slurmdbd running and connected to controller
  • Test job submission: sbatch --wrap="hostname" runs and completes
  • GRES (GPU) configuration verified: sinfo -o "%N %G" shows correct GPU count
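On a cluster with hundreds of nodes, reading sinfo output by eye is error-prone. The sketch below parses node-oriented output, as produced by sinfo -N -h -o "%N %T", and lists every node that is not idle. The sample text is illustrative, and running the command itself is left out.

```python
def non_idle_nodes(sinfo_output):
    """Map node name to state for every node whose state is not exactly 'idle'."""
    problems = {}
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        node, state = parts
        if state != "idle":
            problems[node] = state
    return problems


# Illustrative output; a trailing '*' marks a node that is not responding.
SAMPLE = """node01 idle
node02 down*
node03 idle
node04 drained
"""

print(non_idle_nodes(SAMPLE))
```

Before go-live the result should be empty. Any entry points at a node to investigate before the pilot starts.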

Phase 4: Storage Verification

Parallel Filesystem (BeeGFS/Lustre)

  • Management service (BeeGFS mgmtd, or the MGS on Lustre) running and accessible
  • All metadata servers registered: beegfs-ctl --listnodes --nodetype=meta
  • All storage servers registered: beegfs-ctl --listnodes --nodetype=storage
  • Filesystem mounted on all compute nodes: df -h /mnt/scratch shows expected size
  • Write permission test: all users can write to their scratch directory
  • Quota system configured and enforced (where applicable)
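The mount check can be scripted so that it fails loudly when a node has silently fallen back to a local directory. The sketch below confirms that a path is a mount point and that its total capacity is close to the expected value. The 5% tolerance is an assumption, and the path and capacity in any real call come from the storage design.

```python
import os
import shutil


def mount_matches_expected(path, expected_bytes, tolerance=0.05):
    """True if `path` is a mount point whose size is within `tolerance` of expected."""
    if not os.path.ismount(path):
        # Catches the case where the filesystem is not mounted and the
        # directory is just an empty folder on the local disk.
        return False
    total = shutil.disk_usage(path).total
    return abs(total - expected_bytes) <= tolerance * expected_bytes
```

A call might look like mount_matches_expected("/mnt/scratch", 500 * 10**12), where the 500 TB figure is a hypothetical design capacity.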

I/O Benchmark

  • IOR sequential write benchmark run and results recorded
  • IOR sequential read benchmark run and results recorded
  • Aggregate read and write bandwidth within 20% of the design specification
  • mdtest small-file metadata performance verified
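Recording benchmark results is easier to audit when the pass/fail decision is computed rather than judged by eye. The sketch below extracts the "Max Write" and "Max Read" summary lines from IOR output and checks them against a design target, reading the 20% margin as "no more than 20% below target". The summary line format is assumed from IOR 3.x output and may differ between versions. The numbers are illustrative.

```python
import re

_MAX_LINE = re.compile(r"^Max (Write|Read):\s+([0-9.]+) MiB/sec", re.MULTILINE)


def max_bandwidth_mib(ior_output):
    """Return {'write': MiB/s, 'read': MiB/s} for the summary lines that are present."""
    return {op.lower(): float(value) for op, value in _MAX_LINE.findall(ior_output)}


def meets_design(measured, design, allowed_shortfall=0.20):
    """True if measured bandwidth is at most `allowed_shortfall` below the design value."""
    return measured >= design * (1 - allowed_shortfall)


# Illustrative summary lines, not real measurements.
SAMPLE = """Max Write: 7041.96 MiB/sec (7384.05 MB/sec)
Max Read:  9120.40 MiB/sec (9563.43 MB/sec)
"""

results = max_bandwidth_mib(SAMPLE)
# Hypothetical design targets in MiB/s.
print(meets_design(results["write"], 10000), meets_design(results["read"], 10000))
```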

Backup

  • Backup system configured for home directories and project storage
  • First backup run completed successfully
  • Restore test: random file restored from backup and verified
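The restore test is only meaningful if the restored file is compared byte for byte with the original. The sketch below picks a random file under a directory tree and compares it with its restored copy using the standard filecmp module. The restore itself is performed by the backup tool and is not shown.

```python
import filecmp
import os
import random


def pick_random_file(root, rng=random):
    """Choose one file at random from everything under `root`."""
    files = [
        os.path.join(dirpath, name)
        for dirpath, _dirs, names in os.walk(root)
        for name in names
    ]
    if not files:
        raise ValueError(f"no files found under {root}")
    return rng.choice(files)


def restore_matches(original_path, restored_path):
    """True if both files have identical contents."""
    # shallow=False forces a content comparison instead of trusting size and mtime.
    return filecmp.cmp(original_path, restored_path, shallow=False)
```

Restoring to a separate location, rather than over the original, keeps the source intact if the backup turns out to be corrupt.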

Phase 5: Security and Access Control

Network Security

  • Management network accessible only from admin hosts
  • Compute nodes not directly accessible from external network
  • InfiniBand / compute network isolated from management and external
  • IPMI/BMC management interface on dedicated out-of-band network
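Network isolation is best verified from the outside: run a probe from a host that should not have access and confirm that connections fail. The sketch below is a minimal TCP reachability check using the standard socket module. The host names and ports used in any real run are site-specific.

```python
import socket


def port_is_reachable(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Run from an external host, a call such as port_is_reachable("node01", 22) should return False if compute nodes are correctly shielded. Here "node01" is a hypothetical host name.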

Authentication and Authorization

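  • Root SSH login disabled on all nodes (PermitRootLogin no), or limited to key authentication (PermitRootLogin prohibit-password) where root key access is required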
  • Only authorized admin keys in root ~/.ssh/authorized_keys
  • sudo configured with principle of least privilege
  • User home directory permissions correct (mode 700 or 750)
  • /etc/sudoers reviewed and non-default rules documented
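Home directory permissions drift over time, so the check is worth automating. The sketch below reads the permission bits of a directory and accepts only the two modes named in the checklist. Iterating over the actual home directories is left to the caller.

```python
import os
import stat

ALLOWED_MODES = {0o700, 0o750}


def home_mode_ok(path):
    """True if the directory's permission bits are exactly 700 or 750."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode in ALLOWED_MODES
```

Any directory for which this returns False, for example one left at mode 755, allows other users to list its contents.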

Software Security

  • All nodes patched to current security errata
  • Unneeded services disabled (Bluetooth, avahi, cups, etc.)
  • SELinux or AppArmor policy reviewed (enforcing or permissive, documented)
  • Auditd configured and logging to central syslog

Phase 6: Monitoring and Alerting

Metrics Collection

  • Node Exporter deployed on all compute nodes
  • SLURM Exporter deployed on controller node
  • DCGM Exporter deployed on all GPU nodes
  • Prometheus scraping all exporters (verify with curl http://prometheus:9090/api/v1/targets, or on the Targets page of the web UI)
  • Grafana dashboards: cluster overview, GPU detail, SLURM queue, storage I/O
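The Prometheus HTTP API endpoint /api/v1/targets returns JSON that lists each active target with a health field, which makes the scrape check scriptable. The sketch below filters that response for targets that are not "up". The sample payload is a trimmed, illustrative response, and fetching it from the server is not shown.

```python
import json


def unhealthy_targets(api_response_text):
    """Return (job, scrape URL, health) for every active target that is not 'up'."""
    payload = json.loads(api_response_text)
    return [
        (target["labels"].get("job"), target["scrapeUrl"], target["health"])
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    ]


# Trimmed, illustrative response; host names and job names are hypothetical.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "node", "instance": "node01:9100"},
                "scrapeUrl": "http://node01:9100/metrics",
                "health": "up",
            },
            {
                "labels": {"job": "dcgm", "instance": "gpu01:9400"},
                "scrapeUrl": "http://gpu01:9400/metrics",
                "health": "down",
            },
        ]
    },
})

print(unhealthy_targets(SAMPLE))
```

An empty list means every configured exporter is being scraped. It does not detect nodes that were never added to the scrape configuration, so compare the target count against the node inventory as well.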

Alert Rules

  • Alert: node down (unreachable for > 5 minutes)
  • Alert: GPU temperature > 85°C for > 5 minutes
  • Alert: GPU ECC double-bit errors detected
  • Alert: storage target unavailable
  • Alert: queue pending time > threshold (per cluster SLA)
  • Alert: disk usage > 85% on management and storage nodes
  • Alert notification channels (email, Slack, PagerDuty) tested

Phase 7: Operational Readiness

Documentation

  • Network diagram (physical and logical) current and stored in shared location
  • IPMI/BMC credentials in password manager (not in a spreadsheet)
  • SLURM configuration changes tracked in version control (Git)
  • Operating runbook: daily checks, common procedures, escalation contacts
  • User documentation: cluster access, job submission, software modules

Training and Handoff

  • System administrators trained on cluster procedures
  • First-line support staff know how to check SLURM queue and node status
  • User onboarding documentation provided to pilot users
  • Feedback mechanism established for user issues

Go-Live

  • Pilot test with 3–5 users and representative workloads for 2 weeks
  • No open P1 or P2 issues from pilot
  • Monitoring alerts tested end-to-end (fire alert, verify notification received)
  • Backup restore tested and documented
  • SLURM HA failover tested (if configured)
  • Sign-off from system owner before production announcement

A completed checklist does not guarantee a perfect cluster, but it dramatically reduces the probability of avoidable failures in the first production weeks. Adapt this template to your organization’s specific requirements and compliance obligations.

For HPC cluster deployment and commissioning services, contact Mevasis.