OpenStack HPC Technical Guide: Installation, SLURM Integration, and Best Practices
OpenStack HPC infrastructure guide: kolla-ansible container-native deployment, SLURM integration for hybrid scheduling, Ironic bare-metal provisioning, BeeGFS storage configuration, common problems (NUMA mismatch, MTU issues, Heat stack conflicts), and production best practices.
OpenStack unifies open-source components under a single IaaS platform, giving enterprise HPC infrastructure multi-tenant resource management and cloud-like flexibility. This guide covers the key concepts of OpenStack HPC architecture, the installation process, common problems, and best practices.
Core Components and Their HPC Roles
OpenStack HPC depends on coordinated operation of several critical services:
Nova is the compute orchestration layer, scheduling both VM and bare-metal workloads. Ironic is the bare-metal provisioning service that brings physical HPC nodes into the OpenStack ecosystem; it is the preferred approach for large-scale simulations requiring InfiniBand. Neutron creates isolated VLAN or VXLAN segments per project at the network virtualization layer; combined with SR-IOV-capable NICs, it lets instance traffic bypass the software switch and the latency it adds. Keystone connects enterprise identity infrastructure to the platform via LDAP or Active Directory federation.
On the storage side, BeeGFS parallel filesystem is deployed as the POSIX layer for scratch and working directories, while Cinder NVMe volumes are allocated for VM boot and database workloads. This separation prevents HPC jobs requiring parallel I/O from hitting storage bottlenecks.
Installation with kolla-ansible
Mevasis’s container-native deployment approach runs each OpenStack service in a separate container. This architecture significantly simplifies platform updates, horizontal scaling, and failure detection.
Before installation, define an HPC-focused profile in globals.yml. Ironic and Heat must be enabled, and object storage should be disabled if BeeGFS is being used instead of Swift. NUMATopologyFilter and PciPassthroughFilter must be added to the Nova scheduler's enabled filters. Review the complete filter list, not only its order: a host must pass every enabled filter, so ordering mainly affects scheduling cost, whereas a missing filter can leave nodes with GPUs or InfiniBand cards unexpectedly open to unintended workloads.
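A pre-deployment check can catch a missing filter before the first instance is scheduled. The sketch below assumes the filters reach Nova as a comma-separated `enabled_filters` value under `[filter_scheduler]` (for example through a nova.conf override); the function name and the required set are illustrative and not part of kolla-ansible or Nova. Note that Nova spells the class name NUMATopologyFilter.

```python
# Hypothetical pre-flight check. REQUIRED_FILTERS and missing_filters are
# illustrative names, not part of kolla-ansible or Nova.
REQUIRED_FILTERS = {"NUMATopologyFilter", "PciPassthroughFilter"}


def missing_filters(enabled_filters: str) -> set[str]:
    """Return the required HPC filters absent from an enabled_filters value."""
    enabled = {name.strip() for name in enabled_filters.split(",") if name.strip()}
    return REQUIRED_FILTERS - enabled


if __name__ == "__main__":
    value = "ComputeFilter,ImagePropertiesFilter,NUMATopologyFilter"
    # Prints the filters that still need to be added to the list.
    print(sorted(missing_filters(value)))
```

An empty result means both HPC filters are present; anything else should block the deployment run.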
During Ironic node registration, note that PXE boot cannot complete without correct IPMI credentials and MAC addresses. Additionally, the deployment network used by Ironic must be separated from the HPC data network; otherwise PXE traffic will negatively impact MPI communication.
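Both conditions can be checked before any node is enrolled. The following is a minimal sketch, assuming node records are kept as plain dictionaries: the field names mirror Ironic's IPMI `driver_info` keys, but the dictionary layout and both function names are assumptions made for illustration. Non-overlapping subnets are only a proxy for separation; the networks also need their own VLANs or physical links.

```python
import ipaddress

# Illustrative record layout; only the ipmi_* key names follow Ironic's
# IPMI driver_info conventions.
REQUIRED_FIELDS = ("ipmi_address", "ipmi_username", "ipmi_password", "mac")


def node_problems(node: dict) -> list[str]:
    """List the enrollment fields that are missing or empty for one node."""
    return [field for field in REQUIRED_FIELDS if not node.get(field)]


def networks_separated(deploy_cidr: str, data_cidr: str) -> bool:
    """True when the provisioning and HPC data subnets do not overlap."""
    deploy = ipaddress.ip_network(deploy_cidr)
    data = ipaddress.ip_network(data_cidr)
    return not deploy.overlaps(data)
```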
Hybrid Scheduling with SLURM
One of OpenStack HPC’s most powerful features is its interoperability with SLURM. In this integration, SLURM manages fixed-capacity partitions while pending jobs (PD state) in an overloaded queue are routed to dynamically provisioned new nodes via the OpenStack API. When jobs complete, nodes are automatically deleted, preventing resource waste.
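The burst decision itself is simple arithmetic. The sketch below assumes pending jobs are listed with an invocation such as `squeue -h -o "%i|%t|%D"` (job ID, compact state, node count) and that the site caps how many dynamic nodes may exist at once; the function names and the cap are hypothetical, and the actual OpenStack provisioning call is left out.

```python
def pending_node_demand(squeue_output: str) -> int:
    """Sum the node counts requested by pending (PD) jobs.

    Expects one "jobid|state|nodes" record per line.
    """
    total = 0
    for line in squeue_output.splitlines():
        parts = line.strip().split("|")
        if len(parts) != 3 or parts[1] != "PD":
            continue  # skip running jobs and malformed lines
        total += int(parts[2])
    return total


def nodes_to_provision(demand: int, idle_nodes: int,
                       burst_nodes: int, burst_limit: int) -> int:
    """Nodes to request from OpenStack, capped by the remaining burst quota."""
    shortfall = max(demand - idle_nodes, 0)
    headroom = max(burst_limit - burst_nodes, 0)
    return min(shortfall, headroom)
```

Capping by remaining headroom keeps a burst of submissions from exhausting the project quota in a single pass.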
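For this architecture to work reliably, SLURM prolog and epilog scripts must manage OpenStack session tokens. A token obtained when a job is submitted can expire while the job waits in the queue, and node provisioning then fails with an authentication error. Either set the Keystone token lifetime ([token] expiration) longer than the longest expected queue wait, or have the scripts request a fresh token immediately before each provisioning call.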
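One way to keep long-pending jobs from hitting an expired token is to cache the token and renew it shortly before expiry. This is a minimal sketch, not a Keystone client: `fetch` is a caller-supplied function (hypothetical here) that authenticates and returns the token together with its expiry time, and the five-minute margin is an arbitrary assumption.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Cache a token and refresh it shortly before it expires.

    fetch returns (token, expires_at_epoch_seconds); the real Keystone
    authentication it would perform is not shown.
    """

    def __init__(self, fetch: Callable[[], Tuple[str, float]],
                 margin: float = 300.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch
        self._margin = margin
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Re-authenticate when nothing is cached or expiry is within the margin.
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            self._token, self._expires_at = self._fetch()
        return self._token
```

A prolog script would call `get()` right before each provisioning request, so the time a job spent pending no longer matters.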
Common Problems and Solutions
NUMA Mismatch: When VMs are placed without regard for NUMA boundaries, memory accesses cross sockets and memory bandwidth drops significantly. Ensure NUMATopologyFilter is active and flavors are defined with the hw:numa_nodes property.
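As a small illustration of the flavor side, the helper below builds the extra spec and rejects sizes that cannot be split evenly. It assumes the simple case where vCPUs and memory are divided equally across guest NUMA nodes, with no explicit per-node layout; the function name is hypothetical.

```python
def numa_extra_specs(vcpus: int, ram_mb: int, numa_nodes: int) -> dict[str, str]:
    """Build flavor extra specs for an even split across guest NUMA nodes."""
    if numa_nodes < 1:
        raise ValueError("numa_nodes must be at least 1")
    # Assumption: an even split, so both values must be divisible.
    if vcpus % numa_nodes or ram_mb % numa_nodes:
        raise ValueError("vCPUs and RAM must divide evenly across NUMA nodes")
    return {"hw:numa_nodes": str(numa_nodes)}
```

The resulting property can be applied with `openstack flavor set --property hw:numa_nodes=2 <flavor>`.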
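Ironic Node Cleanup Timeout: For large NVMe disks, disk wiping can outlast the default cleaning timeout. Raise [conductor] clean_callback_timeout, or reduce the work itself: [conductor] clean_step_priority_override changes the priority of individual clean steps (it does not extend timeouts), for example to favor a metadata-only erase over a full-disk wipe.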
Neutron MTU Mismatch: If jumbo frames (MTU 9000) are used on the HPC data network, Neutron network definitions must match the physical NIC MTU values of compute nodes. Mismatches cause packet loss in MPI communication.
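The MTU an instance should see depends on the encapsulation. A short worked example, assuming an IPv4 underlay where VXLAN adds 50 bytes of headers and VLAN or flat networks add none to the payload size; the table and function name are illustrative.

```python
# Encapsulation overhead in bytes, assuming an IPv4 underlay.
OVERHEAD = {"flat": 0, "vlan": 0, "vxlan": 50}


def tenant_mtu(physical_mtu: int, network_type: str) -> int:
    """MTU available to instances on a network of the given type."""
    if network_type not in OVERHEAD:
        raise ValueError(f"unknown network type: {network_type}")
    return physical_mtu - OVERHEAD[network_type]


if __name__ == "__main__":
    for kind in ("vlan", "vxlan"):
        print(kind, tenant_mtu(9000, kind))
```

With 9000-byte jumbo frames on the physical NICs, a VXLAN project network can carry 8950 bytes, while a VLAN network carries the full 9000.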
Heat Stack Update Conflicts: When large cluster templates are updated, the Nova scheduler receiving too many simultaneous provisioning requests can cause resource conflicts. Apply template updates in small groups using a rolling update strategy.
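The batching behind a rolling update can be sketched in a few lines. The helper below only splits a node list into groups; the batch size is an assumption to tune per site, and the per-batch update call is deliberately omitted. Heat resource groups also provide their own rolling-update policy, which may be preferable to external batching.

```python
from typing import Iterator, Sequence


def rolling_batches(nodes: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most batch_size nodes."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(nodes), batch_size):
        yield list(nodes[start:start + batch_size])


if __name__ == "__main__":
    cluster = [f"node{i}" for i in range(5)]
    for batch in rolling_batches(cluster, 2):
        # Update this batch and wait for it to settle before continuing.
        print(batch)
```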
Best Practices
In production environments, run the OpenStack control plane in high-availability (HA) mode across three physical controller nodes, so that a single node failure never results in platform downtime. Prefer SAML or Kerberos over plain LDAP authentication for Keystone; with federated or ticket-based login, user passwords are never sent to Keystone.
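The choice of three controllers follows from majority quorum. A worked example, assuming the clustered services (such as a Galera database cluster) stay available only while more than half of their members are reachable; the function name is illustrative.

```python
def tolerated_failures(members: int) -> int:
    """Failures a majority-quorum cluster survives while keeping quorum."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return (members - 1) // 2


if __name__ == "__main__":
    for size in (2, 3, 4, 5):
        print(size, tolerated_failures(size))
```

Two controllers tolerate no failure and four tolerate no more than three do, which is why odd-sized control planes are the usual choice.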
Carry BeeGFS and Cinder storage traffic over separate physical network cards. Running both parallel filesystem traffic and Cinder replication over a single NIC degrades both. For GPU-equipped nodes, use the pci_passthrough:alias extra spec in flavor design; restricting access to those flavors per project then controls who can draw from the GPU resource pool.
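For the GPU flavor, the extra spec value takes the form alias:count. A minimal sketch follows; the alias name `a100` is hypothetical and would have to match a PCI alias already defined in the Nova configuration, and the function name is illustrative.

```python
def gpu_extra_specs(alias: str, count: int) -> dict[str, str]:
    """Build the PCI passthrough extra spec for a GPU flavor."""
    if not alias:
        raise ValueError("alias must not be empty")
    if count < 1:
        raise ValueError("count must be at least 1")
    return {"pci_passthrough:alias": f"{alias}:{count}"}


if __name__ == "__main__":
    # "a100" is a made-up alias name used only for illustration.
    print(gpu_extra_specs("a100", 2))
```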
In capacity planning, remember that controller nodes do not scale linearly with compute node count: Galera cluster and RabbitMQ sizing becomes especially critical in environments with more than 100 compute nodes.
OpenStack HPC provides a powerful foundation for multi-tenant research environments and hybrid AI/HPC workloads. For organization-specific architecture assessment and turnkey installation services, visit our OpenStack HPC solution page or contact us directly.