Kubernetes GPU Cluster Technical Guide: GPU Operator, Volcano Scheduler, and Multi-Tenant Architecture
Kubernetes was designed for stateless microservices, not HPC batch jobs. Yet the GPU computing workloads that now dominate AI/ML infrastructure have driven Kubernetes adoption into territory that SLURM traditionally owned. The ecosystem has responded with NVIDIA GPU Operator, Volcano scheduler, and RDMA networking support that together make Kubernetes a viable platform for GPU-accelerated HPC and AI workloads.
Kubernetes vs SLURM for GPU Workloads
| Aspect | Kubernetes | SLURM |
|---|---|---|
| Container isolation | Native (per-Pod) | Via Apptainer |
| MPI integration | Requires MPI Operator | Native srun |
| GPU resource model | Extended resource (nvidia.com/gpu) | GRES plugin |
| Job scheduling | Best-effort + priority | Full fairshare + backfill |
| Gang scheduling | Via Volcano | Native |
| Fairshare | Via Volcano Queue | Built-in |
| Multi-tenant isolation | Namespace + RBAC | Accounts + partitions |
| Infrastructure as Code | YAML manifests (GitOps) | slurm.conf |
| Ecosystem | Kubeflow, Argo, Ray, MLflow | SLURM ecosystem |
Choose Kubernetes when:
- Your team has strong Kubernetes expertise
- You run MLOps pipelines with CI/CD integration
- Container-native isolation per workload is a requirement
- You need tight Kubeflow, Ray, or Argo Workflows integration
Choose SLURM when:
- You run traditional HPC workloads (MPI simulations, batch processing)
- You have strict fairshare accounting requirements
- Your operations team has an HPC background rather than a Kubernetes background
- Backfill scheduling efficiency is critical
Many organizations run both: SLURM for traditional HPC, Kubernetes for AI/ML development workflows.
GPU as Extended Resource
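Kubernetes represents GPUs as extended resources, advertised by the device plugin under the name nvidia.com/gpu. A Pod whose container requests nvidia.com/gpu: 1 is scheduled only on a node with an unallocated GPU. GPUs are requested in whole units and cannot be overcommitted, so requests and limits must match: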
# pod-with-gpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.2.0-runtime-ubuntu22.04   # CUDA images are tagged with the full x.y.z version
      resources:
        requests:
          memory: "32Gi"
          cpu: "8"
          nvidia.com/gpu: "2"   # request 2 GPUs
        limits:
          memory: "32Gi"
          cpu: "8"
          nvidia.com/gpu: "2"   # must equal requests for GPU
      command: ["python3", "/workspace/train.py"]
  nodeSelector:
    gpu: "true"                 # schedule only on GPU nodes (via node label)
  tolerations:
    - key: "nvidia.com/gpu"     # tolerate the taint placed on GPU nodes
      operator: "Exists"
      effect: "NoSchedule"
NVIDIA GPU Operator
Manually installing NVIDIA drivers, container toolkit, and device plugin across all GPU nodes is error-prone and creates maintenance debt. GPU Operator automates this:
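# Install GPU Operator via Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# mig.strategy=mixed exposes each MIG profile as its own resource (nvidia.com/mig-<profile>);
# use mig.strategy=single only when every GPU on a node carries the same profile.
# dcgmExporter.serviceMonitor requires the Prometheus Operator CRDs to be installed first.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set mig.strategy=mixed \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set dcgmExporter.serviceMonitor.enabled=true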
# Verify all components are running
kubectl -n gpu-operator get pods
GPU Operator manages:
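- NVIDIA Driver: Runs as a container that loads the kernel modules, so no driver packages are installed on the host OS
- CUDA driver libraries: Made available to every Pod that requests GPUs (the CUDA toolkit itself ships in the container image)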
- Container Toolkit: Configures containerd/Docker for GPU access
- Device Plugin: Advertises nvidia.com/gpu resources to Kubernetes
- DCGM Exporter: GPU metrics (temperature, utilization, ECC errors) for Prometheus
- MIG Manager: Partitions A100/H100 into MIG instances
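Once the MIG Manager has partitioned a GPU, a Pod can ask for a single slice instead of a whole device. The manifest below is a minimal sketch, assuming the mixed MIG strategy (under which each profile is advertised as nvidia.com/mig-<profile>) and a node that exposes at least one 1g.10gb instance. The Pod name is hypothetical.

```yaml
# mig-slice-pod.yaml (illustrative; names are assumptions)
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-job              # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
      command: ["nvidia-smi", "-L"]   # lists the devices visible to the container
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # one 1g.10gb MIG instance, not a full GPU
```

With the single strategy the resource name would stay nvidia.com/gpu, so the limit line would need to change accordingly.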
MIG (Multi-Instance GPU) configuration:
# ConfigMap for MIG profile on H100 nodes
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-balanced:
        - devices: all
          mig-enabled: true
          mig-devices:
            # A GPU has 7 compute slices: 2x1g + 1x2g + 1x3g = 7
            "1g.10gb": 2   # 2 instances at 10 GB each
            "2g.20gb": 1   # 1 instance at 20 GB (2 would exceed the 7 slices)
            "3g.40gb": 1   # 1 instance at 40 GB
Volcano Scheduler
The default Kubernetes scheduler is not designed for HPC batch workloads. It cannot do gang scheduling, fairshare, or backfill. Volcano (CNCF project) adds these capabilities:
# Install Volcano
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
# Verify Volcano scheduler is running
kubectl -n volcano-system get pods
Gang Scheduling
Gang scheduling ensures that all Pods of a distributed training job start simultaneously — or none start. Without gang scheduling, some Pods start and consume GPU resources while waiting for other Pods, causing deadlock or severe resource waste.
# PyTorch distributed training job with gang scheduling
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-train
spec:
  minAvailable: 8            # gang: all 8 workers must start together, or none
  schedulerName: volcano
  queue: research-queue
  tasks:
    - replicas: 8
      name: worker
      template:
        spec:
          restartPolicy: OnFailure   # do not restart workers that finish successfully
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
              resources:
                limits:
                  nvidia.com/gpu: "1"
              command:
                - torchrun
                - --nnodes=8
                - --nproc_per_node=1
                # rendezvous settings (for example --rdzv_endpoint) are omitted here
                - train.py
If 7 of 8 worker Pods would fit on available nodes but the 8th would not, Volcano holds all 8 rather than starting a partial job that would deadlock.
Queue Management
Volcano queues implement team-level resource quotas:
# Define queues with resource weights
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research-team-a
spec:
  weight: 40                 # relative share: weight / sum of all queue weights, not a percentage
  capability:
    nvidia.com/gpu: "32"     # hard limit: max 32 GPUs for this queue
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research-team-b
spec:
  weight: 35
  capability:
    nvidia.com/gpu: "24"
Preemption
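Higher-priority jobs can preempt lower-priority ones, provided the preempt action is enabled in the Volcano scheduler configuration:
# High-priority job that can preempt
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: urgent-inference
spec:
  priorityClassName: high-priority   # must be defined as PriorityClass
  # The preempting job needs no extra field. Jobs that may be evicted are
  # marked on their own side (Volcano uses the volcano.sh/preemptable annotation).
  ...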
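The priorityClassName above refers to a cluster-scoped PriorityClass object that has to exist before the job is submitted. A minimal sketch follows. The numeric values, the descriptions, and the low-priority class are illustrative assumptions; only the name high-priority comes from the job above.

```yaml
# priority-classes.yaml (values are assumptions, tune to your own tiers)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000                 # higher value wins during scheduling and preemption
globalDefault: false
description: "Urgent jobs that may preempt lower-priority work"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority           # hypothetical class for preemptable batch jobs
value: 1000
globalDefault: false
description: "Batch jobs that can be evicted when urgent work arrives"
```

Keeping globalDefault set to false on both classes means workloads that name no class are left at the cluster default priority rather than silently joining either tier.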
Multi-Tenant Architecture
For multi-tenant GPU clusters, combine namespace isolation, RBAC, ResourceQuota, and Volcano:
# Namespace for Team A
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
---
# Resource quota for Team A's namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "16"   # extended resources accept only the requests. prefix in quotas
    requests.memory: "512Gi"
    persistentvolumeclaims: "20"
---
# LimitRange ensures Pods don't request unlimited memory
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-limits
  namespace: team-a
spec:
  limits:
    - type: Container
      default:
        memory: "8Gi"
      defaultRequest:
        memory: "4Gi"
      max:
        memory: "128Gi"
        nvidia.com/gpu: "8"
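The manifests above cover the namespace, quota, and limit pieces. The RBAC piece could look like the following sketch, which lets members of a team submit and inspect Volcano jobs only inside their own namespace. The Role name, binding name, and group name team-a-developers are hypothetical, and the group is assumed to come from whatever identity provider the cluster uses.

```yaml
# team-a-rbac.yaml (illustrative; role and group names are assumptions)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-job-runner          # hypothetical name
  namespace: team-a
rules:
  - apiGroups: ["batch.volcano.sh"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-job-runner-binding  # hypothetical name
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers        # hypothetical group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-job-runner
  apiGroup: rbac.authorization.k8s.io
```

Because this uses a namespaced Role rather than a ClusterRole, the grant cannot reach into another team's namespace.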
Installation: 4-Phase Process
Phase 1 — Requirements and design:
- Determine GPU node count and models (H100, A100, L40S)
- Choose single-zone or multi-zone topology
- Select network plugin: Calico/Cilium for standard, Multus + SR-IOV for InfiniBand
- Define tenant structure (teams, quotas)
Phase 2 — Node preparation:
- Install Ubuntu 22.04 or RHEL 9 on all nodes
- Configure BIOS: IOMMU enabled, SR-IOV enabled (for InfiniBand passthrough)
- Install MLNX_OFED if using InfiniBand
- Verify NTP synchronization
Phase 3 — Kubernetes and component installation:
# Install Kubernetes (kubeadm method)
kubeadm init --pod-network-cidr=10.244.0.0/16 \
  --apiserver-advertise-address=<control-plane-ip>
# Install CNI
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
# Join worker nodes (run on each worker)
kubeadm join <control-plane-ip>:6443 --token <token> --discovery-token-ca-cert-hash <hash>
# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace
# Install Volcano
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
# Install monitoring (Prometheus + Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace
Phase 4 — Validation and training:
# Verify GPU resources visible
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
# Run NCCL all-reduce test across 4 nodes
kubectl apply -f nccl-test-job.yaml
# Check GPU metrics
kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 3000:80
# Open DCGM dashboard in Grafana
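Before the multi-node NCCL test, a single-GPU smoke test can confirm that the driver, container toolkit, and device plugin work end to end. The Job below is a minimal sketch and is not the nccl-test-job.yaml referenced above. The Job name is hypothetical, and the toleration only matters if GPU nodes carry the nvidia.com/gpu taint.

```yaml
# gpu-smoke-test.yaml (illustrative; not the NCCL test manifest)
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-smoke-test             # hypothetical name
spec:
  backoffLimit: 0                  # fail fast instead of retrying
  template:
    spec:
      restartPolicy: Never
      tolerations:
        - key: "nvidia.com/gpu"    # only needed if GPU nodes are tainted
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: nvidia-smi
          image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
          command: ["nvidia-smi"]  # prints driver version and visible GPUs
          resources:
            limits:
              nvidia.com/gpu: 1
```

If the Job completes and its log shows the expected GPU model, the single-node GPU stack is working and any remaining problem is more likely in the network or scheduler layer.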
InfiniBand with Multus CNI
For distributed training that requires InfiniBand RDMA:
# NetworkAttachmentDefinition for InfiniBand
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ib-network
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "ib-network",
      "type": "sriov",
      "deviceID": "0000:02:00.0",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.20.0/24"
      }
    }
---
# Pod with both standard network and InfiniBand
apiVersion: v1
kind: Pod
metadata:
  name: nccl-test                  # a Pod needs a name; this one mirrors the container name
  annotations:
    k8s.v1.cni.cncf.io/networks: ib-network
spec:
  containers:
    - name: nccl-test
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/gpu: "1"
          rdma/hca: "1"            # InfiniBand HCA resource
Common Problems
GPU Operator DaemonSet Pods CrashLoopBackOff:
kubectl -n gpu-operator describe pod nvidia-driver-daemonset-xxx
# Often: the driver container does not match the node's running kernel
# Fix: pick a driver release that supports that kernel (check the kernel with 'uname -r')
kubectl -n gpu-operator edit clusterpolicy
# Update spec.driver.version to an NVIDIA driver version (not the kernel version)
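For orientation, the field being edited sits under spec.driver in the ClusterPolicy object. The fragment below is a sketch of that part of the object only, not a complete manifest to apply. It assumes the object is named cluster-policy, and the version value is left as a placeholder for an NVIDIA driver release string.

```yaml
# Fragment of the ClusterPolicy as seen in 'kubectl edit' (other fields omitted)
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy             # assumed object name
spec:
  driver:
    enabled: true
    version: "<driver-version>"    # NVIDIA driver release, not the output of 'uname -r'
```

After the change is saved, the driver DaemonSet Pods should roll to the new version, which can be watched with the kubectl get pods command shown earlier.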
Volcano Job stuck in Pending:
kubectl describe vcjob pytorch-train
# Common reasons:
# minAvailable not met: not enough GPU nodes available
# Queue resource limit hit: team has exceeded quota
# Check queue status
kubectl get queue
kubectl describe queue research-queue
Namespace isolation violations:
# Check if LimitRange is applied
kubectl -n team-a describe limitrange team-a-limits
# Without a LimitRange, a single Pod can claim the whole namespace quota,
# and Pods that omit memory requests are rejected once the quota tracks requests.memory
kubectl -n team-a get resourcequota -o yaml