Kubernetes GPU Cluster Technical Guide: GPU Operator, Volcano Scheduler, and Multi-Tenant Architecture

Kubernetes GPU cluster technical guide: Kubernetes vs SLURM comparison, GPU as extended resource (nvidia.com/gpu), NVIDIA GPU Operator components, Volcano scheduler (gang scheduling, preemption, queue, backfill), multi-tenant architecture, 4-phase installation, InfiniBand/RoCE with Multus CNI, common problems, and best practices.

Kubernetes was designed for stateless microservices, not HPC batch jobs. Yet the GPU computing workloads that now dominate AI/ML infrastructure have driven Kubernetes adoption into territory that SLURM traditionally owned. The ecosystem has responded with NVIDIA GPU Operator, Volcano scheduler, and RDMA networking support that together make Kubernetes a viable platform for GPU-accelerated HPC and AI workloads.

Kubernetes vs SLURM for GPU Workloads

| Aspect | Kubernetes | SLURM |
|---|---|---|
| Container isolation | Native (per-Pod) | Via Apptainer |
| MPI integration | Requires MPI Operator | Native srun |
| GPU resource model | Extended resource (nvidia.com/gpu) | GRES plugin |
| Job scheduling | Best-effort + priority | Full fairshare + backfill |
| Gang scheduling | Via Volcano | Native |
| Fairshare | Via Volcano Queue | Built-in |
| Multi-tenant isolation | Namespace + RBAC | Accounts + partitions |
| Infrastructure as Code | YAML manifests (GitOps) | slurm.conf |
| Ecosystem | Kubeflow, Argo, Ray, MLflow | SLURM ecosystem |

Choose Kubernetes when:

  • Your team has strong Kubernetes expertise
  • You run MLOps pipelines with CI/CD integration
  • Container-native isolation per workload is a requirement
  • You need tight Kubeflow, Ray, or Argo Workflows integration

Choose SLURM when:

  • Your workloads are traditional HPC (MPI simulations, batch processing)
  • You have strong fairshare accounting requirements
  • Your operations team has an HPC background rather than a Kubernetes background
  • Backfill scheduling efficiency is critical

Many organizations run both: SLURM for traditional HPC, Kubernetes for AI/ML development workflows.

GPU as Extended Resource

Kubernetes represents GPUs as extended resources. A container that requests nvidia.com/gpu: 1 will be scheduled only on a node that has an available GPU:

# pod-with-gpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.2.0-runtime-ubuntu22.04   # CUDA image tags include the patch version
    resources:
      requests:
        memory: "32Gi"
        cpu: "8"
        nvidia.com/gpu: "2"      # request 2 GPUs
      limits:
        memory: "32Gi"
        cpu: "8"
        nvidia.com/gpu: "2"      # must equal requests for GPU
    command: ["python3", "/workspace/train.py"]
  nodeSelector:
    gpu: "true"                  # schedule only on GPU nodes (via node label)
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

NVIDIA GPU Operator

Manually installing NVIDIA drivers, container toolkit, and device plugin across all GPU nodes is error-prone and creates maintenance debt. GPU Operator automates this:

# Install GPU Operator via Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set mig.strategy=single \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set dcgmExporter.serviceMonitor.enabled=true

# Verify all components are running
kubectl -n gpu-operator get pods

GPU Operator manages:

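  • NVIDIA Driver: Runs as a container, so no driver packages are installed on the host OS (the kernel modules are still loaded into the host kernel)
  • CUDA driver libraries: Mounted into Pods that request GPUs; the CUDA toolkit and runtime themselves come from the container image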
  • Container Toolkit: Configures containerd/Docker for GPU access
  • Device Plugin: Advertises nvidia.com/gpu resources to Kubernetes
  • DCGM Exporter: GPU metrics (temperature, utilization, ECC errors) for Prometheus
  • MIG Manager: Partitions A100/H100 into MIG instances
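
Once the Operator's Pods are running, a throwaway Pod that runs nvidia-smi is a quick way to confirm that the driver, container toolkit, and device plugin work together. The following is a minimal sketch, not an official validation procedure: the Pod name gpu-smoke-test, the file name, and the image tag are assumptions, and the toleration only matters if your GPU nodes carry the nvidia.com/gpu taint used in the earlier example.

```yaml
# gpu-smoke-test.yaml (hypothetical file name)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # illustrative name
spec:
  restartPolicy: Never            # run once and exit
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed tag; any CUDA base image should do
    command: ["nvidia-smi"]       # binary is provided by the driver via the container toolkit
    resources:
      limits:
        nvidia.com/gpu: "1"       # one GPU is enough for the check
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```

If the stack is healthy, kubectl logs gpu-smoke-test should show the usual nvidia-smi table with one GPU listed. A Pod stuck in Pending usually means no node is advertising nvidia.com/gpu yet.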

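MIG (Multi-Instance GPU) configuration. A layout that mixes profile sizes, like the one below, needs mig.strategy=mixed; the single strategy expects every MIG device on a node to use the same profile: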

# ConfigMap for MIG profile on H100 nodes
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-balanced:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 2    # 2 instances at 10 GB each
            "2g.20gb": 1    # 1 instance at 20 GB (2 + 2 + 3 = 7 compute slices, the per-GPU maximum)
            "3g.40gb": 1    # 1 instance at 40 GB
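
A sketch of how a workload might consume one of these slices. It assumes the Operator runs with mig.strategy=mixed, that the ConfigMap is referenced through the chart's migManager.config.name value, and that the node has been labeled nvidia.com/mig.config=all-balanced so MIG Manager applies the layout. Under the mixed strategy each profile is advertised under its own resource name. The Pod name and image tag are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-test            # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed tag
    command: ["nvidia-smi", "-L"]  # lists the devices visible inside the container
    resources:
      limits:
        nvidia.com/mig-1g.10gb: "1"   # one 1g.10gb slice (mixed strategy resource name)
```

With mig.strategy=single the same slice would instead be requested as nvidia.com/gpu, which is why that strategy requires a uniform profile per node.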

Volcano Scheduler

The default Kubernetes scheduler is not designed for HPC batch workloads. It cannot do gang scheduling, fairshare, or backfill. Volcano (CNCF project) adds these capabilities:

# Install Volcano
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

# Verify Volcano scheduler is running
kubectl -n volcano-system get pods

Gang Scheduling

Gang scheduling ensures that all Pods of a distributed training job start simultaneously — or none start. Without gang scheduling, some Pods start and consume GPU resources while waiting for other Pods, causing deadlock or severe resource waste.

# PyTorch distributed training job with gang scheduling
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-train
spec:
  minAvailable: 8          # gang: all 8 workers must start together, or none
  schedulerName: volcano
  queue: research-queue
  tasks:
  - replicas: 8
    name: worker
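    template:
      spec:
        restartPolicy: OnFailure   # the Pod default (Always) is rarely wanted for batch jobs
        containers:
        - name: pytorch
          image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
          resources:
            limits:
              nvidia.com/gpu: "1"
          command:
          - torchrun
          - --nnodes=8
          - --nproc_per_node=1
          # Rendezvous flags (e.g. --rdzv_backend, --rdzv_endpoint) are omitted here;
          # a multi-node run needs them so the workers can find each other.
          - train.py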

If 7 of 8 worker Pods would fit on available nodes but the 8th would not, Volcano holds all 8 rather than starting a partial job that would deadlock.

Queue Management

Volcano queues implement team-level resource quotas:

# Define queues with resource weights
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research-team-a
spec:
  weight: 40                # relative share: 40 / sum of all queue weights (40% only if they total 100)
  capability:
    nvidia.com/gpu: "32"    # hard limit: max 32 GPUs for this queue
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research-team-b
spec:
  weight: 35
  capability:
    nvidia.com/gpu: "24"

Preemption

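Higher-priority jobs can preempt lower-priority ones, provided the preempt action is enabled in the Volcano scheduler configuration (it may not be in a default install):

# High-priority job that can preempt lower-priority jobs
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: urgent-inference
spec:
  priorityClassName: high-priority   # must be defined as a PriorityClass
  # No "preemptable" setting here: that marks a job as evictable, not as a preemptor.
  # Victim jobs can opt in through the volcano.sh/preemptable: "true" annotation.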
  ...
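
The priorityClassName above only works if a matching PriorityClass object exists in the cluster. A minimal sketch of such a pair of classes follows; the numeric values, the low-priority name, and the descriptions are assumptions chosen for illustration, and only the relative ordering of the values matters.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority             # referenced by the urgent job
value: 1000000                    # assumed value; higher wins
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Urgent jobs that may preempt lower-priority work."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority              # hypothetical class for best-effort training
value: 1000
globalDefault: false
description: "Best-effort jobs that can be preempted."
```

PriorityClass is cluster-scoped, so in a multi-tenant cluster it is worth restricting who may create or reference high-value classes.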

Multi-Tenant Architecture

For multi-tenant GPU clusters, combine namespace isolation, RBAC, ResourceQuota, and Volcano:

# Namespace for Team A
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
---
# Resource quota for Team A's namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "16"   # quotas on extended resources accept only the requests. prefix
    requests.memory: "512Gi"
    persistentvolumeclaims: "20"
---
# LimitRange ensures Pods don't request unlimited memory
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-limits
  namespace: team-a
spec:
  limits:
  - type: Container
    default:
      memory: "8Gi"
    defaultRequest:
      memory: "4Gi"
    max:
      memory: "128Gi"
      nvidia.com/gpu: "8"

Installation: 4-Phase Process

Phase 1 — Requirements and design:

  • Determine GPU node count and models (H100, A100, L40S)
  • Choose single-zone or multi-zone topology
  • Select network plugin: Calico/Cilium for standard, Multus + SR-IOV for InfiniBand
  • Define tenant structure (teams, quotas)

Phase 2 — Node preparation:

  • Install Ubuntu 22.04 or RHEL 9 on all nodes
  • Configure BIOS: IOMMU enabled, SR-IOV enabled (for InfiniBand passthrough)
  • Install MLNX_OFED if using InfiniBand
  • Verify NTP synchronization

Phase 3 — Kubernetes and component installation:

# Install Kubernetes (kubeadm method)
kubeadm init --pod-network-cidr=10.244.0.0/16 \
  --apiserver-advertise-address=<control-plane-ip>

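# Install CNI (this is a legacy URL; current Calico releases publish versioned manifests, so check the Calico docs)
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml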

# Join worker nodes
kubeadm join <control-plane-ip>:6443 --token <token> --discovery-token-ca-cert-hash <hash>

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace

# Install Volcano
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

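# Install monitoring (Prometheus + Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace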

Phase 4 — Validation and training:

# Verify GPU resources visible
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'

# Run NCCL all-reduce test across 4 nodes
kubectl apply -f nccl-test-job.yaml

# Check GPU metrics
kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 3000:80
# Open DCGM dashboard in Grafana
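
Beyond dashboards, the DCGM metrics can drive alerts. The following is a sketch of a PrometheusRule, assuming the DCGM Exporter is being scraped and that the Helm release is named kube-prometheus-stack, whose default rule selector matches on the release label. The rule name, the 85 degree threshold, and the durations are assumptions to adjust for your hardware.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts                # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the Prometheus rule selector
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GPUHighTemperature
      expr: DCGM_FI_DEV_GPU_TEMP > 85      # assumed threshold in degrees Celsius
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} has been above 85 C for 5 minutes"
    - alert: GPUXidError
      expr: DCGM_FI_DEV_XID_ERRORS > 0     # last XID error code reported by the driver
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} reported an XID error"
```

If the alerts never appear in the Prometheus UI, check that the labels on the rule match the ruleSelector of your Prometheus resource.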

InfiniBand with Multus CNI

For distributed training that requires InfiniBand RDMA:

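# NetworkAttachmentDefinition for InfiniBand
# Note: "sriov" is the Ethernet/RoCE SR-IOV CNI; native InfiniBand VFs typically use "ib-sriov".
# In most setups the VF's deviceID is injected by Multus from the SR-IOV device plugin
# rather than hardcoded as it is below.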
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ib-network
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "ib-network",
      "type": "sriov",
      "deviceID": "0000:02:00.0",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.20.0/24"
      }
    }
---
# Pod with both standard network and InfiniBand
apiVersion: v1
kind: Pod
metadata:
  name: nccl-test               # a Pod requires metadata.name (or generateName)
  annotations:
    k8s.v1.cni.cncf.io/networks: ib-network
spec:
  containers:
  - name: nccl-test
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/gpu: "1"
        rdma/hca: "1"         # InfiniBand HCA resource

Common Problems

GPU Operator DaemonSet Pods CrashLoopBackOff:

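kubectl -n gpu-operator describe pod nvidia-driver-daemonset-xxx
kubectl -n gpu-operator logs nvidia-driver-daemonset-xxx
# Often: the driver container has no build for the node's running kernel
# Fix: choose an NVIDIA driver version that supports that kernel (check it with 'uname -r')
kubectl -n gpu-operator edit clusterpolicy
# Set spec.driver.version to the chosen NVIDIA driver release, not to the kernel version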

Volcano Job stuck in Pending:

kubectl describe vcjob pytorch-train
# Common reasons:
# minAvailable not met: not enough GPU nodes available
# Queue resource limit hit: team has exceeded quota

# Check queue status
kubectl get queue
kubectl describe queue research-queue

Namespace isolation violations:

# Check if LimitRange is applied
kubectl -n team-a describe limitrange team-a-limits

# A LimitRange supplies default requests and limits. Without one, Pods that omit
# memory requests are rejected once a ResourceQuota covers requests.memory
kubectl -n team-a get resourcequota -o yaml

For Kubernetes GPU cluster deployment, GPU Operator configuration, and Volcano scheduler tuning, contact Mevasis.