AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 1#

February 07, 2025, by Victor Robles

As organizations scale their AI inference workloads, they face the challenge of efficiently deploying and managing large language models across GPU infrastructure. This three-part blog series provides a production-ready foundation for orchestrating AI inference workloads on the AMD Instinct platform with Kubernetes.

In this post, we’ll establish the essential infrastructure by setting up a Kubernetes cluster using MicroK8s, configuring Helm for streamlined deployments, implementing persistent storage for model caching, and installing the AMD GPU Operator to seamlessly integrate AMD hardware with Kubernetes.

Part 2 will focus on deploying and scaling the vLLM inference engine, implementing MetalLB for load balancing, and optimizing multi-GPU deployments to maximize the performance of AMD Instinct accelerators.

The series concludes in Part 3 with implementing Prometheus for metrics collection, Grafana for performance visualization, and Open WebUI for interacting with models deployed with vLLM.

Let’s begin with Part 1, where we’ll build the foundational components needed for a production-ready AI inference platform.

MI300X Test System Specifications#

Node Type:       Supermicro AS -8125GS-TNMR2
CPU:             2x AMD EPYC 9654 96-Core Processor
Memory:          24x 96GB dual-rank DDR5-4800
GPU:             8x MI300X
OS:              Ubuntu 22.04.4 LTS
Kernel:          6.5.0-45-generic
ROCm:            6.2.0
AMD GPU driver:  6.7.0-1787201.22.04

Install Kubernetes (microk8s)#

MicroK8s is a lightweight but powerful Kubernetes distribution that can run on as little as a single node and scale up to mid-sized clusters. Here we use snap to install MicroK8s on the host node.

sudo apt update
sudo apt install snapd
sudo snap install microk8s --classic
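
MicroK8s can take a minute to come up. If you want to block until all core services report ready, you can optionally run:

sudo microk8s status --wait-ready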

Add the current user to the microk8s group and create the .kube directory in your home directory to store the microk8s config file.

sudo usermod -a -G microk8s $USER
mkdir -p ~/.kube
chmod 0700 ~/.kube

Apply the new microk8s group membership to your current shell session:

newgrp microk8s

Note

For convenience, you can add the following alias to your .bashrc to use the native kubectl command with microk8s:

echo "alias kubectl='microk8s kubectl'" >> ~/.bashrc; source ~/.bashrc

Let’s confirm our cluster is up and running.

kubectl get nodes

We should see the STATUS as Ready

NAME                 STATUS   ROLES    AGE   VERSION
mi300x-server01      Ready    <none>   5m   v1.31.3

Since this is a single-node instance of Kubernetes, we will need to label our node as a control-plane node in order for the AMD GPU Operator to successfully find the node.

Note

For vanilla Kubernetes installations this is not required, but if you're running a single-node cluster you will need to remove the control-plane taint to be able to schedule workloads on that node.
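
On recent Kubernetes versions that taint can be removed with a command along the following lines (shown with this guide's node name; substitute your own, and note the trailing dash, which removes the taint):

kubectl taint nodes mi300x-server01 node-role.kubernetes.io/control-plane-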

We can apply the label in one line:

kubectl label node $(kubectl get nodes --no-headers | grep "<none>" | awk '{print $1}') node-role.kubernetes.io/control-plane=''

Let’s confirm the node has been labeled correctly

kubectl get nodes

We should see the ROLES as control-plane

NAME                 STATUS   ROLES           AGE   VERSION
mi300x-server01      Ready    control-plane   8m   v1.31.3

Install Helm#

The AMD GPU Operator, Grafana, and Prometheus installations are facilitated by Helm charts. To install the latest version of Helm, we download the install script from the Helm repository and run it.

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

If you are using microk8s as your Kubernetes instance, you also need to point Helm at the MicroK8s kubeconfig by setting the KUBECONFIG environment variable. This is accomplished by exporting the variable in the .bashrc of the current user.

echo "export KUBECONFIG=/var/snap/microk8s/current/credentials/client.config" >> ~/.bashrc; source ~/.bashrc

Note

For vanilla Kubernetes installations this is not required unless you also wish to change the kubeconfig location from the default $HOME/.kube/config.
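
Either way, a quick way to confirm Helm can reach the cluster is to list releases across all namespaces; an empty table (rather than a connection error) means Helm is talking to the right cluster:

helm list --all-namespaces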

Persistent Storage#

Next, we enable persistent storage so that we don’t have to keep downloading the model when starting or scaling up an instance of the vLLM inference server. Here are methods for both microk8s and vanilla Kubernetes.

Microk8s#

Enable storage on microk8s with:

microk8s enable storage
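
(On newer MicroK8s releases this addon is named hostpath-storage; enabling storage maps to it.) The addon registers a hostPath-backed storage class, which you can confirm before creating the claim, since the PVC below references it by name:

kubectl get storageclass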

Create a persistent volume claim (PVC) using vllm-pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim  # Defines a PersistentVolumeClaim (PVC) resource
metadata:
  name: llama-3.2-1b  # Name of the PVC, used for reference in deployments
  namespace: default  # Kubernetes namespace where this PVC resides
spec:
  accessModes:
  - ReadWriteOnce  # Access mode indicating the volume can be mounted as read-write by a single node
  resources:
    requests:
      storage: 50Gi  # Specifies the amount of storage requested for this volume
  storageClassName: microk8s-hostpath  # Specifies the storage class to use (e.g., MicroK8s hostPath)
  volumeMode: Filesystem  # Indicates that the volume will be formatted as a filesystem

Apply the persistent volume claim in the console with:

kubectl apply -f vllm-pvc.yaml
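
You can check the claim with the command below. Depending on the storage class's volume binding mode, the PVC may show as Bound immediately or remain Pending until a pod first mounts it; either state is fine at this point.

kubectl get pvc llama-3.2-1b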

Note

For microk8s, hostpath provisioner will store all volume data in /var/snap/microk8s/common/default-storage on the host machine.

Vanilla Kubernetes#

In environments without dynamic storage provisioning, define a PersistentVolume (PV) that the PVC can bind to with pv.yaml:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: llama-3.2-1b-pv  # Name of the PersistentVolume; PVs are cluster-scoped, so no namespace is set
spec:
  capacity:
    storage: 50Gi  # Amount of storage available in this PV
  accessModes:
  - ReadWriteOnce  # Access mode for the PV
  hostPath:
    path: /mnt/data/llama  # Path on the host where the data is stored
  volumeMode: Filesystem  # Indicates that the volume will be formatted as a filesystem

Define a PersistentVolumeClaim (PVC) with vllm-pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-3.2-1b
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce  # Access mode indicating the volume can be mounted as read-write by a single node
  resources:
    requests:
      storage: 50Gi  # Amount of storage requested
  volumeMode: Filesystem  # Indicates that the volume will be formatted as a filesystem

Note

The PVC will bind to this PV based on matching storage capacity and accessModes attributes.

Apply the two manifests:

kubectl apply -f pv.yaml
kubectl apply -f vllm-pvc.yaml
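
To verify that the claim bound to the statically provisioned volume, check both objects; the PVC STATUS should read Bound and reference llama-3.2-1b-pv:

kubectl get pv llama-3.2-1b-pv
kubectl get pvc llama-3.2-1b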

Install the AMD GPU Operator#

We can now proceed with installing the AMD GPU Operator. First, we need to install cert-manager as a prerequisite; start by adding the Jetstack repo to Helm.

helm repo add jetstack https://charts.jetstack.io

Then run the install for cert-manager via Helm.

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.15.1 \
  --set crds.enabled=true
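
Before moving on, confirm the cert-manager pods are running; you should see the controller, cainjector, and webhook pods in the Running state:

kubectl get pods -n cert-manager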

Finally, install the AMD GPU Operator.

# Add the Helm repository
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update

# Install the GPU Operator
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu --create-namespace
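
As a quick sanity check, confirm the operator's pods come up in the kube-amd-gpu namespace (exact pod names vary by chart version):

kubectl get pods -n kube-amd-gpu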

Configure AMD GPU Operator#

For our workload we will use a default device-config.yaml file to configure the AMD GPU Operator. This sets up the device plugin, node labeler, and metrics exporter to properly assign AMD GPUs to workloads, label the nodes that have AMD GPUs, and export metrics for Grafana and Prometheus. For a full list of configurable options refer to the Full Reference Config page of the AMD GPU Operator documentation.

apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-operator
  # use the namespace where AMD GPU Operator is running
  namespace: kube-amd-gpu
spec:
  driver:
    # disable the installation of out-of-tree amdgpu kernel module
    enable: false

  devicePlugin:
    # Specify the device plugin image
    # default value is rocm/k8s-device-plugin:latest
    devicePluginImage: rocm/k8s-device-plugin:latest

    # Specify the node labeller image
    # default value is rocm/k8s-device-plugin:labeller-latest
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest

    # Specify to enable/disable the node labeller
    # node labeller is required for adding / removing blacklist config of amdgpu kernel module
    # please set to true if you want to blacklist the inbox driver and use the out-of-tree driver
    enableNodeLabeller: true
        
  metricsExporter:
    # To enable/disable the metrics exporter, disabled by default
    enable: true
    # kubernetes service type for metrics exporter, clusterIP(default) or NodePort
    serviceType: "NodePort"
    # internal service port used for in-cluster and node access to pull metrics from the metrics-exporter (default 5000)
    port: 5000
    # Node port for metrics exporter service, metrics endpoint $node-ip:$nodePort
    nodePort: 32500
    # exporter image
    image: "docker.io/rocm/device-metrics-exporter:v1.0.0"

  # Specify the node to be managed by this DeviceConfig Custom Resource
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"

We apply the device-config file with:

kubectl apply -f device-config.yaml
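
At this point the operator should roll out the device plugin, node labeller, and metrics exporter described in the DeviceConfig. Assuming the custom resource is registered under the amd.com API group as shown above, you can inspect it and the resulting pods with:

kubectl get deviceconfigs -n kube-amd-gpu
kubectl get pods -n kube-amd-gpu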

To confirm the node labeler is working we run

kubectl get nodes -L feature.node.kubernetes.io/amd-gpu

and we should see AMD-GPU set as true

NAME                 STATUS   ROLES           AGE    VERSION   AMD-GPU
mi300x-server01      Ready    control-plane   15m   v1.31.3   true

To show available GPUs for workloads we can use:

kubectl get nodes -o custom-columns=NAME:.metadata.name,"Total GPUs:.status.capacity.amd\.com/gpu","Allocatable GPUs:.status.allocatable.amd\.com/gpu"

Now, we can see the total available GPUs

NAME             Total GPUs   Allocatable GPUs
mi300x-server01  8            8
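
Since the DeviceConfig exposes the metrics exporter on NodePort 32500, you can also spot-check that GPU metrics are being served. This assumes the exporter provides a Prometheus-style /metrics endpoint; replace <node-ip> with your node's IP address:

curl -s http://<node-ip>:32500/metrics | head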

Summary#

In this post, we’ve established a solid foundation for AI inference workloads by:

  • Setting up a Kubernetes cluster with MicroK8s

  • Configuring essential components like Helm

  • Implementing persistent storage for model management

  • Installing and validating the AMD GPU Operator

The next installment in this series will walk through deploying and scaling vLLM for inference, implementing MetalLB for load balancing, and optimizing multi-GPU deployments on AMD Instinct hardware. Part 3 will round out the series by deploying Open WebUI as a front end and configuring monitoring and management with Prometheus and Grafana. Stay tuned!

Disclaimers
Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.