AI Inference Orchestration with Kubernetes on Instinct MI300X, Part 2#

February 14, 2025 by Victor Robles.

Welcome to Part 2 of our series on utilizing Kubernetes with the AMD Instinct platform! If you’re just joining us, we recommend checking out Part 1 where we covered setting up your Kubernetes cluster and enabling AMD GPU support.

In this post, we’ll guide you through deploying and scaling the vLLM inference engine, implementing MetalLB for efficient load balancing, and optimizing multi-GPU deployments to maximize the performance of your AMD Instinct accelerators. By the end of this guide, we’ll have a scalable and production-ready AI inference solution in place.

Deploying vLLM#

Let’s start by deploying the vLLM inference engine using the amd/Llama-3.2-1B-Instruct-FP8-KV model. We’ll run it on a single AMD Instinct GPU and back it with the persistent volume claim we established in Part 1 to cache the model files. Here’s our vllm-deployment.yaml file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-3-2-1b  # Name of the deployment
  namespace: default  # Kubernetes namespace where the deployment resides
  labels:
    app: llama-3-2-1b  # Label to identify the app
spec:
  replicas: 1  # Number of replicas (instances) of the deployment
  selector:
    matchLabels:
      app: llama-3-2-1b  # Ensures the pods match this label for management
  template:
    metadata:
      labels:
        app: llama-3-2-1b  # Labels assigned to the pod for identification
    spec:
      volumes:
      - name: cache-volume  # Volume for persistent storage (e.g., caching model files)
        persistentVolumeClaim:
          claimName: llama-3.2-1b  # PersistentVolumeClaim that backs this volume
      # vLLM requires shared memory for tensor parallel inference.
      - name: shm  # Volume for shared memory (in-memory storage for high-speed access)
        emptyDir:
          medium: Memory  # Use memory as the storage medium
          sizeLimit: "2Gi"  # Limit shared memory size to 2GiB
      containers:
      - name: llama-3-2-1b  # Name of the container running inside the pod
        image: docker.io/rocm/vllm-dev:main  # Docker image for vLLM with ROCm GPU support
        command: ["/bin/sh", "-c"]  # Command to run in the container
        args: [
          "vllm serve amd/Llama-3.2-1B-Instruct-FP8-KV --chat-template /app/vllm/examples/tool_chat_template_llama3.2_json.jinja" # Command argument to serve the specified model
        ]        
        ports:
        - containerPort: 8000  # Port exposed by the container for inference requests
        resources:
          limits:
            cpu: "10"  # Maximum CPU resources the container can use
            memory: 20G  # Maximum memory the container can use
            amd.com/gpu: "1"  # Limit to 1 AMD GPU
          requests:
            cpu: "2"  # Minimum CPU resources requested for the container
            memory: 6G  # Minimum memory resources requested for the container
            amd.com/gpu: "1"  # Request 1 AMD GPU for this container
        volumeMounts:
        - mountPath: /root/.cache/huggingface  # Path where the cache volume is mounted
          name: cache-volume  # Reference to the cache volume defined earlier
        - name: shm  # Mount the shared memory volume
          mountPath: /dev/shm  # Path to shared memory inside the container

After creating this file, deploy it to your cluster with:

kubectl apply -f vllm-deployment.yaml
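
If you prefer to block until the rollout finishes instead of polling, kubectl rollout status can watch the deployment (name and namespace as defined in the manifest above):

kubectl rollout status deployment/llama-3-2-1b -n default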

Once the deployment has been created, we can query the cluster to see how many GPUs our workload is using:

kubectl describe nodes  |  tr -d '\000' | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p'  | grep -e Name  -e  amd.com  | perl -pe 's/\n//'  |  perl -pe 's/Name:/\n/g' | sed 's/amd.com\/gpu:\?//g'  | sed '1s/^/Node Available(GPUs)  Used(GPUs)/' | sed 's/$/ 0 0 0/'  | awk '{print $1, $2, $3}'  | column -t

The output shows a single GPU in use:

Node                Available(GPUs)  Used(GPUs)
mi300x-server01     8                1
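
For a quicker, less formatted look at a single node, you can also grep the node description directly; the output includes the capacity, allocatable, and currently allocated GPU counts (node name taken from the output above):

kubectl describe node mi300x-server01 | grep "amd.com/gpu"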

Monitor the progress of the deployment with:

kubectl get pods

The status will show ContainerCreating while Kubernetes pulls the container image; once the container starts, vLLM downloads the model.

NAME                            READY   STATUS              RESTARTS   AGE
llama-3-2-1b-6587cc94f9-lt9j8   0/1     ContainerCreating   0          7m44s

Note

Depending on the speed of your internet connection, the image pull and model download can take a considerable amount of time.

Once the deployment is ready, the STATUS changes to Running.

NAME                            READY   STATUS              RESTARTS   AGE
llama-3-2-1b-6587cc94f9-lt9j8   1/1     Running             0          7m44s
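
A Running status only means the container has started; to confirm that vLLM has finished loading the model and is listening on port 8000, tail the container logs (the exact startup messages may vary between vLLM versions):

kubectl logs deployment/llama-3-2-1b --tail=20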

Expose the API#

With vLLM up and running, our next step is making it accessible to end users. We’ll accomplish this by creating a service that exposes our deployment externally. The vllm-service.yaml manifest below makes our Llama model accessible in a scalable and manageable way.

Create the service#

First, we create the vllm-service.yaml manifest.

apiVersion: v1  # Defines the API version for the Service resource
kind: Service  # Specifies the resource type as a Service
metadata:
  name: llama-3-2-1b  # Name of the Service, used for referencing in other resources
  namespace: default  # Kubernetes namespace where the Service is created
spec:
  ports:
  - name: http-llama-3-2-1b  # A descriptive name for the port (optional but recommended)
    port: 80  # The external port that clients will use to connect to the service
    protocol: TCP  # The protocol for communication (default is TCP)
    targetPort: 8000  # The port on the pod where traffic will be forwarded
  selector:
    app: llama-3-2-1b  # Ensures that traffic is routed to pods with this label
  sessionAffinity: None  # No session stickiness; requests can be distributed across pods
  type: LoadBalancer  # Exposes the service externally with a LoadBalancer (e.g., cloud or MetalLB)

Next, apply the manifest:

kubectl apply -f vllm-service.yaml

After applying the service manifest, let’s check its status:

kubectl get svc -n default

You’ll notice the EXTERNAL-IP is in a pending state - this is expected since we haven’t configured our load balancer yet:

NAME           TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
llama-3-2-1b   LoadBalancer   10.152.183.137   <pending>     80:31588/TCP   3m48s
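
Even before the load balancer is in place, you can sanity-check the service from the host by port-forwarding to it; 8080 is an arbitrary local port, and /v1/models is vLLM’s OpenAI-compatible model listing endpoint:

kubectl port-forward svc/llama-3-2-1b 8080:80

Then, in a second terminal:

curl http://localhost:8080/v1/models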

Install the load balancer#

To resolve the pending external IP, we’ll implement MetalLB as our load balancer. This powerful tool will not only assign external IPs but also enable us to efficiently distribute traffic across our scaled deployment.

Install MetalLB#

First, install the MetalLB Custom Resource Definitions:

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/main/config/manifests/metallb-native.yaml

This sets up the foundational resources required by MetalLB. Before proceeding further, confirm that the MetalLB components are up and running by checking that their pods are functional:

kubectl get pods -n metallb-system

The output should show the MetalLB pods with a STATUS of Running:

NAME                         READY   STATUS    RESTARTS   AGE
controller-8b7b6bf6b-jffh9   1/1     Running   0          3m5s
speaker-db2pc                1/1     Running   0          3m5s
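
Optionally, you can block until the MetalLB pods report ready before moving on (the app=metallb label comes from the MetalLB manifests):

kubectl wait --namespace metallb-system --for=condition=ready pod --selector=app=metallb --timeout=120s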

Identify the correct network interface#

To configure MetalLB, we need to identify the network interface that Kubernetes uses. Here’s how to find it:

First, check which IP your Kubernetes API server is using:

kubectl get nodes -o wide

Look for the INTERNAL-IP column - this shows which network interface Kubernetes is using. Then, find the corresponding interface details:

ip addr | grep -A2 "<INTERNAL-IP-FROM-PREVIOUS-STEP>"

For example, if you see:

3: enp33s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
    inet 10.216.70.211/22 brd 10.216.71.255 scope global enp33s0f0

This confirms enp33s0f0 is our target interface with subnet /22.

Calculate safe IP range#

To avoid conflicts, follow these guidelines for choosing your IP range:

  1. Identify the network range:

    • For our example /22 subnet (10.216.70.211):

    • Network range: 10.216.68.0 - 10.216.71.255

    • Usable IPs: 10.216.68.1 - 10.216.71.254

  2. Choose a safe range by:

    • Staying away from the gateway (usually .1)

    • Avoiding existing node IPs

    • Using the upper end of the range (less likely to conflict)

For our example, we’ll use: 10.216.71.230 - 10.216.71.240

Warning

Always verify your chosen IP range doesn’t conflict with the following (a quick verification sketch appears after this list):

  • Existing server/node IPs

  • DHCP ranges

  • Other network services
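
A rough way to double-check both the subnet math and that nothing currently answers on the candidate range is sketched below; it assumes python3 is available on the host, and the ping sweep is best effort since some hosts drop ICMP:

# Confirm the network that 10.216.70.211/22 belongs to (should print 10.216.68.0/22)
python3 -c "import ipaddress; print(ipaddress.ip_interface('10.216.70.211/22').network)"

# Spot-check the candidate range 10.216.71.230-240 for hosts that already respond
for last in $(seq 230 240); do
  ping -c 1 -W 1 "10.216.71.${last}" > /dev/null && echo "10.216.71.${last} appears to be in use"
done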

Configure MetalLB#

After determining a safe IP range, we set up the pool of IP addresses that MetalLB can assign to our Kubernetes services by creating and applying a metallb-config.yaml manifest:

apiVersion: metallb.io/v1beta1  # Specifies the MetalLB API version
kind: IPAddressPool  # Defines a pool of IP addresses to be managed by MetalLB
metadata:
  name: default  # Name of the IP address pool
  namespace: metallb-system  # Namespace where MetalLB resources are managed
spec:
  addresses:
  - 10.216.71.230-10.216.71.240  # Range of IP addresses available for allocation
---
apiVersion: metallb.io/v1beta1  # Specifies the MetalLB API version
kind: L2Advertisement  # Configures Layer 2 (ARP/NDP) advertisement for the IP pool
metadata:
  name: l2-config  # Name of the L2Advertisement resource
  namespace: metallb-system  # Namespace where MetalLB resources are managed
spec: {}  # Empty spec indicates default configuration for Layer 2 advertisement

Don’t forget to apply the config!

kubectl apply -f metallb-config.yaml
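
To confirm that both resources were created, you can list them by their full CRD names:

kubectl get ipaddresspools.metallb.io,l2advertisements.metallb.io -n metallb-system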

Confirm vLLM service availability#

With the load balancer now in place, we can check whether the service has been assigned an external address,

kubectl get svc

which should now show an IP address under EXTERNAL-IP:

NAME           TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)        AGE
llama-3-2-1b   LoadBalancer   10.152.183.137   10.216.71.230   80:31588/TCP   17m

We can now send a request payload to our vLLM engine with curl and prettify the output by piping it to jq. In a terminal on the host, we run:

curl http://10.216.71.230/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "amd/Llama-3.2-1B-Instruct-FP8-KV",
        "prompt": "San Francisco is a",
        "max_tokens": 8,
        "temperature": 0
      }' | jq .

Note

Be sure the model name matches the model you are serving and the IP address matches the one your load balancer assigned to your service.

You should see a response payload similar to the one below, with the text generated by the model under the choices JSON object.

{
  "id": "cmpl-fec47d49ee844e7390d5aeed317e05d2",
  "object": "text_completion",
  "created": 1733849007,
  "model": "amd/Llama-3.2-1B-Instruct-FP8-KV",
  "choices": [
    {
      "index": 0,
      "text": " city that is known for its beautiful beaches",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 13,
    "completion_tokens": 8,
    "prompt_tokens_details": null
  }
}
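
Because the deployment passes a chat template to vllm serve, the OpenAI-compatible chat endpoint should work as well; here is a sketch, with the same caveats about matching the model name and external IP:

curl http://10.216.71.230/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "amd/Llama-3.2-1B-Instruct-FP8-KV",
        "messages": [{"role": "user", "content": "Name one city on the US west coast."}],
        "max_tokens": 32
      }' | jq .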

Scaling our deployment#

Now that we have our vLLM service running smoothly with load balancing in place, let’s leverage the full power of our AMD Instinct GPUs. Since the Llama 3.2 1B model comfortably fits on a single GPU, we can scale our deployment to utilize all available accelerators.

Note

Be mindful of the total number of CPUs and the amount of RAM on your machine. You might have to adjust the resources section of the vllm-deployment.yaml manifest so that the scaled replicas do not exceed the machine’s available CPUs and RAM.
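
One way to check the node’s headroom before scaling is to query its allocatable resources; the amd.com/gpu column assumes the AMD GPU device plugin from Part 1 is installed:

kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory,GPU:.status.allocatable.amd\.com/gpu'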

Scaling#

We scale our deployment with:

kubectl scale -n default deployment llama-3-2-1b --replicas=8

We use kubectl get pods to confirm that the new replicas of our vLLM deployment are running:

kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
llama-3-2-1b-6587cc94f9-7c2pd   1/1     Running   0          5s
llama-3-2-1b-6587cc94f9-dlkjr   1/1     Running   0          5s
llama-3-2-1b-6587cc94f9-gwknm   1/1     Running   0          5s
llama-3-2-1b-6587cc94f9-lt9j8   1/1     Running   0          32m
llama-3-2-1b-6587cc94f9-mvtqm   1/1     Running   0          5s
llama-3-2-1b-6587cc94f9-ptt7g   1/1     Running   0          5s
llama-3-2-1b-6587cc94f9-s7w52   1/1     Running   0          5s
llama-3-2-1b-6587cc94f9-zppx2   1/1     Running   0          5s

Now when we send API requests to the service’s external IP, MetalLB’s Layer 2 mode advertises that IP to the network and delivers the traffic into the cluster, and the Kubernetes Service then load-balances each request across our vLLM pods.
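
A rough way to see requests being spread across the replicas is to fire a handful of completions at the external IP and then check which pods logged them; this assumes the default uvicorn access logging in the vLLM image, so the exact log lines may differ:

for i in $(seq 1 10); do
  curl -s http://10.216.71.230/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "amd/Llama-3.2-1B-Instruct-FP8-KV", "prompt": "Hello", "max_tokens": 4}' > /dev/null
done

kubectl logs -l app=llama-3-2-1b --prefix --tail=20 --max-log-requests=8 | grep "POST /v1/completions"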

Summary#

In this post, we’ve made significant progress in building a scalable and high-performance AI inference solution by:

  • Deploying and scaling the vLLM inference engine on Kubernetes.

  • Implementing MetalLB for robust load balancing.

  • Optimizing deployments to harness the full power of AMD Instinct multi-GPU capabilities.

Stay tuned for the third and final part of this series, where we’ll enhance our setup with advanced monitoring and management tools. We’ll walk through configuring Prometheus for GPU performance monitoring, setting up Grafana for insightful visualizations, and implementing Open WebUI for user-friendly model interactions.

Disclaimers
Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.