Navigating vLLM Inference with ROCm and Kubernetes#

Kubernetes (often abbreviated as K8s) is an open-source platform designed for automating the deployment, scaling, and management of containerized applications. Developed by Google and now maintained by the Cloud Native Computing Foundation, Kubernetes enables developers to build, run, and manage applications across any infrastructure.
Key features of Kubernetes include:
- Automation: Automates application deployment, rolling updates, scaling, and maintenance.
- Container Orchestration: Manages containerized applications efficiently across a cluster of nodes.
- Scalability: Scales applications up or down as needed based on demand.
- Self-Healing: Automatically replaces, restarts, or reschedules containers when they fail or become unresponsive.
- Load Balancing: Distributes traffic among container replicas to ensure even load distribution and high availability.
Kubernetes operates on a control-plane/worker architecture: control-plane nodes run the components that manage the cluster’s state, while worker nodes run user applications in pods. This separation supports high availability and reliability, since the cluster’s management components can be replicated for redundancy.
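If you already have access to a cluster, you can see this control-plane/worker split by listing the nodes and their roles (illustrative output only; node names, roles, and versions will differ in your environment):
kubectl get nodes
NAME            STATUS   ROLES           AGE   VERSION
control-plane   Ready    control-plane   30d   v1.29.4
worker-gpu-1    Ready    <none>          30d   v1.29.4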
In this blog, we will walk you through deploying a vLLM Inference Service by leveraging the power of ROCm and our AMD K8s device plugin. By following these steps, you’ll be able to take advantage of Kubernetes to manage clusters equipped with powerful AI accelerators like the AMD Instinct™ MI300X, ideal for flexible AI inference workloads.
To learn more about the specifications and performance capabilities of AMD Instinct™ accelerators, visit our product page.
Setting up the K8s Cluster and vLLM#
This blog is tailored for MLOps engineers experienced in deploying Kubernetes (K8s) clusters with AMD Instinct™ accelerators and familiar with vLLM’s LLM inference solutions. We assume a solid technical foundation in advanced infrastructure deployments.
If you’re new to Kubernetes, start your journey by exploring the K8s official documentation.
If you’re new to vLLM, a widely used LLM inference and serving solution, start with its official documentation: https://docs.vllm.ai/en/latest/
As an added benefit of the collaboration between AMD and vLLM, customers can leverage the latest insights and best practices for accelerating AI inference deployments. For the most up-to-date information on ROCm and vLLM, check out these resources.
This blog assumes you already have access to an existing Kubernetes cluster that you can use to follow along with this tutorial. In addition, the AMD GPU driver must already be installed and loaded on each of your GPU nodes.
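As an optional sanity check (not part of the original steps), you can confirm on each GPU node that the amdgpu driver is loaded and that the GPUs are visible to ROCm:
lsmod | grep amdgpu   # the amdgpu kernel module should appear in the output
rocm-smi              # should list the node's AMD GPUs (requires the ROCm tools to be installed on the node)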
Install the k8s-device-plugin#
The device plugin registers AMD GPUs with the cluster so that containers can request them and be scheduled onto GPU nodes.
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-labeller.yaml
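Optionally, confirm that the device plugin and node labeller pods are running before moving on (the namespace depends on the manifests, so we search across all namespaces here):
kubectl get pods -A | grep -i amdgpu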
To verify that AMD GPUs are now schedulable as a Kubernetes resource on your cluster, check that the new resource type amd.com/gpu is listed for the GPU nodes. The command below should return the number of GPUs available on each node.
kubectl get nodes -o custom-columns=NAME:.metadata.name,"GPUs Total:.status.capacity.amd\.com/gpu","GPUs Allocatable:.status.allocatable.amd\.com/gpu"
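The output should look similar to the following illustrative example, which assumes a single GPU node named node-mi300x with eight GPUs (your node names and counts will differ):
NAME          GPUs Total   GPUs Allocatable
node-mi300x   8            8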
Prepare the K8s yaml files#
The Secret is optional and only required for accessing gated models; you can skip this step if you are not using gated models.
Here is an example hf_token.yaml:
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
  namespace: default
type: Opaque
data:
  token: "REPLACE_WITH_TOKEN"
NOTE: use base64 to encode your HF token before placing it in hf_token.yaml:
echo -n <your HF TOKEN> | base64
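Alternatively (an equivalent shortcut, not part of the original steps), you can let kubectl create the same Secret for you and handle the base64 encoding itself:
kubectl create secret generic hf-token-secret --from-literal=token=<your HF TOKEN> -n default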
Define the deployment workload in deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
  namespace: default
  labels:
    app: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      volumes:
      # vLLM needs to access the host's shared memory for tensor parallel inference.
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "8Gi"
      hostNetwork: true
      hostIPC: true
      containers:
      - name: mistral-7b
        image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
        securityContext:
          seccompProfile:
            type: Unconfined
          capabilities:
            add:
            - SYS_PTRACE
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve mistralai/Mistral-7B-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "10"
            memory: 20G
            amd.com/gpu: "1"
          requests:
            cpu: "6"
            memory: 6G
            amd.com/gpu: "1"
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
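Before applying this manifest in a later step, you can optionally validate it client-side without creating any resources:
kubectl apply -f deployment.yaml --dry-run=client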
Define the service in service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: mistral-7b
  namespace: default
spec:
  ports:
  - name: http-mistral-7b
    port: 80
    protocol: TCP
    targetPort: 8000
  # The label selector should match the deployment labels & it is useful for prefix caching feature
  selector:
    app: mistral-7b
  sessionAffinity: None
  type: ClusterIP
Launch the pods#
Next, let’s bring up the vLLM inference service on the cluster.
kubectl apply -f hf_token.yaml # Apply the Hugging Face token configuration to allow model downloading
kubectl apply -f deployment.yaml # Deploy the application by creating pods as defined in the deployment manifest
kubectl apply -f service.yaml # Expose the deployed application as a service for vLLM inference requests
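Before testing, make sure the pod is up and the model has finished loading; downloading the model weights can take several minutes on the first start:
kubectl get pods -l app=mistral-7b     # wait until STATUS is Running and READY is 1/1
kubectl logs -f deployment/mistral-7b  # follow the vLLM logs to confirm that model loading and server startup have completed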
Test the service#
Get the service IP with:
kubectl get svc
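The output should look similar to this illustrative example (the CLUSTER-IP shown is made up; yours will differ):
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP   30d
mistral-7b   ClusterIP   10.104.23.155   <none>        80/TCP    5m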
Here, mistral-7b is the service name. We can reach the vLLM server through the service’s CLUSTER-IP and PORT as shown below.
List the available models (use the actual CLUSTER-IP from your environment):
curl http://<CLUSTER-IP>:80/v1/models
Send a completion request:
curl http://<CLUSTER-IP>:80/v1/completions -H "Content-Type: application/json" -d '{
"model": "mistralai/Mistral-7B-v0.3",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
For the prompt “San Francisco is a”, the vLLM inference service we set up above should return a response in the terminal similar to the following:
{"id":"cmpl-ede35e4d7f654d70a8a18e8560052251","object":"text_completion","created":1738647037,"model":"mistralai/Mistral-7B-v0.3","choices":[{"index":0,"text":" city that is known for its diversity","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}}
Summary#
In this blog, we’ve shown you step-by-step how to deploy and verify a vLLM inference service using Kubernetes (K8s) on AMD Instinct™ accelerators. With this foundation in place, you’re ready to begin integrating vLLM into your real-world AI applications. The AMD K8s device plugin used in this tutorial is simple and well suited to single-node or small Kubernetes deployments; for larger deployments, consider the AMD GPU Operator, which automates installation of the K8s device plugin, node labeller, and AMD GPU device drivers across all your nodes, so you don’t have to install them on each node manually. It also provides other benefits such as GPU metrics reporting.
You may explore more examples of using K8s on AMD GPUs.
Disclaimers#
Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.