Multinode Fine-Tuning of Stable Diffusion XL on AMD GPUs with Hugging Face Accelerate and OCI’s Kubernetes Engine (OKE)#
As the scale and complexity of generative AI and deep learning models grow, multinode training, the practice of distributing a training job across several nodes, has become an essential strategy for speeding up the training and fine-tuning of large generative AI models like SDXL. By spreading the workload across multiple GPUs on multiple nodes, a multinode setup can significantly reduce training time. In this blog post we show you, step by step, how to set up and fine-tune a Stable Diffusion XL (SDXL) model across multiple nodes of Oracle Cloud Infrastructure’s (OCI) Kubernetes Engine (OKE) on AMD GPUs using ROCm.
Getting things ready: OCI OKE, RoCE, SDXL and Hugging Face Accelerate#
This blog post guides you through setting up and fine-tuning an SDXL model on a multinode cluster with 8 GPUs per node. A working configuration file is provided in this blog’s GitHub repository, along with instructions on how to customize the setup to meet your specific workload requirements.
Oracle Kubernetes Engine (OKE) provides a robust and flexible platform for deploying, managing, and scaling containerized applications in the Oracle Cloud. It allows users to easily orchestrate multinode setups, making it an ideal environment for training large models on multiple AMD processors.
A critical aspect of our setup involves using RDMA over Converged Ethernet (RoCE) for internode communication. RoCE offers significant benefits over traditional Ethernet connections: it reduces latency and increases data transfer speeds between nodes, leading to better overall performance in distributed training scenarios.
Stable Diffusion XL (SDXL) is a generative AI model designed for high-quality image synthesis from text prompts. Fine-tuning this model involves adapting it to a specific task or dataset, tailoring its image styles to match those of the dataset.
In this post we will use the lambdalabs/naruto-blip-captions dataset from Hugging Face. This dataset contains images originally obtained from Narutopedia, which were captioned using the pre-trained BLIP model. Hugging Face Accelerate is used in this process to simplify and optimize training.
Keep in mind that while this blog provides an example for setting up and fine-tuning the SDXL model in a multinode environment on OCI, a similar approach can be used for setting up and fine-tuning other generative AI models.
Note: This blog focuses on setting up multinode fine-tuning, which can be easily adapted for pre-training, rather than on performance studies of multinode setups.
Implementation#
We implemented the multinode fine-tuning of SDXL on an OCI cluster with multiple nodes. Each node contains 8 AMD Instinct MI300X GPUs; in the scripts we walk through in the following section, you can adjust the number of nodes to match your available resources.
Your host machine will interact with the OKE cluster using kubectl commands. To ensure that kubectl interacts with the correct Kubernetes cluster, set the KUBECONFIG environment variable to point to the location of the configuration file (be sure to update the path to point to your specific config file):
# Please update the path to your own config file.
export KUBECONFIG=<your_own_config_file>
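To confirm that kubectl is now pointed at the intended cluster, you can, for example, list its worker nodes (the node names and count will depend on your cluster):
# Quick sanity check that kubectl reaches the right cluster
kubectl get nodes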
Because Weights & Biases (wandb) will be used to track the fine-tuning progress and a Hugging Face dataset will be used for fine-tuning, you will need to create OKE “secrets” from your wandb API key and your Hugging Face token. An OKE secret is a Kubernetes object used to securely store and manage sensitive information such as passwords, tokens, and SSH keys. It allows confidential data to be passed to your pods and containers securely.
# Create a secret for the WANDB API Key
kubectl create secret generic wandb-secret --from-literal=WANDB_API_KEY=<Your_wandb_API_key>
# Create a secret for the Hugging Face token
kubectl create secret generic hf-secret --from-literal=HF_TOKEN=<Your_Hugging_Face_token>
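If you want to double-check that both secrets were created, one way is to list them (the output will vary with your cluster):
# List the two secrets created above
kubectl get secrets wandb-secret hf-secret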
Create a Kubernetes ConfigMap by downloading the Hugging Face Accelerate configuration file, default_config_accelerate.yaml, from the src folder of this blog’s GitHub repository to your working directory on your host machine and running the command below:
kubectl create configmap accelerate-config --from-file=default_config_accelerate.yaml
A Kubernetes ConfigMap stores configuration information in key-value pairs. The command above creates a ConfigMap that contains the default_config_accelerate.yaml file, allowing your pods to access and use the Hugging Face Accelerate configuration.
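As a quick sanity check, you can inspect the ConfigMap to confirm that it contains the Accelerate configuration, for example:
# Show the ConfigMap, including the embedded default_config_accelerate.yaml
kubectl get configmap accelerate-config -o yaml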
Download accelerate_blog_multinode.yaml from the src folder of this blog’s GitHub repository to your host machine and adjust the paths in the file to align with your actual file system.
Start the fine-tuning process by running the command below:
kubectl apply -f accelerate_blog_multinode.yaml
If everything is set up correctly, you’ll see the following output:
service/sdxl-headless-svc created
configmap/sdxl-finetune-multinode-config created
job.batch/sdxl-finetune-multinode created
You can then monitor the fine-tuning progress via the wandb dashboard or other supported reporting and logging platforms.
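You can also follow progress directly from the job’s pods with kubectl; for example (replace <pod-name> with one of the pod names returned by the first command):
# List the pods created by the fine-tuning job
kubectl get pods -l job-name=sdxl-finetune-multinode

# Stream the training logs from one of the pods
kubectl logs -f <pod-name>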
The accelerate_blog_multinode.yaml file is organized into three sections, Service, ConfigMap, and Job, separated by dashes (---). Breaking the file down into these three sections shows how Kubernetes orchestrates the different components necessary for the multinode fine-tuning of SDXL. This modular approach provides flexibility and simplifies the management of complex workloads, especially in a multinode environment.
The Service section specifies how the application running within the pods is exposed as a service inside the cluster. Because it is a headless service (clusterIP: None), it does not load-balance traffic; instead it gives each pod of the job a stable DNS name on the specified port, which the training processes use to reach one another. This setup ensures robust inter-node communication within the cluster.
Service section of the yaml file:
apiVersion: v1
kind: Service
metadata:
  name: sdxl-headless-svc
spec:
  clusterIP: None
  ports:
  - port: 12342
    protocol: TCP
    targetPort: 12342
  selector:
    job-name: sdxl-finetune-multinode
The ConfigMap section provides additional key-value pairs for inter-node communication. Storing these settings in this section makes it easier to manage and update them without altering the container images or pod specifications.
ConfigMap section of the yaml file:
apiVersion: v1
kind: ConfigMap
metadata:
  name: sdxl-finetune-multinode-config
data:
  headless_svc: sdxl-headless-svc
  job_name: sdxl-finetune-multinode
  master_addr: sdxl-finetune-multinode-0.sdxl-headless-svc
  master_port: '12342'
  num_replicas: '3'
The Job section describes how the fine-tuning process is run. This includes which container image to use, which resources to allocate to the job, which command to run within the container, and how the ConfigMap holding the Accelerate settings is mounted into the container.
Job section of the yaml file:
apiVersion: batch/v1
kind: Job
metadata:
  name: sdxl-finetune-multinode
spec:
  backoffLimit: 0
  completions: 3
  parallelism: 3
  completionMode: Indexed
  template:
    metadata:
      labels:
        job: sdxl-multinode-job
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: accelerate-sdxl
          image: rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.1.2
          securityContext:
            privileged: true
            capabilities:
              add: [ "IPC_LOCK" ]
          env:
            - name: HIP_VISIBLE_DEVICES
              value: "0,1,2,3,4,5,6,7"
            - name: HIP_FORCE_DEV_KERNARG
              value: "1"
            - name: GPU_MAX_HW_QUEUES
              value: "2"
            - name: USE_ROCMLINEAR
              value: "1"
            - name: NCCL_SOCKET_IFNAME
              value: "rdma0"
            - name: MASTER_ADDRESS
              valueFrom:
                configMapKeyRef:
                  key: master_addr
                  name: sdxl-finetune-multinode-config
            - name: MASTER_PORT
              valueFrom:
                configMapKeyRef:
                  key: master_port
                  name: sdxl-finetune-multinode-config
            - name: NCCL_IB_HCA
              value: "mlx5_0,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_7,mlx5_8,mlx5_9"
            - name: HEADLESS_SVC
              valueFrom:
                configMapKeyRef:
                  key: headless_svc
                  name: sdxl-finetune-multinode-config
            - name: NNODES
              valueFrom:
                configMapKeyRef:
                  key: num_replicas
                  name: sdxl-finetune-multinode-config
            - name: NODE_RANK
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
            - name: WANDB_API_KEY
              valueFrom:
                secretKeyRef:
                  name: wandb-secret
                  key: WANDB_API_KEY
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: HF_TOKEN
          volumeMounts:
            - mountPath: /mnt
              name: model-weights-volume
            - mountPath: /etc/config
              name: diffusers-config-volume
            - { mountPath: /dev/infiniband, name: devinf }
            - { mountPath: /dev/shm, name: shm }
          resources:
            requests:
              amd.com/gpu: 8
            limits:
              amd.com/gpu: 8
          command: ["/bin/bash", "-c", "--"]
          args:
            - |
              # Clone the GitHub repo
              git clone --recurse https://github.com/ROCm/bitsandbytes.git
              cd bitsandbytes
              git checkout rocm_enabled
              # Install dependencies
              pip install -r requirements-dev.txt
              # Use -DBNB_ROCM_ARCH to specify target GPU arch
              cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S .
              make
              pip install .
              cd ..
              # Set up Hugging Face authentication using the secret
              mkdir -p ~/.huggingface
              echo $HF_TOKEN > ~/.huggingface/token
              pip install deepspeed==0.14.5 wandb
              git clone https://github.com/huggingface/diffusers && cd diffusers && pip install -e . && cd examples/text_to_image && pip install -r requirements_sdxl.txt
              export EXP_DIR=./output
              mkdir -p output
              LOG_FILE="${EXP_DIR}/sdxl_$(date '+%Y-%m-%d_%H-%M-%S')_MI300_SDXL_FINETUNE.log"
              export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
              export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
              export DATASET_NAME="lambdalabs/naruto-blip-captions"
              export ACCELERATE_CONFIG_FILE="/etc/config/default_config_accelerate.yaml"
              export HF_HOME=/mnt/huggingface
              accelerate launch --config_file $ACCELERATE_CONFIG_FILE \
                --main_process_ip $MASTER_ADDRESS \
                --main_process_port $MASTER_PORT \
                --machine_rank $NODE_RANK \
                --num_processes $((8 * NNODES)) \
                --num_machines $NNODES train_text_to_image_sdxl.py \
                --pretrained_model_name_or_path=$MODEL_NAME \
                --pretrained_vae_model_name_or_path=$VAE_NAME \
                --dataset_name=$DATASET_NAME \
                --resolution=512 --center_crop --random_flip \
                --proportion_empty_prompts=0.1 \
                --train_batch_size=12 \
                --gradient_checkpointing \
                --num_train_epochs=500 \
                --use_8bit_adam \
                --learning_rate=1e-04 --lr_scheduler="cosine" --lr_warmup_steps=200 \
                --mixed_precision="fp16" \
                --validation_prompt="a cute Sundar Pichai creature" --validation_epochs 20 \
                --checkpointing_steps=1000 \
                --report_to="wandb" \
                --output_dir="sdxl-naruto-model" 2>&1 | tee "$LOG_FILE"
              sleep 30m
      volumes:
        - name: model-weights-volume
          hostPath:
            path: /mnt/model_weights
            type: Directory
        - name: diffusers-config-volume
          configMap:
            name: accelerate-config
        - { name: devinf, hostPath: { path: /dev/infiniband }}
        - { name: shm, emptyDir: { medium: Memory, sizeLimit: 512Gi }}
      restartPolicy: Never
      subdomain: sdxl-headless-svc
Please note that you can change the num_replicas, completions, and parallelism values in the accelerate_blog_multinode.yaml file to any number between 1 and the total number of nodes in your cluster, to specify how many nodes to use for the multinode workload. Currently they are set to 3, meaning fine-tuning runs on 3 nodes, each with 8 GPUs. You will need to modify the configuration to match your actual infrastructure. For instance, if your nodes only have 4 GPUs each, you’ll need to change the number of GPUs requested per pod (resources -> requests -> amd.com/gpu:) to 4, and also adjust HIP_VISIBLE_DEVICES and the --num_processes calculation (currently 8 * NNODES) in the args, ensuring the correct number of nodes and GPUs is allocated.
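As an illustrative sketch, running the same workload on a 2-node cluster would mean editing these values in accelerate_blog_multinode.yaml (the numbers here are only an example; set them to match your own cluster):
# In the ConfigMap section
num_replicas: '2'

# In the Job section
completions: 2
parallelism: 2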
The sleep 30m command at the end of the Job section, under spec -> template -> spec -> containers -> args, keeps the pods running for an additional 30 minutes after training completes. This gives you time to review the results and troubleshoot any issues. You can adjust the sleep time or remove this command as needed.
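While the pods are kept alive by sleep 30m, you can, for example, open a shell in one of them to inspect the log file and checkpoints before the job terminates (replace <pod-name> with an actual pod name from kubectl get pods):
# Open an interactive shell inside a running pod
kubectl exec -it <pod-name> -- /bin/bash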
Summary#
In this blog post we showed you how to set up and fine-tune a generative AI model on Oracle Cloud Infrastructure’s (OCI) Kubernetes Engine (OKE) using a cluster of AMD GPUs. You can use this tutorial as a starting point and adjust the YAML file to reflect your own network resources and the specific needs of your particular task.
Acknowledgment#
We want to thank the OCI team for helping us set up the multinode environment to implement the workload.
Disclaimers#
Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.