Multinode Fine-Tuning of Stable Diffusion XL on AMD GPUs with Hugging Face Accelerate and OCI’s Kubernetes Engine (OKE)

October 15, 2024, by Douglas Jia

As generative AI and deep learning models grow in scale and complexity, multinode training, which divides a training job across GPUs on several machines, has become an essential strategy for speeding up the training and fine-tuning of large models such as SDXL. In this blog post we will show you, step by step, how to set up and fine-tune a Stable Diffusion XL (SDXL) model on a multinode Oracle Cloud Infrastructure (OCI) Kubernetes Engine (OKE) cluster with AMD GPUs using ROCm.

Getting things ready: OCI OKE, RoCE, SDXL and Hugging Face Accelerate

This blog post will guide you through the setup and fine-tuning of an SDXL model on a multinode cluster with 8 GPUs on each node. A working configuration file is provided in this blog’s GitHub repository, along with instructions on how to customize the setup to meet your specific workload requirements.

Oracle Kubernetes Engine (OKE) provides a robust and flexible platform for deploying, managing, and scaling containerized applications in the Oracle Cloud. It allows users to easily orchestrate multinode setups, making it well suited to training large models across multiple AMD GPUs.

A critical aspect of our setup is the use of RDMA over Converged Ethernet (RoCE) for internode communication. Compared with traditional Ethernet connections, RoCE reduces latency and increases data transfer speeds between nodes, which improves overall performance in distributed training scenarios.

Stable Diffusion XL (SDXL) is a generative AI model designed for high-quality image synthesis with text prompts. Fine-tuning this model involves adapting it to specific tasks or datasets, tailoring its image styles to match those of the dataset.

In this post we will use the lambdalabs/naruto-blip-captions dataset from Hugging Face. This dataset contains images originally obtained from Narutopedia, which were captioned using the pre-trained BLIP model. Hugging Face Accelerate launches and coordinates the training processes across GPUs and nodes, which simplifies the distributed setup.

Keep in mind that while this blog provides an example for setting up and fine-tuning the SDXL model in a multinode environment on OCI, a similar approach can be used for setting up and fine-tuning other generative AI models.

Note: This blog focuses on setting up multinode fine-tuning, which can be easily adapted for pre-training, rather than on performance studies of multinode setups.

Implementation

We implemented the multinode fine-tuning of SDXL on an OCI cluster with multiple nodes. Each node contains 8 AMD MI300X GPUs, and you can adjust the number of nodes in the configuration files described below to match your available resources.

Your host machine will interact with the OKE cluster using kubectl commands. To ensure that kubectl interacts with the correct Kubernetes cluster, set the KUBECONFIG environment variable to point to the location of your cluster’s configuration file:

# Please update the path to your own config file.
export KUBECONFIG=<your_own_config_file>
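
Before creating any resources, it can help to confirm that kubectl is actually talking to the intended cluster. The following is a minimal sketch rather than part of the original setup: the helper name check_cluster is hypothetical, it assumes KUBECONFIG points to a single file (not a colon-separated list), and it relies only on standard kubectl subcommands.

```shell
# Hypothetical helper: show which cluster kubectl is currently targeting.
check_cluster() {
  # Assumption: KUBECONFIG names exactly one file.
  if [ -z "${KUBECONFIG:-}" ] || [ ! -f "$KUBECONFIG" ]; then
    echo "KUBECONFIG is unset or does not point to a file" >&2
    return 1
  fi
  # Print the active context, then the nodes visible through it.
  kubectl config current-context
  kubectl get nodes -o wide
}
```

Running check_cluster should list the GPU nodes you expect to use; if it does not, revisit the KUBECONFIG path before continuing.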

Because Weights & Biases (wandb) will be used to track the fine-tuning progress and a Hugging Face dataset will be used for fine-tuning, you will need to generate an OKE “secret” using a wandb API key and a Hugging Face token. An OKE secret is a Kubernetes object used to securely store and manage sensitive information such as passwords, tokens, and SSH keys. An OKE secret allows confidential data to be passed to your pods and containers securely.

# Create a secret for the WANDB API Key
kubectl create secret generic wandb-secret --from-literal=WANDB_API_KEY=<Your_wandb_API_key>
# Create a secret for the Hugging Face token
kubectl create secret generic hf-secret --from-literal=HF_TOKEN=<Your_Hugging_Face_token>
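
To check that both secrets were stored without echoing the tokens to the terminal, one option is to look only at the length of each stored value. This is a sketch under assumptions: the helper name secret_key_length is hypothetical, and it assumes a base64 utility that accepts the -d flag.

```shell
# Hypothetical helper: print the number of bytes stored under KEY in SECRET,
# without printing the value itself.
secret_key_length() {
  local secret="$1" key="$2"
  # Secret data is base64-encoded, so decode it before counting bytes.
  kubectl get secret "$secret" -o "jsonpath={.data.$key}" \
    | base64 -d | wc -c | tr -d ' '
}

# Usage:
#   secret_key_length wandb-secret WANDB_API_KEY
#   secret_key_length hf-secret HF_TOKEN
```

A result of 0 suggests that the key is missing or empty and the secret should be recreated.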

Create a Kubernetes ConfigMap by downloading the Hugging Face Accelerate configuration file, default_config_accelerate.yaml, from the src folder of this blog’s GitHub repository to your working directory on your host machine and running the command below:

kubectl create configmap accelerate-config --from-file=default_config_accelerate.yaml

A Kubernetes ConfigMap stores configuration information in key-value pairs. The command above will create a ConfigMap that contains the default_config_accelerate.yaml file. This will allow your pods to access and use the Hugging Face Accelerate configuration.

Download accelerate_blog_multinode.yaml from this blog’s GitHub repository src folder to your host machine and adjust the paths in the file to align with your actual file system.
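
After editing the paths, a client-side dry run can catch YAML mistakes without creating anything in the cluster. The sketch below is an optional extra step, not part of the original workflow; the helper name validate_manifest is hypothetical, while --dry-run=client is a standard kubectl apply option.

```shell
# Hypothetical helper: validate a manifest locally without creating resources.
validate_manifest() {
  local manifest="${1:-accelerate_blog_multinode.yaml}"
  kubectl apply --dry-run=client -f "$manifest"
}
```

Note that a client-side dry run does not check whether the cluster has enough free GPUs or whether the referenced secrets exist.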

Start the fine-tuning process by running the command below:

kubectl apply -f accelerate_blog_multinode.yaml 

If everything is set up correctly, you’ll see the following output:

service/sdxl-headless-svc created
configmap/sdxl-finetune-multinode-config created
job.batch/sdxl-finetune-multinode created

You can then monitor the fine-tuning progress via the wandb dashboard or other supported reporting and logging platforms.
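
You can also follow the job directly from the host machine. The sketch below assumes the job name sdxl-finetune-multinode from the output above and relies on the job-name label that Kubernetes attaches to a Job's pods; the helper names are hypothetical.

```shell
JOB_NAME="sdxl-finetune-multinode"

# List the pods created by the Job, with the node each one landed on.
list_job_pods() {
  kubectl get pods -l "job-name=${JOB_NAME}" -o wide
}

# Show the last 20 log lines of every pod, prefixed with the pod name.
tail_job_logs() {
  kubectl logs -l "job-name=${JOB_NAME}" --tail=20 --prefix
}
```

Expect the first minutes of the logs to show the dependency installation steps rather than training output.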

The accelerate_blog_multinode.yaml file is organized into three sections, Service, ConfigMap, and Job, separated by dashes (---). Breaking down the accelerate_blog_multinode.yaml file into these three sections reveals how Kubernetes orchestrates the different components necessary for the multinode fine-tuning of SDXL. This modular approach provides flexibility and simplifies the management of complex workloads, especially in a multinode environment.

  1. The Service section defines a headless service (clusterIP: None) that gives the job’s pods stable DNS names inside the cluster. It specifies the service name, the port used for communication, and the selector that matches the job’s pods. This setup lets the nodes find and reach each other during training.

    Service section of the YAML file:
    apiVersion: v1
    kind: Service
    metadata:
      name: sdxl-headless-svc
    spec:
      clusterIP: None
      ports:
      - port: 12342
        protocol: TCP
        targetPort: 12342
      selector:
        job-name: sdxl-finetune-multinode
    
  2. The ConfigMap section stores the key-value pairs used for inter-node communication: the headless service name, the job name, the master address and port, and the number of replicas. Keeping these settings here makes it easier to manage and update them without altering the container images or pod specifications.

    ConfigMap section of the YAML file:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: sdxl-finetune-multinode-config
    data:
      headless_svc: sdxl-headless-svc
      job_name: sdxl-finetune-multinode
      master_addr: sdxl-finetune-multinode-0.sdxl-headless-svc
      master_port: '12342'
      num_replicas: '3'
    
  3. The Job section describes how the fine-tuning process runs. This includes which container image to use, which resources to allocate to the job, which command to run within the container, and how the ConfigMap for the Accelerate settings is mounted into the container.

    Job section of the YAML file:
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: sdxl-finetune-multinode
    spec:
      backoffLimit: 0
      completions: 3
      parallelism: 3
      completionMode: Indexed
      template:
        metadata:
          labels:
            job: sdxl-multinode-job
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
            - name: accelerate-sdxl
              image: rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.1.2
              securityContext:
                privileged: true
                capabilities:
                  add: [ "IPC_LOCK" ]
              env:
              - name: HIP_VISIBLE_DEVICES
                value: "0,1,2,3,4,5,6,7"
              - name: HIP_FORCE_DEV_KERNARG
                value: "1"
              - name: GPU_MAX_HW_QUEUES
                value: "2"
              - name: USE_ROCMLINEAR
                value: "1"
              - name: NCCL_SOCKET_IFNAME
                value: "rdma0"
              - name: MASTER_ADDRESS
                valueFrom:
                  configMapKeyRef:
                    key: master_addr
                    name: sdxl-finetune-multinode-config
              - name: MASTER_PORT
                valueFrom:
                  configMapKeyRef:
                    key: master_port
                    name: sdxl-finetune-multinode-config
              - name: NCCL_IB_HCA
                value: "mlx5_0,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_7,mlx5_8,mlx5_9"
              - name: HEADLESS_SVC
                valueFrom:
                  configMapKeyRef:
                    key: headless_svc
                    name: sdxl-finetune-multinode-config
              - name: NNODES
                valueFrom:
                  configMapKeyRef:
                    key: num_replicas
                    name: sdxl-finetune-multinode-config
              - name: NODE_RANK
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
              - name: WANDB_API_KEY
                valueFrom:
                  secretKeyRef:
                    name: wandb-secret
                    key: WANDB_API_KEY
              - name: HF_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: HF_TOKEN
              volumeMounts:
                - mountPath: /mnt
                  name: model-weights-volume
                - mountPath: /etc/config
                  name: diffusers-config-volume
                - { mountPath: /dev/infiniband, name: devinf }
                - { mountPath: /dev/shm, name: shm }
              resources:
                requests:
                  amd.com/gpu: 8 
                limits:
                  amd.com/gpu: 8 
              command: ["/bin/bash", "-c", "--"]
              args:
                - |
                  # Clone the GitHub repo
                  git clone --recurse https://github.com/ROCm/bitsandbytes.git
                  cd bitsandbytes
                  git checkout rocm_enabled
                  # Install dependencies
                  pip install -r requirements-dev.txt
                  # Use -DBNB_ROCM_ARCH to specify target GPU arch
                  cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S .
                  make
                  pip install .
                  cd .. 
    
                  # Set up Hugging Face authentication using the secret
                  mkdir -p ~/.huggingface
                  echo $HF_TOKEN > ~/.huggingface/token
                  
                  pip install deepspeed==0.14.5 wandb
                  git clone https://github.com/huggingface/diffusers && 
                  cd diffusers && pip install -e . && cd examples/text_to_image &&
                  pip install -r requirements_sdxl.txt
                  
                  export EXP_DIR=./output
                  mkdir -p output
                  LOG_FILE="${EXP_DIR}/sdxl_$(date '+%Y-%m-%d_%H-%M-%S')_MI300_SDXL_FINETUNE.log"
                  export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
                  export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
                  export DATASET_NAME="lambdalabs/naruto-blip-captions"
    
                  export ACCELERATE_CONFIG_FILE="/etc/config/default_config_accelerate.yaml"
                  export HF_HOME=/mnt/huggingface
                  accelerate launch --config_file $ACCELERATE_CONFIG_FILE \
                    --main_process_ip $MASTER_ADDRESS \
                    --main_process_port $MASTER_PORT \
                    --machine_rank $NODE_RANK \
                    --num_processes $((8 * NNODES)) \
                    --num_machines $NNODES train_text_to_image_sdxl.py \
                    --pretrained_model_name_or_path=$MODEL_NAME \
                    --pretrained_vae_model_name_or_path=$VAE_NAME \
                    --dataset_name=$DATASET_NAME \
                    --resolution=512 --center_crop --random_flip \
                    --proportion_empty_prompts=0.1 \
                    --train_batch_size=12 \
                    --gradient_checkpointing \
                    --num_train_epochs=500 \
                    --use_8bit_adam \
                    --learning_rate=1e-04 --lr_scheduler="cosine" --lr_warmup_steps=200 \
                    --mixed_precision="fp16" \
                    --validation_prompt="a cute Sundar Pichai creature" --validation_epochs 20 \
                    --checkpointing_steps=1000 \
                    --report_to="wandb" \
                    --output_dir="sdxl-naruto-model" 2>&1 | tee "$LOG_FILE"
                  sleep 30m
          volumes:
            - name: model-weights-volume
              hostPath:
                path: /mnt/model_weights
                type: Directory
            - name: diffusers-config-volume
              configMap:
                name: accelerate-config
            - { name: devinf, hostPath: { path: /dev/infiniband }}
            - { name: shm, emptyDir: { medium: Memory, sizeLimit: 512Gi }}
          restartPolicy: Never
          subdomain: sdxl-headless-svc
    

Please note that you can change the num_replicas, completions, and parallelism values in the accelerate_blog_multinode.yaml file to any number between 1 and the total number of nodes in your cluster to specify how many nodes to use for the multinode workload. All three should be set to the same value. Currently they are set to 3, meaning fine-tuning runs on 3 nodes, each with 8 GPUs. You will also need to modify the configuration to match your actual infrastructure. For instance, if your nodes only have 4 GPUs each, change the number of GPUs requested per pod (resources -> requests -> amd.com/gpu and resources -> limits -> amd.com/gpu) to 4, and update HIP_VISIBLE_DEVICES and the 8 in --num_processes $((8 * NNODES)) accordingly so that the correct number of processes is launched.
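
Because the node count appears in three places, it is easy to update one and forget another. The following sketch checks that the three values agree and prints the process count that accelerate launch would receive. The helper name check_node_count is hypothetical, and it makes two simplifying assumptions: each key appears on its own line as in the manifest shown above, and the GPU count per node defaults to 8.

```shell
# Hypothetical helper: verify that num_replicas, completions, and parallelism
# agree in a manifest, then print the resulting process count.
check_node_count() {
  local file="$1" gpus_per_node="${2:-8}"
  local key value expected=""
  for key in num_replicas completions parallelism; do
    # Keep only the digits from the first line that sets this key.
    value=$(grep -m1 "${key}:" "$file" | tr -cd '0-9')
    if [ -z "$value" ]; then
      echo "missing ${key} in ${file}" >&2
      return 1
    fi
    if [ -n "$expected" ] && [ "$value" != "$expected" ]; then
      echo "mismatch: ${key}=${value}, expected ${expected}" >&2
      return 1
    fi
    expected="$value"
  done
  # Mirrors the $((8 * NNODES)) expression used in the Job section.
  echo "nodes=${expected} num_processes=$((gpus_per_node * expected))"
}
```

With the manifest as shown above, check_node_count accelerate_blog_multinode.yaml prints nodes=3 num_processes=24; for nodes with 4 GPUs each, pass 4 as the second argument.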

The sleep 30m command at the end of the Job section, under spec -> template -> spec -> containers -> args, keeps each pod running for an additional 30 minutes after the training command exits. This gives you time to review the results and troubleshoot any issues. You can adjust the sleep time or remove this command as needed.
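
One way to use that 30-minute window is to open a shell inside one of the job's pods and inspect the log file and output directory. This is a sketch under assumptions: the helper name open_job_shell is hypothetical, and it simply picks the first pod that kubectl returns for the job, which is not necessarily the rank 0 node.

```shell
# Hypothetical helper: open an interactive shell in one pod of the Job
# while it is still in its sleep window.
open_job_shell() {
  local pod
  # Look up the name of the first pod carrying the Job's label.
  pod=$(kubectl get pods -l job-name=sdxl-finetune-multinode \
        -o jsonpath='{.items[0].metadata.name}')
  if [ -z "$pod" ]; then
    echo "no pod found for the job" >&2
    return 1
  fi
  kubectl exec -it "$pod" -- /bin/bash
}
```

Once the sleep ends the pod exits, so copy anything you need to keep before then.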

Summary

In this blog post we showed you how to set up and fine-tune a generative AI model on Oracle Cloud Infrastructure’s (OCI) Kubernetes Engine (OKE) using a cluster of AMD GPUs. You can use this tutorial as a starting point and adjust the YAML file to reflect your own cluster and network resources and the specific needs of your particular task.

Acknowledgment

We want to thank the OCI team for helping us set up the multinode environment to implement the workload.

Disclaimers

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.