Posts by Douglas Jia
Multinode Fine-Tuning of Stable Diffusion XL on AMD GPUs with Hugging Face Accelerate and OCI’s Kubernetes Engine (OKE)
- 15 October 2024
As the scale and complexity of generative AI and deep learning models grow, multinode training, basically dividing a training job across several processors, has become an essential strategy to speed up training and fine-tuning processes of large generative AI models like SDXL. By distributing the training workload across multiple GPUs on multiple nodes, multinode setups can significantly accelerate the training process. In this blog post we will show you, step-by step, how to set-up and fine-tune a Stable Diffusion XL (SDXL) model in a multinode Oracle Cloud Infrastructure’s (OCI) Kubernetes Engine (OKE) on AMD GPUs using ROCm.
Leaner LLM Inference with INT8 Quantization on AMD GPUs using PyTorch
- 03 October 2024
With the scale of large language models (LLMs) reaching hundred of billions of parameters, the ways we represent data within these enormous models dramatically impacts the resources required to train them (e.g. the number of GPUs needed for inference). In our previous blogs (JAX mixed precision training; PyTorch AMP), we already demonstrated how mixed precision training can accelerate LLMs training process. In this blog post we will push things further and show you how quantization into an even lower precision data formats can speed up inference, saving time and memory, without sacrificing the overall performance of the model. Quantization is a technique where the precision of a model’s parameters is reduced from a 32-bit floating point (FP32) or a 16-bit floating point (FP16) to an 8-bit integer (INT8). Standard models typically use 32-bit floating-point (FP32) precision. However, this higher precision is not always necessary for inference tasks. By converting model weights and activations to lower precision formats like INT8 (8-bit integer), we can achieve faster computations and lower memory usage, effectively reducing the model size by three-fourths (from 32-bit) or half (from 16-bit) with only a slight accuracy reduction, which is often outweighed by the speed gains.
Optimize GPT Training: Enabling Mixed Precision Training in JAX using ROCm on AMD GPUs
- 06 September 2024
This blog builds on the nanoGPT model we discussed in A Guide to Implementing and Training Generative Pre-trained Transformers (GPT) in JAX on AMD GPUs. Here we will show you how to incorporate mixed precision training to the JAX-implemented nanoGPT model we discussed in our previous blog.
Using statistical methods to reliably compare algorithm performance in large generative AI models with JAX Profiler on AMD GPUs
- 22 July 2024
This blog provides a comprehensive guide on measuring and comparing the performance of various algorithms in a JAX-implemented generative AI model. Leveraging the JAX Profiler and statistical analysis, this blog demonstrates how to reliably evaluate key steps and compare algorithm performance on AMD GPUs.
A Guide to Implementing and Training Generative Pre-trained Transformers (GPT) in JAX on AMD GPUs
- 02 July 2024
2 July, 2024 by Douglas Jia.
Transforming Words into Motion: A Guide to Video Generation with AMD GPU
- 24 April 2024
24 Apr, 2024 by Douglas Jia.
Instruction fine-tuning of StarCoder with PEFT on multiple AMD GPUs
- 16 April 2024
16 Apr, 2024 by Douglas Jia.
GPU Unleashed: Training Reinforcement Learning Agents with Stable Baselines3 on an AMD GPU in Gymnasium Environment
- 11 April 2024
11 Apr, 2024 by Douglas Jia.
Efficient image generation with Stable Diffusion models and ONNX Runtime using AMD GPUs
- 23 February 2024
23 Feb, 2024 by Douglas Jia.
Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs
- 24 January 2024
24 Jan, 2024 by Douglas Jia.
Efficient image generation with Stable Diffusion models and AITemplate using AMD GPUs
- 24 January 2024
24 Jan, 2024 by Douglas Jia.
Efficient deployment of large language models with Text Generation Inference on AMD GPUs
- 24 January 2024
24 Jan, 2024 by Douglas Jia.