Posts by Clint Greene

Accelerating Large Language Models with Flash Attention on AMD GPUs

In this blog post, we guide you through installing Flash Attention on AMD GPUs and benchmark its performance against PyTorch's standard scaled dot-product attention (SDPA). We also measure end-to-end prefill latency for several Large Language Models (LLMs) from Hugging Face.
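For context on what is being benchmarked: Flash Attention and SDPA compute the same function, softmax(QKᵀ/√d)V — Flash Attention evaluates it in tiles without materializing the full score matrix. A minimal pure-Python sketch of that math (toy dimensions, nested lists instead of tensors, no PyTorch):

```python
import math

def sdpa(Q, K, V):
    """Reference scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])  # head dimension
    out = []
    for q in Q:
        # Scaled similarity of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output row is the attention-weighted mix of value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

A query that strongly matches the first key attends almost entirely to the first value row, e.g. `sdpa([[10.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]], [[1.0, 0.0], [0.0, 1.0]])` returns approximately `[[1.0, 0.0]]`.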

Read more ...


Inferencing with Mixtral 8x22B on AMD GPUs


Read more ...


Speech-to-Text on an AMD GPU with Whisper

Whisper is an advanced automatic speech recognition (ASR) system developed by OpenAI. It uses a straightforward encoder-decoder Transformer architecture: incoming audio is divided into 30-second segments, which are then fed into the encoder. The decoder can be prompted with special tokens that guide the model to perform tasks such as language identification, transcription, and translation.
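The 30-second segmentation the encoder expects can be sketched in plain Python. Whisper works on audio resampled to 16 kHz, and short final segments are zero-padded to the full window (the padding here is a simplified stand-in for the model's own pad/trim step):

```python
# Split a mono audio signal (a flat list of samples) into the fixed
# 30-second windows Whisper's encoder consumes.
SAMPLE_RATE = 16_000          # Whisper resamples input audio to 16 kHz
CHUNK_SECONDS = 30
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def chunk_audio(samples):
    """Yield 30-second segments, zero-padding the final partial segment."""
    for start in range(0, len(samples), CHUNK_SAMPLES):
        segment = samples[start:start + CHUNK_SAMPLES]
        if len(segment) < CHUNK_SAMPLES:
            segment = segment + [0.0] * (CHUNK_SAMPLES - len(segment))
        yield segment

# 45 seconds of audio -> two segments, the second one half-padded.
segments = list(chunk_audio([0.0] * (SAMPLE_RATE * 45)))
```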

Read more ...


Developing Triton Kernels on AMD GPUs

OpenAI has developed Triton, a powerful GPU-focused programming language and compiler that works seamlessly with AMD GPUs. Triton's goal is to enable AI engineers and scientists to write high-performance GPU code with minimal expertise. Triton kernels are fast because of their blocked program representation, which lets them be compiled into highly optimized binary code. Triton also uses Python for kernel development, making it both familiar and accessible, and kernels are compiled simply by applying the triton.jit Python decorator to the kernel function.
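That blocked program representation can be illustrated without a GPU. The pure-Python sketch below mirrors Triton's canonical vector-add tutorial kernel: each program instance (identified by `pid`) handles one `BLOCK_SIZE`-wide slice of the output, with comments showing the corresponding Triton calls. In real Triton the kernel would carry the `@triton.jit` decorator and the instances would run in parallel on the GPU:

```python
# Pure-Python simulation of Triton's blocked programming model,
# mirroring the vector-add tutorial kernel.
BLOCK_SIZE = 4

def add_kernel(x, y, out, n_elements, pid):   # @triton.jit would decorate the real kernel
    # offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
    for off in offsets:
        # mask = offsets < n_elements  -- guards the ragged final block
        if off < n_elements:
            # tl.store(out_ptr + offsets, tl.load(x_ptr + offsets, mask=mask)
            #          + tl.load(y_ptr + offsets, mask=mask), mask=mask)
            out[off] = x[off] + y[off]

def launch(x, y):
    n = len(x)
    out = [0.0] * n
    grid = (n + BLOCK_SIZE - 1) // BLOCK_SIZE   # triton.cdiv(n, BLOCK_SIZE)
    for pid in range(grid):                      # on a GPU these run in parallel
        add_kernel(x, y, out, n, pid)
    return out
```

Because each instance only touches its own block, the compiler can map blocks onto the GPU's memory hierarchy and vector units — the source of the performance the post discusses.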

Read more ...


Retrieval Augmented Generation (RAG) using LlamaIndex


Read more ...


Inferencing and serving with vLLM on AMD GPUs


Read more ...


Accelerating XGBoost with Dask using multiple AMD GPUs

XGBoost is an optimized library for distributed gradient boosting. It has become one of the leading machine learning libraries for regression and classification problems. For a deeper dive into how gradient boosting works, we recommend reading Introduction to Boosted Trees.
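As a taste of the idea behind gradient boosting (a toy illustration, not XGBoost's regularized tree learner): each round fits a weak learner — here a depth-1 decision stump — to the current residuals, then adds a damped copy of it to the ensemble.

```python
# Toy gradient boosting for squared error on a single feature.

def fit_stump(xs, residuals):
    """Return the one-split stump minimizing squared error on the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left) if left else 0.0
        rmean = sum(right) / len(right) if right else 0.0
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(xs, ys, rounds=30, lr=0.5):
    """Fit an additive model of stumps; return predictions on the training points."""
    pred = [0.0] * len(ys)
    for _ in range(rounds):
        # For squared error, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return pred
```

Because each stump chips away at what the previous rounds got wrong, the ensemble converges toward the targets — for example, `gradient_boost([0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 1.0, 1.0])` returns predictions very close to `[0, 0, 1, 1]`.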

Read more ...