Posts by Jason Furmanek

OpenXLA and JAX - ROCm Support and the State of CI

29 June 2026

The OpenXLA compiler stack (XLA at the foundation, JAX as the front end) now runs upstream on AMD ROCm. XLA gates every pull request on real AMD Instinct silicon through its GitHub Actions workflow, side by side with the CUDA path; JAX tests every ROCm PR on the same hardware through its own workflows, with the merge gate rolling out next. Installing with pip install "jax[rocm7-local]" is a first-class entry point. This post documents how the ROCm backend is structured, what landed in the last twelve months, and how the CI pipeline that keeps it healthy is wired together. Part 1 covers OpenXLA on AMD: the XLA backend, what landed this year, and CI. Part 2 covers JAX on AMD: the plugin architecture, JAX-side changes, and the four-workflow test matrix.

From Naive to Near-Peak: Building High-Performance GEMM Kernels with Gluon

22 May 2026

On a single MI355, our most-optimized FP16 GEMM kernel runs at 99% MFMA efficiency: the matrix engine sits idle for only a handful of cycles per loop. Getting there took ten versions, a regression along the way, and a profiler open the whole time. This post is a tour of that path, from a 520 TFLOPS naive baseline to a 1489 TFLOPS near-peak kernel (~3× speedup), then the same design carried forward to BF8 (3257 TFLOPS, 99.72%) and MXFP4 (5255 TFLOPS, 92.41%) for low-precision AI workloads.
