Posts by Andy Ye

Dropless MoE Training in JAX with Primus-Turbo

Mixture-of-Experts (MoE) models have become a standard way to scale a transformer’s parameter count without paying the full compute bill — but training them efficiently on GPUs forces an uncomfortable trade-off. The default path in JAX/MaxText keeps every expert’s tensors at a fixed shape and simply drops the tokens that overflow each expert’s capacity, trading model quality for speed. The fully dropless alternative keeps every token, but in pure JAX it hits a memory wall that makes it impractical at production scale.

Read more ...


MaxText-Slurm: Production-Grade LLM Training with Built-In Observability

Training large language models (LLMs) at scale on GPU clusters is not just a compute problem — it is an operations problem. Launching multi-node distributed training, keeping it running reliably, and diagnosing failures when they happen all require tooling that most training frameworks do not provide. MaxText-Slurm is an open-source launch system and observability stack that bridges this gap for MaxText on AMD Instinct GPU clusters managed by Slurm.

Read more ...


Optimizing LLM Workloads: AMD Instinct MI355X GPUs Drive Competitive Performance

AI training workloads are pushing the limits of modern GPU architectures. With the release of AMD ROCm™ 7.0 software, AMD is raising the bar for high-performance training by delivering optimized support for LLM workloads across the JAX and PyTorch frameworks. The latest v25.9 Training Dockers demonstrate exceptional scaling efficiency for both single-node and multi-node setups, empowering researchers and developers to push model sizes and complexity further than ever.

Read more ...