Posts by Liying Li

Dropless MoE Training in JAX with Primus-Turbo

Mixture-of-Experts (MoE) models have become a standard way to scale a transformer’s parameter count without paying the full compute bill — but training them efficiently on GPUs forces an uncomfortable trade-off. The default path in JAX/MaxText keeps every expert’s tensors at a fixed shape and simply drops the tokens that overflow each expert’s capacity, trading model quality for speed. The fully dropless alternative keeps every token, but in pure JAX it hits a memory wall that makes it impractical at production scale.

Read more ...


MoE Training Best Practices on AMD GPUs

This blog covers best practices for training Mixture-of-Experts (MoE) models on AMD Instinct™ MI300/MI355-series GPUs with the ROCm ecosystem. Whether you’re new to MoE distributed architectures or optimizing trillion-parameter models, this guide will help you identify bottlenecks and maximize efficiency on AMD hardware.

Read more ...