Efficient MoE training on AMD ROCm: How to use Megablocks on AMD GPUs
- 23 March 2025
Training massive deep-learning models requires a balance of efficiency and scalability. In the context of the Transformer architecture, Mixture of Experts (MoE) models are large-scale machine learning architectures characterized by dividing tasks among multiple specialized sub-networks, or “experts”. A gating network determines the expert to which a given input should be routed, enabling the model to handle complex tasks more efficiently by drawing on the specialized capabilities of each expert. This dynamic routing mechanism allows MoE models to scale efficiently, activating only a subset of the network for each input and thereby reducing computational load while maintaining high model capacity.
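To make the routing idea concrete, the sketch below shows a minimal top-k gated MoE layer in plain PyTorch. The class name, shapes, and top-k choice are illustrative assumptions, and the dense per-expert loop deliberately omits the block-sparse kernels that Megablocks adds on top; it is a conceptual sketch, not the library's implementation.

```python
# Minimal, illustrative top-k gated MoE layer in PyTorch.
# All names and hyperparameters here are hypothetical; this is a conceptual
# sketch of expert routing, not the block-sparse path used by Megablocks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Gating network: scores each token against every expert.
        self.gate = nn.Linear(d_model, num_experts)
        # Specialized sub-networks ("experts"): simple feed-forward blocks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), flattened to individual tokens for routing.
        tokens = x.reshape(-1, x.shape[-1])
        # Router probabilities over experts for each token.
        probs = F.softmax(self.gate(tokens), dim=-1)
        # Keep only the top-k experts per token; the rest stay inactive,
        # which is what keeps per-token compute roughly constant as experts scale.
        weights, indices = probs.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            # Tokens routed to this expert in any of their top-k slots.
            mask = indices == expert_id
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot, None] * expert(tokens[token_ids])
        return out.reshape_as(x)


# Example usage (hypothetical shapes):
# layer = SimpleMoE(d_model=512, d_ff=2048, num_experts=8, top_k=2)
# y = layer(torch.randn(4, 128, 512))
```

The per-expert Python loop above is easy to read but inefficient; the point of Megablocks is to replace this kind of gather/scatter routing with block-sparse matrix operations so that variable numbers of tokens per expert can be processed without padding or dropping tokens.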