<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <id>https://rocm.blogs.amd.com/</id>
  <title>AMD ROCm Blogs</title>
  <updated>2026-03-09T15:59:09.713320+00:00</updated>
  <link href="https://rocm.blogs.amd.com/"/>
  <link href="https://rocm.blogs.amd.com/blog/atom.xml" rel="self"/>
  <generator uri="https://ablog.readthedocs.io/" version="0.11.12">ABlog</generator>
  <entry>
    <id>https://rocm.blogs.amd.com/artificial-intelligence/comfyui-radeon-9000/README.html</id>
    <title>Getting Started with ComfyUI on AMD Radeon™ RX 9000 Series GPUs</title>
    <updated>2026-03-09T00:00:00+00:00</updated>
    <author>
      <name>George Wang</name>
    </author>
    <content type="html">&lt;p class="ablog-post-excerpt"&gt;&lt;p&gt;ComfyUI has become a widely adopted and versatile node-based interface for Stable Diffusion and other generative AI models, gaining significant traction within the AI content creation community. Unlike traditional web-based interfaces, ComfyUI provides a node-based workflow system that gives users complete control over their image and video generation pipelines. Its modular architecture allows for complex workflows involving multiple models, LoRAs, ControlNets, and custom processing steps.&lt;/p&gt;
&lt;/p&gt;
</content>
    <link href="https://rocm.blogs.amd.com/artificial-intelligence/comfyui-radeon-9000/README.html"/>
    <summary>ComfyUI has become a widely adopted and versatile node-based interface for Stable Diffusion and other generative AI models, gaining significant traction within the AI content creation community. Unlike traditional web-based interfaces, ComfyUI provides a node-based workflow system that gives users complete control over their image and video generation pipelines. Its modular architecture allows for complex workflows involving multiple models, LoRAs, ControlNets, and custom processing steps.</summary>
    <category term="DiffusionModel" label="Diffusion Model"/>
    <category term="GenAI" label="GenAI"/>
    <category term="Installation" label="Installation"/>
    <category term="Memory" label="Memory"/>
    <category term="Optimization" label="Optimization"/>
    <category term="Performance" label="Performance"/>
    <category term="PyTorch" label="PyTorch"/>
    <published>2026-03-09T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://rocm.blogs.amd.com/software-tools-optimization/maxtext-slurm-agentic-diagnosis/README.html</id>
    <title>Agentic Diagnosis for LLM Training at Scale</title>
    <updated>2026-03-09T00:00:00+00:00</updated>
    <author>
      <name>Zhenyu Gu</name>
    </author>
    <content type="html">&lt;p class="ablog-post-excerpt"&gt;&lt;p&gt;In &lt;a class="reference external" href="https://rocm.blogs.amd.com/software-tools-optimization/maxtext-slurm/README.html"&gt;MaxText-Slurm: Production-Grade LLM Training with Built-In Observability&lt;/a&gt;, we introduced &lt;a class="reference external" href="https://github.com/AMD-AGI/maxtext-slurm"&gt;MaxText-Slurm&lt;/a&gt; — an open-source launch system and observability stack for running MaxText LLM training on AMD Instinct GPU clusters. We showed how a unified Prometheus time-series database (TSDB) collects GPU, host, network, and training metrics into a single queryable store, persisted to disk so that no data is lost even if the job crashes.&lt;/p&gt;
&lt;/p&gt;
</content>
    <link href="https://rocm.blogs.amd.com/software-tools-optimization/maxtext-slurm-agentic-diagnosis/README.html"/>
    <summary>In MaxText-Slurm: Production-Grade LLM Training with Built-In Observability, we introduced MaxText-Slurm — an open-source launch system and observability stack for running MaxText LLM training on AMD Instinct GPU clusters. We showed how a unified Prometheus time-series database (TSDB) collects GPU, host, network, and training metrics into a single queryable store, persisted to disk so that no data is lost even if the job crashes.</summary>
    <category term="AI/ML" label="AI/ML"/>
    <category term="AgenticWorkflows" label="Agentic Workflows"/>
    <category term="HPC" label="HPC"/>
    <category term="IncidentDiagnosis" label="Incident Diagnosis"/>
    <category term="JAX" label="JAX"/>
    <category term="LLM" label="LLM"/>
    <category term="Observability" label="Observability"/>
    <published>2026-03-09T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://rocm.blogs.amd.com/artificial-intelligence/hpc-agent-profile/README.html</id>
    <title>HPC Coding Agent - Part 3: MCP Tool for Profiling</title>
    <updated>2026-03-06T00:00:00+00:00</updated>
    <author>
      <name>Arttu Niemela</name>
    </author>
    <content type="html">&lt;p class="ablog-post-excerpt"&gt;&lt;p&gt;In this blog, we build an AI agent specialized in profiling and optimizing GPU-accelerated applications within High-Performance Computing (HPC) environments. Using open-source tools, we create a state-of-the-art agent and enhance its profiling capabilities through a custom Model Context Protocol (MCP) server. This server provides the agent with tools to leverage AMD’s profiling utilities for analyzing application performance on AMD GPUs.&lt;/p&gt;
&lt;/p&gt;
</content>
    <link href="https://rocm.blogs.amd.com/artificial-intelligence/hpc-agent-profile/README.html"/>
    <summary>In this blog, we build an AI agent specialized in profiling and optimizing GPU-accelerated applications within High-Performance Computing (HPC) environments. Using open-source tools, we create a state-of-the-art agent and enhance its profiling capabilities through a custom Model Context Protocol (MCP) server. This server provides the agent with tools to leverage AMD’s profiling utilities for analyzing application performance on AMD GPUs.</summary>
    <category term="AI/ML" label="AI/ML"/>
    <category term="HPC" label="HPC"/>
    <category term="Optimization" label="Optimization"/>
    <category term="Profiling" label="Profiling"/>
    <published>2026-03-06T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://rocm.blogs.amd.com/artificial-intelligence/walrus-finetuning/README.html</id>
    <title>Fine-Tuning AI Surrogate Models for Physics Simulations with Walrus on AMD Instinct GPU Accelerators</title>
    <updated>2026-03-06T00:00:00+00:00</updated>
    <author>
      <name>Luka Tsabadze</name>
    </author>
    <content type="html">&lt;p class="ablog-post-excerpt"&gt;&lt;p&gt;Physics simulations are used for studying complex systems and are essential where experiments are difficult, expensive, or impossible. In our context, a simulation means numerically solving mathematical equations that are believed to describe a physical system and evolving them forward in time on a computer. They enable controlled exploration of physical behavior for science and engineering, but at a high computational cost, which in most cases increases rapidly with scale. Our focus is on continuum dynamics, where the system is represented by fields such as density, velocity, or temperature, defined on a grid and evolving over time. High-resolution physics simulations are slow to run, sensitive to numerical error and impractical for large parameter spaces. Surrogate models address these limitations by learning to approximate simulation dynamics directly from data. Once trained, they can produce fast predictions at a fraction of the cost, giving researchers the ability to rapidly explore parameter space and generate long rollouts.&lt;/p&gt;
&lt;/p&gt;
</content>
    <link href="https://rocm.blogs.amd.com/artificial-intelligence/walrus-finetuning/README.html"/>
    <summary>Physics simulations are used for studying complex systems and are essential where experiments are difficult, expensive, or impossible. In our context, a simulation means numerically solving mathematical equations that are believed to describe a physical system and evolving them forward in time on a computer. They enable controlled exploration of physical behavior for science and engineering, but at a high computational cost, which in most cases increases rapidly with scale. Our focus is on continuum dynamics, where the system is represented by fields such as density, velocity, or temperature, defined on a grid and evolving over time. High-resolution physics simulations are slow to run, sensitive to numerical error, and impractical for large parameter spaces. Surrogate models address these limitations by learning to approximate simulation dynamics directly from data. Once trained, they can produce fast predictions at a fraction of the cost, giving researchers the ability to rapidly explore parameter space and generate long rollouts.</summary>
    <category term="AI/ML" label="AI/ML"/>
    <category term="Fine-Tuning" label="Fine-Tuning"/>
    <category term="PyTorch" label="PyTorch"/>
    <category term="ScientificComputing" label="Scientific Computing"/>
    <published>2026-03-06T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://rocm.blogs.amd.com/artificial-intelligence/stormcast-ensembles/README.html</id>
    <title>Ensemble High-Resolution Weather Forecasting on AMD Instinct GPU Accelerators</title>
    <updated>2026-03-06T00:00:00+00:00</updated>
    <author>
      <name>Pauli Pihajoki</name>
    </author>
    <content type="html">&lt;p class="ablog-post-excerpt"&gt;&lt;p&gt;Weather prediction is fraught with uncertainty, as is the inference of any real-world
phenomenon dependent on physical observations. The consequence is that any
estimated current state of the atmosphere as well as any forecast both carry a
level of uncertainty. As such, any weather forecasting model, whether AI or
traditional, needs to produce reasonable outputs despite the inherent
uncertainty of inputs, and, if possible, quantify the uncertainty of the outputs
for the user in some practical fashion.&lt;/p&gt;
&lt;/p&gt;
</content>
    <link href="https://rocm.blogs.amd.com/artificial-intelligence/stormcast-ensembles/README.html"/>
    <summary>Weather prediction is fraught with uncertainty, as is the inference of any real-world
phenomenon dependent on physical observations. The consequence is that any
estimated current state of the atmosphere as well as any forecast both carry a
level of uncertainty. As such, any weather forecasting model, whether AI or
traditional, needs to produce reasonable outputs despite the inherent
uncertainty of inputs, and, if possible, quantify the uncertainty of the outputs
for the user in some practical fashion.</summary>
    <category term="AI/ML" label="AI/ML"/>
    <category term="DiffusionModel" label="Diffusion Model"/>
    <category term="GenAI" label="GenAI"/>
    <category term="ScientificComputing" label="Scientific Computing"/>
    <published>2026-03-06T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://rocm.blogs.amd.com/artificial-intelligence/hpc-agent-openevolve/README.html</id>
    <title>HPC Coding Agent - Part 2: An MCP Tool for Code Optimization with OpenEvolve</title>
    <updated>2026-03-04T00:00:00+00:00</updated>
    <author>
      <name>Johanna Yang</name>
    </author>
    <content type="html">&lt;p class="ablog-post-excerpt"&gt;&lt;p&gt;Large language models (LLMs) and LLM-driven agents (AI agents) are already trained on a massive amount of data where a considerable portion consists of code, and both models and agentic coding services are developed specifically for the purpose of coding. For users who want to optimize their code for certain purposes, for example runtime or memory efficiency, LLMs may produce plausible solutions, but these are often not optimal.&lt;/p&gt;
&lt;/p&gt;
</content>
    <link href="https://rocm.blogs.amd.com/artificial-intelligence/hpc-agent-openevolve/README.html"/>
    <summary>Large language models (LLMs) and LLM-driven agents (AI agents) are already trained on massive amounts of data, a considerable portion of which consists of code, and both models and agentic coding services are developed specifically for coding. For users who want to optimize their code for specific goals, such as runtime or memory efficiency, LLMs may produce plausible solutions, but these are often not optimal.</summary>
    <category term="AgenticCoding" label="Agentic Coding"/>
    <category term="Agents" label="Agents"/>
    <category term="GenAI" label="GenAI"/>
    <category term="LLM" label="LLM"/>
    <category term="OpenEvolve" label="OpenEvolve"/>
    <category term="Serving" label="Serving"/>
    <published>2026-03-04T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://rocm.blogs.amd.com/artificial-intelligence/recsys-training-docker/README.html</id>
    <title>Streamlining Recommendation Model Training on AMD Instinct™ GPUs</title>
    <updated>2026-03-02T00:00:00+00:00</updated>
    <author>
      <name>Steve Reinhardt</name>
    </author>
    <content type="html">&lt;p class="ablog-post-excerpt"&gt;&lt;p&gt;Recommendation model training and inference workloads represent a
significant portion of computational requirements across industries
including e-commerce, social media, and content streaming platforms.
Unlike LLMs, recommendation models result in complex and often imbalanced
communication across GPUs, along with a higher load on the CPU-GPU
interconnect. The &lt;a class="reference external" href="https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.html?model=pyt_train_dlrm"&gt;ROCm training
docker&lt;/a&gt; &lt;a class="reference internal" href="#ref1"&gt;&lt;span class="xref myst"&gt;[1]&lt;/span&gt;&lt;/a&gt;
now includes essential libraries for recommendation model training. This
blog demonstrates the functionality and ease of training recommendation
models using ROCm, along with suggestions for improved configuration of
these workloads. We also highlight the inherent benefits of the large
HBM size on AMD Instinct™ GPUs for recommendation workloads.&lt;/p&gt;
&lt;/p&gt;
</content>
    <link href="https://rocm.blogs.amd.com/artificial-intelligence/recsys-training-docker/README.html"/>
    <summary>Recommendation model training and inference workloads represent a
significant portion of computational requirements across industries
including e-commerce, social media, and content streaming platforms.
Unlike LLMs, recommendation models result in complex and often imbalanced
communication across GPUs, along with a higher load on the CPU-GPU
interconnect. The ROCm training
docker [1]
now includes essential libraries for recommendation model training. This
blog demonstrates the functionality and ease of training recommendation
models using ROCm, along with suggestions for improved configuration of
these workloads. We also highlight the inherent benefits of the large
HBM size on AMD Instinct™ GPUs for recommendation workloads.</summary>
    <category term="AI/ML" label="AI/ML"/>
    <category term="RecommendationSystems" label="Recommendation Systems"/>
    <published>2026-03-02T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://rocm.blogs.amd.com/software-tools-optimization/maxtext-slurm/README.html</id>
    <title>MaxText-Slurm: Production-Grade LLM Training with Built-In Observability</title>
    <updated>2026-03-02T00:00:00+00:00</updated>
    <author>
      <name>Zhenyu Gu</name>
    </author>
    <content type="html">&lt;p class="ablog-post-excerpt"&gt;&lt;p&gt;Training large language models (LLMs) at scale on GPU clusters is not just a compute problem — it is an operations problem. Launching multi-node distributed training, keeping it running reliably, and diagnosing failures when they happen all require tooling that most training frameworks do not provide. &lt;a class="reference external" href="https://github.com/AMD-AGI/maxtext-slurm"&gt;MaxText-Slurm&lt;/a&gt; is an open-source launch system and observability stack that bridges this gap for &lt;a class="reference external" href="https://github.com/AI-Hypercomputer/maxtext"&gt;MaxText&lt;/a&gt; on AMD Instinct GPU clusters managed by Slurm.&lt;/p&gt;
&lt;/p&gt;
</content>
    <link href="https://rocm.blogs.amd.com/software-tools-optimization/maxtext-slurm/README.html"/>
    <summary>Training large language models (LLMs) at scale on GPU clusters is not just a compute problem — it is an operations problem. Launching multi-node distributed training, keeping it running reliably, and diagnosing failures when they happen all require tooling that most training frameworks do not provide. MaxText-Slurm is an open-source launch system and observability stack that bridges this gap for MaxText on AMD Instinct GPU clusters managed by Slurm.</summary>
    <category term="AI/ML" label="AI/ML"/>
    <category term="HPC" label="HPC"/>
    <category term="JAX" label="JAX"/>
    <category term="LLM" label="LLM"/>
    <category term="Performance" label="Performance"/>
    <category term="Profiling" label="Profiling"/>
    <published>2026-03-02T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://rocm.blogs.amd.com/artificial-intelligence/rocm7-ray/README.html</id>
    <title>Exploring Use Cases for Scalable AI: Implementing Ray with ROCm 7 Support for Efficient ML Workflows</title>
    <updated>2026-02-27T00:00:00+00:00</updated>
    <author>
      <name>Vish Vadlamani</name>
    </author>
    <content type="html">&lt;p class="ablog-post-excerpt"&gt;&lt;p&gt;This blog builds on insights from our &lt;a class="reference external" href="https://rocm.blogs.amd.com/artificial-intelligence/rocm-ray/README.html"&gt;previous blog post&lt;/a&gt;, which introduced Ray 2.48.0.post0 running on ROCm 6.2 and demonstrated Reinforcement Learning from Human Feedback (RLHF) with verl 0.3.0.post0 and vLLM 0.6.4 on AMD GPUs. In this follow‑up, we introduce Ray 2.51.1 with ROCm 7.0.0, verl 0.6.0, and vLLM 0.11.0.dev, highlighting the new performance benefits and capabilities for large‑scale RLHF workloads.&lt;/p&gt;
&lt;/p&gt;
</content>
    <link href="https://rocm.blogs.amd.com/artificial-intelligence/rocm7-ray/README.html"/>
    <summary>This blog builds on insights from our previous blog post, which introduced Ray 2.48.0.post0 running on ROCm 6.2 and demonstrated Reinforcement Learning from Human Feedback (RLHF) with verl 0.3.0.post0 and vLLM 0.6.4 on AMD GPUs. In this follow‑up, we introduce Ray 2.51.1 with ROCm 7.0.0, verl 0.6.0, and vLLM 0.11.0.dev, highlighting the new performance benefits and capabilities for large‑scale RLHF workloads.</summary>
    <category term="AI/ML" label="AI/ML"/>
    <category term="Fine-Tuning" label="Fine-Tuning"/>
    <category term="GenAI" label="GenAI"/>
    <category term="ReinforcementLearning" label="Reinforcement Learning"/>
    <category term="Serving" label="Serving"/>
    <published>2026-02-27T00:00:00+00:00</published>
  </entry>
  <entry>
    <id>https://rocm.blogs.amd.com/artificial-intelligence/pytorch-tunableop-offline/README.html</id>
    <title>PyTorch Offline Tuning with TunableOp</title>
    <updated>2026-02-24T00:00:00+00:00</updated>
    <author>
      <name>Jin Zhou</name>
    </author>
    <content type="html">&lt;p class="ablog-post-excerpt"&gt;&lt;p&gt;In an &lt;a class="reference external" href="https://rocm.blogs.amd.com/artificial-intelligence/pytorch-tunableop/README.html"&gt;earlier blog post,&lt;/a&gt; we explored how PyTorch TunableOp can &lt;em&gt;potentially&lt;/em&gt; accelerate models through &lt;strong&gt;online tuning&lt;/strong&gt; - where during model execution, PyTorch benchmarks and selects optimal BLAS kernels. While online tuning is effective, it introduces overhead due to the time needed to execute the ML model from end-to-end. If this is done once, the overhead may be acceptable, but for repeated tuning it may be cost-prohibitive to keep re-running the model.&lt;/p&gt;
&lt;/p&gt;
</content>
    <link href="https://rocm.blogs.amd.com/artificial-intelligence/pytorch-tunableop-offline/README.html"/>
    <summary>In an earlier blog post, we explored how PyTorch TunableOp can potentially accelerate models through online tuning, where PyTorch benchmarks and selects optimal BLAS kernels during model execution. While online tuning is effective, it introduces overhead due to the time needed to execute the ML model end-to-end. If tuning is done once, the overhead may be acceptable, but for repeated tuning it may be cost-prohibitive to keep re-running the model.</summary>
    <category term="AI/ML" label="AI/ML"/>
    <category term="LinearAlgebra" label="Linear Algebra"/>
    <category term="Optimization" label="Optimization"/>
    <category term="Performance" label="Performance"/>
    <category term="PyTorch" label="PyTorch"/>
    <published>2026-02-24T00:00:00+00:00</published>
  </entry>
</feed>
