Introduction to profiling tools for AMD hardware#

Introduction to profiling tools for AMD hardware
April 10, 2026 by Gina Sitaraman, Thomas Gibson, Noah Wolfe.
7 min read. | 1637 total words.

Getting a code to be functionally correct is not always enough. In many industries, it is also required that applications and their complex software stack run as efficiently as possible to meet operational demands. This is particularly challenging as hardware continues to evolve over time, and as a result codes may require further tuning. In practice, many application developers construct benchmarks, which are carefully designed to measure the performance, such as execution time, of a particular code within an operational-like setting. In other words: a good benchmark should be representative of the real work that needs to be done. These benchmarks are useful in that they provide insight into the characteristics of the application, and enable one to discover potential bottlenecks that could result in performance degradation during operational settings.

At face value, benchmarking sounds simple enough and is often interpreted as simply a comparison of execution time on a variety of different machines. However, in order to extract the most performance from emerging hardware, the program must be tuned many times and requires more than measuring raw execution time: one needs to know where the program is spending most of its time and whether further improvements can be made. Heterogeneous systems, where programs run on both CPUs and GPUs, introduce additional complexities. Understanding the critical path and kernel execution is all the more important. Thus, performance tuning is a necessary component in the benchmarking process.

With AMD’s profiling tools, developers are able to gain important insight into how efficiently their application is utilizing hardware and effectively diagnose potential bottlenecks contributing to poor performance. Developers targeting AMD GPUs have multiple tools available depending on their specific profiling needs. This post serves as an introduction to the various profiling tools offered by AMD and why a developer might leverage one over the other. This post covers everything from low-level profiling tools to extensive profiling suites.

In this introductory blog, we briefly describe the following tools that can aid in application analysis:

  1. rocprofiler-sdk and rocprofv3

  2. rocprof-sys

  3. rocprof-compute

  4. Radeon™ GPU Profiler

  5. AMD uProf

  6. Other Third Party Tools

Terminology#

The following terms are used in this blog post:

Term

Description

AMD “Zen” Core

AMD’s x86-64 processor core architecture design. Used by the AMD EPYC™ server processors and AMD Ryzen™, AMD Ryzen™ PRO, and AMD Threadripper™ PRO processor series.

RDNA™

AMD’s GPU architecture optimized for graphically demanding workloads in gaming, visualization, and professional applications. Includes the AMD Radeon™ PRO W6000, W7000 series and Radeon™ AI PRO R9000 series (RDNA 3 and RDNA 4 architectures).

CDNA™

AMD’s Compute dedicated GPU architecture optimized for accelerating HPC, ML/AI, and data center type workloads. Includes the AMD Instinct™ MI100, MI200, MI300, and MI350 series accelerators.

HIP

A C++ Runtime API and kernel language that allows developers to create portable compute kernels/applications for AMD and NVIDIA GPUs from a single source code

Timeline Trace

A profiling approach where duration of compute kernels and data transfers between devices are collected and visualized

Roofline Analysis

Hardware agnostic methodology for quantifying a workload’s ability to saturate the given compute architecture in terms of floating-point compute and memory bandwidth

Hardware Counters

Individual metrics which track how many times a certain event occurs in the hardware, such as bytes moved from L2 cache or a 32 bit floating point add performed

What tools to use?#

The first step in profiling is determining the right tool for the job. Whether one wants to collect traces on the CPU, GPU, or both, understand kernel behavior, or assess memory access patterns, performing such an analysis might appear daunting for new users of AMD hardware. We begin by identifying the architecture and operating systems supported by each of the profiling tools provided by AMD. Almost all the tools in Table 1 support Linux® distros and with the gaining popularity of Instinct™ GPUs, every tool has some capability to profile codes running on CDNA™ architecture. However, those who prefer Windows will be limited to using AMD uProf to profile CPU and GPU codes targeting AMD “Zen”-based processors and AMD Instinct™ GPUs, and Radeon™ GPU Profiler that can provide great insights to optimize applications’ use of the graphics pipeline (rasterization, shaders, etc.) on RDNA™-based GPUs.

AMD Profiling Tools

AMD “Zen” Core

RDNA™

CDNA™

Windows

Linux®

rocprofv3 / rocprofiler-sdk

Not supported

Not supported

rocprof-sys

Not supported

rocprof-compute

Not supported

Not supported

Not supported

Radeon™ GPU Profiler

Not supported

AMD uProf

Not supported

★ Full support | ☆ Partial support

Table 1: Profiler/architecture support and operating system needs.

The final choice of the tool on any platform depends on the profiling objective and the kind of analysis required. To make it simpler, we encourage the users to think of their objectives in terms of three questions as depicted in the flow diagram in Figure 1:

  1. Where should I focus my time? : Whether benchmarking a new application, or getting started with a new software package that has not yet been profiled, it is recommended to first identify hot spots in the application that may benefit from quick optimization. In such a scenario, it is best if users start by collecting timelines and traces of their application. On Linux® platforms, rocprof-sys enables the collection of CPU and GPU traces, and call stack samples to help identify major hot spots. For GPU-only traces, rocprofv3 provides a lightweight and powerful alternative. However, on Windows, one may have to choose between AMD uProf and Radeon™ GPU Profiler depending on the targeted architecture.

  2. How well am I using the hardware?: The first step is to obtain a characterization of workloads that can provide a glimpse into how well the hardware is being utilized. For example, identifying what parts of your application are memory or compute bound. This can be accomplished through roofline profiling. Typically, hot spots are well understood and interest is usually in identifying the performance of a few key kernels or subroutines. At present, roofline profiling is only available through rocprof-compute on AMD Instinct™ GPUs and AMD uProf on AMD “Zen”-based processors.

  3. Why am I seeing this performance?: Once hot spots are identified and the initial assessment of performance is completed, the next phase likely involves profiling and collecting the hardware metrics to understand where the observed performance is coming from. On AMD GPUs, tools such as rocprofv3, rocprof-sys, rocprof-compute, and AMD uProf use APIs from the rocprofiler-sdk library to gather GPU metrics. On Windows systems, one will have to rely on using either AMD uProf or Radeon™ GPU Profiler.

Quick Tip: The ROCm profiling tools (rocprofv3, rocprof-compute, and rocprof-sys), available on Linux® platforms, provide an easy-to-use interface for studying performance of the code across AMD hardware and should be treated as “go-to” profiling tools for performance tuning and benchmarking.

../../_images/when-to-use-diagram.png

Figure 1: Use cases for a variety of AMD profiling tools.


Overview of profiling tools#

In this section, we provide a brief overview of the above-mentioned AMD tools and some third-party toolkits.

rocprofiler-sdk and rocprofv3#

The rocprofiler-sdk library is the foundation of AMD GPU profiling and tracing infrastructure, shipped with ROCm™. It provides the low-level API for device activity tracing and hardware performance counter collection. This library replaces the functionality of the legacy rocprofiler and roctracer libraries.

rocprofv3 is the command-line tool built on rocprofiler-sdk and replaces the legacy rocprof and rocprofv2 tools. Some of the most useful features of rocprofv3 are:

  • Collect a variety of traces for ROCm-based applications (HIP API, HSA API, offloaded kernels, memory copies, scratch memory, marker API, etc.)

  • Collect device hardware counters for GPU kernel performance analysis

  • Find GPU hot spots quickly

  • Profile Python workloads efficiently

Starting in ROCm 7.0, rocprofv3 defaults to writing profiling data to a SQLite3-based database, and a new companion tool called rocpd is available to generate CSV, OTF2, and/or Perfetto trace files from the collected database or generate summaries of profiled regions. This “profile once, analyze many times” workflow enables efficient post-processing without re-running the application.

We strongly urge you to use rocprofv3 as the legacy rocprof tool will be deprecated in future ROCm releases.

For more details, see the rocprofv3 documentation and the rocprofiler-sdk library documentation.

rocprof-sys#

The ROCm systems profiler, also known as rocprof-sys, is best suited to collect host, device, and communication (MPI) activity in one comprehensive, unified trace of your application’s run. rocprof-sys uses libraries that tools like amd-smi and perf are built upon to provide a holistic view of the system where your application ran.

With configurable runtime options, you can use it for call-stack sampling, binary instrumentation, causal profiling, and hardware counter collection. Simply put, this is the tool you would use if you want to understand what is happening in your application on the host and on the device. The output trace is in protobuf format (.proto) for easy visualization using Perfetto UI in a browser.

rocprof-sys has evolved from the former Omnitrace research tool from AMD. It now supports the rocprofiler-sdk library, bringing advanced capabilities such as OMPT support for tracing Fortran codes with OpenMP® offload, writing to the rocpd output format, and network performance profiling.

When analyzing the performance of an application, it is always best to NOT assume you know where the performance bottlenecks are and why they are happening. rocprof-sys is the ideal tool for characterizing where optimization would have the greatest impact on the end-to-end execution of the application and/or viewing what else is happening on the system during a performance bottleneck.

../../_images/rocprof-sys-timeline.png

Figure 2: rocprof-sys timeline trace example.

Please see the official rocprof-sys documentation for the latest information. Users are encouraged to submit issues, feature requests, and provide any additional feedback.

rocprof-compute#

rocprof-compute is a kernel performance analysis tool for High-Performance Computing (HPC) and Machine-Learning (ML) workloads using AMD Instinct™ GPUs. rocprof-compute uses rocprofiler-sdk to automate the collection of all available hardware performance counters via a single command. It also provides high-level performance analysis features including System Speed-of-Light, Hardware block Speed-of-Light, Memory Chart Analysis, Roofline Analysis, Baseline Comparisons, and more through a command-line interface (CLI), a graphical user interface (GUI), and a Text User Interface (TUI). These analyses help users understand and analyze bottlenecks in their computational workloads on AMD Instinct™ GPUs. rocprof-compute has evolved from the former Omniperf research tool from AMD.

Note that rocprof-compute collects hardware counters in multiple passes by default, and will therefore re-run the application during each pass to collect different sets of metrics. An upcoming feature in rocprof-compute will support single-pass collection of all necessary hardware counters using the iteration multiplexing mechanism.

../../_images/mca-diagram-example.png

Figure 3: rocprof-compute memory chart analysis panel.

In a nutshell, rocprof-compute provides details about hardware activity for a particular GPU kernel. For up-to-date information on available features, we highly encourage readers to view the official rocprof-compute documentation. We welcome community engagement on GitHub through issues, feature requests, contributions, and feedback.

Radeon™ GPU Profiler#

The Radeon™ GPU Profiler is a performance tool that can be used by traditional gaming and visualization developers to optimize DirectX 12 (DX12) and Vulkan™ for AMD RDNA™ hardware. The Radeon™ GPU Profiler (RGP) is a ground-breaking low-level optimization tool from AMD. It provides detailed timing information on Radeon™ Graphics using custom, built-in, hardware thread-tracing, allowing the developer deep inspection of GPU workloads. This unique tool generates easy-to-understand visualizations of how your DX12 and Vulkan™ games interact with the GPU at the hardware level. Profiling a game is both a quick and simple process using the Radeon™ Developer Panel together with the public display driver.

Note that the Radeon™ GPU Profiler does have support for OpenCL™ and HIP applications but it requires running on AMD RDNA™ GPUs in the Windows environment. Running HIP and OpenCL™ in the Windows environment is a whole blog series in itself and outside the current recommendation for HPC applications. For HPC workloads, we recommend programming with HIP in a Linux® environment on AMD Instinct™ GPUs and using rocprof-compute, rocprof-sys or rocprofv3 profiling tools on them.

AMD uProf#

AMD uProf (AMD MICRO-prof) is a software profiling analysis tool for x86 applications running on Windows, Linux® and FreeBSD operating systems and provides event information unique to the AMD “Zen”-based processors and AMD Instinct™ MI Series accelerators. AMD uProf enables the developer to better understand the limiters of application performance and evaluate improvements.

AMD uProf offers:

  • Performance Analysis to identify runtime performance bottlenecks of the application

  • System Analysis to monitor system performance metrics

  • Roofline analysis

  • Power Profiling to monitor thermal and power characteristics of the system

  • Energy Analysis to identify energy hot spots in the application (Windows only)

  • Remote Profiling to connect to remote Linux® systems (from a Windows host system), trigger collection/translation of data on the remote system and report it in local GUI

  • Support for AMD CDNA™ accelerators for AMD Instinct™ MI100, MI200, and MI300 Series devices

Other third-party tools#

In the High Performance Computing space, a number of third-party profiling tools have enabled support for ROCm™ and AMD Instinct™ GPUs. This provides a platform for users to maintain a vendor independent approach to profiling, providing an easy-to-use and high-level suite of functionality, that can potentially provide a unified profiling experience across numerous architectures. For users already familiar with these tools, it makes for another easy entry point into understanding performance of their workloads on AMD hardware.

Currently available third-party profiling tools include:

Superseded Tools#

As our platform grows, some tools naturally retire and make way for improved versions. This section documents those retired tools so you can easily see what’s changed and which modern tools now fill their role.

Profiling guide blog series#

To go beyond this introduction and see these tools applied to real-world workloads, check out our Performance Profiling on AMD GPUs blog series:

  • Part 1: Foundations — Tool overview, installation guidance, and prerequisites for both novice and advanced users.

  • Part 2: Basic Usage — Step-by-step walkthrough for beginners: identifying hot spots, roofline analysis, and occupancy-driven optimization using a HIP Jacobi solver.

  • Part 3: Advanced Usage — Multi-GPU and multi-node profiling with rocprof-sys, MPI tracing, network profiling, and deeper kernel analysis with rocprof-compute.

  • Part 4: Fortran OpenMP Offload Edition (coming soon) — Profiling techniques tailored for Fortran applications using OpenMP target offloading, demonstrated on the GenASiS astrophysics simulation code running on AMD MI300A APUs.

Useful resources#

The following are links to the GitHub repos and ROCm docs for the tools described above:

Updated on 10 April 2026

Updated the blog to reflect evolution of the profiling tools’ capabilities and naming.

Note

This blog was originally uploaded on April 12, 2023