Introduction to profiling tools for AMD hardware#
Getting a code to be functionally correct is not always enough. In many industries, it is also required that applications and their complex software stack run as efficiently as possible to meet operational demands. This is particularly challenging as hardware continues to evolve over time, and as a result codes may require further tuning. In practice, many application developers construct benchmarks, which are carefully designed to measure the performance, such as execution time, of a particular code within an operational-like setting. In other words: a good benchmark should be representative of the real work that needs to be done. These benchmarks are useful in that they provide insight into the characteristics of the application, and enable one to discover potential bottlenecks that could result in performance degradation during operational settings.
At face value, benchmarking sounds simple enough and is often interpreted as simply a comparison of execution time on a variety of different machines. However, extracting the most performance from emerging hardware usually means tuning a program repeatedly, and that takes more than measuring raw execution time: one needs to know where the program is spending most of its time and whether further improvements can be made. Heterogeneous systems, where programs run on both CPUs and GPUs, introduce additional complexities, and understanding the critical path and kernel execution becomes all the more important. Thus, performance tuning is a necessary component of the benchmarking process.
With AMD’s profiling tools, developers are able to gain important insight into how efficiently their application is utilizing hardware and effectively diagnose potential bottlenecks contributing to poor performance. Developers targeting AMD GPUs have multiple tools available depending on their specific profiling needs. This post serves as an introduction to the various profiling tools offered by AMD and why a developer might leverage one over the other. This post covers everything from low-level profiling tools to extensive profiling suites.
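In this introductory blog, we briefly describe the following tools that can aid in application analysis: rocprofv3 (with the rocprofiler-sdk library), rocprof-sys, rocprof-compute, Radeon™ GPU Profiler, AMD uProf, and some third-party tools.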
Terminology#
The following terms are used in this blog post:
| Term | Description |
|---|---|
| AMD “Zen” Core | AMD’s x86-64 processor core architecture design. Used by the AMD EPYC™ server processors and AMD Ryzen™, AMD Ryzen™ PRO, and AMD Threadripper™ PRO processor series. |
| RDNA™ | AMD’s GPU architecture optimized for graphically demanding workloads in gaming, visualization, and professional applications. Includes the AMD Radeon™ PRO W6000, W7000 series and Radeon™ AI PRO R9000 series (RDNA 3 and RDNA 4 architectures). |
| CDNA™ | AMD’s compute-dedicated GPU architecture optimized for accelerating HPC, ML/AI, and data center type workloads. Includes the AMD Instinct™ MI100, MI200, MI300, and MI350 series accelerators. |
| HIP | A C++ runtime API and kernel language that allows developers to create portable compute kernels/applications for AMD and NVIDIA GPUs from a single source code |
| Timeline Trace | A profiling approach where the durations of compute kernels and data transfers between devices are collected and visualized |
| Roofline Model | Hardware-agnostic methodology for quantifying a workload’s ability to saturate the given compute architecture in terms of floating-point compute and memory bandwidth |
| Hardware Counters | Individual metrics which track how many times a certain event occurs in the hardware, such as bytes moved from L2 cache or a 32-bit floating-point add performed |
What tools to use?#
The first step in profiling is determining the right tool for the job. Whether one wants to collect
traces on the CPU, GPU, or both, understand kernel behavior, or assess memory access patterns, performing such an analysis might
appear daunting for new users of AMD hardware.
We begin by identifying the architectures and operating systems supported by each of the profiling tools
provided by AMD. Almost all the tools in Table 1 support Linux® distros, and with the growing popularity of
Instinct™ GPUs, every tool has some capability to profile codes running on the CDNA™ architecture. However,
those who prefer Windows will be limited to using AMD uProf or Radeon™ GPU Profiler, depending on the targeted architecture.
| AMD Profiling Tools | AMD “Zen” Core | RDNA™ | CDNA™ | Windows | Linux® |
|---|---|---|---|---|---|
| rocprofv3 / rocprofiler-sdk | Not supported | ☆ | ★ | Not supported | ★ |
| rocprof-sys | ★ | ☆ | ★ | Not supported | ★ |
| rocprof-compute | Not supported | Not supported | ★ | Not supported | ★ |
| Radeon™ GPU Profiler | Not supported | ★ | ☆ | ★ | ☆ |
| AMD uProf | ★ | Not supported | ☆ | ★ | ☆ |
★ Full support | ☆ Partial support
Table 1: Profiler/architecture support and operating system needs.
The final choice of the tool on any platform depends on the profiling objective and the kind of analysis required. To make it simpler, we encourage the users to think of their objectives in terms of three questions as depicted in the flow diagram in Figure 1:
Where should I focus my time? Whether benchmarking a new application or getting started with a software package that has not yet been profiled, it is recommended to first identify hot spots in the application that may benefit from quick optimization. In such a scenario, it is best to start by collecting timelines and traces of the application. On Linux® platforms, rocprof-sys enables the collection of CPU and GPU traces and call-stack samples to help identify major hot spots. For GPU-only traces, rocprofv3 provides a lightweight and powerful alternative. On Windows, however, one may have to choose between AMD uProf and Radeon™ GPU Profiler depending on the targeted architecture.
How well am I using the hardware? The first step is to obtain a characterization of workloads that can provide a glimpse into how well the hardware is being utilized, for example by identifying which parts of your application are memory bound or compute bound. This can be accomplished through roofline profiling. At this stage, hot spots are typically well understood, and the interest is usually in the performance of a few key kernels or subroutines. At present, roofline profiling is only available through rocprof-compute on AMD Instinct™ GPUs and AMD uProf on AMD “Zen”-based processors.
Why am I seeing this performance? Once hot spots are identified and the initial assessment of performance is completed, the next phase likely involves profiling and collecting hardware metrics to understand where the observed performance is coming from. On AMD GPUs, tools such as rocprofv3, rocprof-sys, rocprof-compute, and AMD uProf use APIs from the rocprofiler-sdk library to gather GPU metrics. On Windows systems, one will have to rely on either AMD uProf or Radeon™ GPU Profiler.
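To make the memory-bound versus compute-bound distinction behind the second question concrete, here is a minimal sketch of the basic roofline bound: attainable performance is the smaller of the peak compute rate and the product of arithmetic intensity and peak memory bandwidth. The peak numbers below are hypothetical round values chosen for illustration, not the specification of any AMD device, and the byte count for the example kernel ignores caching effects.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Roofline bound in GFLOP/s: min(compute roof, memory roof).

    arithmetic_intensity is in FLOP per byte moved to or from memory.
    """
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbs)


def classify(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Label a kernel by which roof limits it."""
    # The ridge point is the intensity at which the two roofs intersect.
    ridge = peak_gflops / peak_bandwidth_gbs
    return "memory-bound" if arithmetic_intensity < ridge else "compute-bound"


# Hypothetical device: 20,000 GFLOP/s peak compute, 1,000 GB/s peak bandwidth.
PEAK_GFLOPS = 20_000.0
PEAK_BW_GBS = 1_000.0

# FP64 y[i] = a * x[i] + y[i]: 2 FLOPs per element, 24 bytes moved
# (read x, read y, write y, 8 bytes each).
axpy_intensity = 2 / 24

bound = attainable_gflops(axpy_intensity, PEAK_GFLOPS, PEAK_BW_GBS)
label = classify(axpy_intensity, PEAK_GFLOPS, PEAK_BW_GBS)
print(f"intensity = {axpy_intensity:.3f} FLOP/byte, bound = {bound:.1f} GFLOP/s, {label}")
```

With these assumed peaks the ridge point sits at 20 FLOP/byte, so a streaming kernel like this one is limited by memory bandwidth, and optimizing its arithmetic would not help.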
Quick Tip: The ROCm profiling tools (rocprofv3, rocprof-compute, and rocprof-sys), available on Linux® platforms, provide an easy-to-use interface for studying performance of the code across AMD hardware and should be treated as “go-to” profiling tools for performance tuning and benchmarking.
Figure 1: Use cases for a variety of AMD profiling tools.
Overview of profiling tools#
In this section, we provide a brief overview of the above-mentioned AMD tools and some third-party toolkits.
rocprofiler-sdk and rocprofv3#
The rocprofiler-sdk library is the foundation of AMD GPU profiling and tracing
infrastructure, shipped with ROCm™. It provides the low-level API for device activity tracing and hardware
performance counter collection. This library replaces the functionality of the legacy rocprofiler and roctracer libraries.
rocprofv3 is the command-line tool built on rocprofiler-sdk and replaces the legacy rocprof and rocprofv2 tools.
Some of the most useful features of rocprofv3 are:
Collect a variety of traces for ROCm-based applications (HIP API, HSA API, offloaded kernels, memory copies, scratch memory, marker API, etc.)
Collect device hardware counters for GPU kernel performance analysis
Find GPU hot spots quickly
Profile Python workloads efficiently
Starting in ROCm 7.0, rocprofv3 defaults to writing profiling data to a SQLite3-based database, and a new companion
tool called rocpd is available to generate CSV, OTF2, and/or Perfetto trace files from the collected database or generate summaries of profiled regions.
This “profile once, analyze many times” workflow enables efficient post-processing without re-running the application.
We strongly urge you to use rocprofv3 as the legacy rocprof tool will be deprecated in future ROCm releases.
For more details, see the rocprofv3 documentation
and the rocprofiler-sdk library documentation.
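Once kernel timings have been exported to CSV, finding GPU hot spots is largely a matter of aggregating durations per kernel. The sketch below does this with the Python standard library on a small inline sample. The column names (`Kernel_Name`, `Start_Timestamp`, `End_Timestamp`), the kernel names, and the timestamps are assumptions made for illustration; check the header of the file your ROCm version actually produces before adapting it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample shaped like a kernel-trace CSV export.
# Column names and values are assumptions; timestamps are in nanoseconds.
SAMPLE_CSV = """Kernel_Name,Start_Timestamp,End_Timestamp
jacobi_step,1000,5000
norm,5200,5700
jacobi_step,6000,10500
copy,10600,10800
"""


def hot_spots(csv_text):
    """Return (kernel, calls, total_ns, percent) rows, most expensive first."""
    totals = defaultdict(int)
    calls = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        duration = int(row["End_Timestamp"]) - int(row["Start_Timestamp"])
        totals[row["Kernel_Name"]] += duration
        calls[row["Kernel_Name"]] += 1
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (name, calls[name], total, 100.0 * total / grand_total)
        for name, total in ranked
    ]


for name, count, total_ns, percent in hot_spots(SAMPLE_CSV):
    print(f"{name:12s} calls={count} total={total_ns} ns ({percent:.1f}%)")
```

In this made-up sample, `jacobi_step` accounts for 8500 of the 9200 ns of kernel time, so it would be the first candidate for deeper analysis.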
rocprof-sys#
The ROCm systems profiler, also known as rocprof-sys, is best suited to collect
host, device, and communication (MPI) activity in one comprehensive,
unified trace of your application’s run. rocprof-sys uses libraries that tools like
amd-smi and
perf are built upon to provide a holistic
view of the system where your application ran.
With configurable runtime options, you can use it for call-stack sampling, binary instrumentation,
causal profiling, and hardware counter collection. Simply put, this is the tool you would use
if you want to understand what is happening in your application on the host and
on the device. The output trace is in protobuf format (.proto) for easy
visualization using Perfetto UI in a browser.
rocprof-sys has evolved from the former Omnitrace research tool from AMD.
It now supports the rocprofiler-sdk library, bringing advanced capabilities such as OMPT support for
tracing Fortran codes with OpenMP® offload, writing to the rocpd output format, and network performance profiling.
When analyzing the performance of an application, it is always best to NOT
assume you know where the performance bottlenecks are and why they are
happening. rocprof-sys is the ideal tool for characterizing where optimization would
have the greatest impact on the end-to-end execution of the application and/or
viewing what else is happening on the system during a performance bottleneck.
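The point about end-to-end impact can be quantified with Amdahl's law: if a region accounts for a fraction p of the total run time and is made s times faster, the overall speedup is 1 / ((1 - p) + p / s). The sketch below applies this to two hypothetical regions; the fractions are invented for illustration and would in practice come from a measured trace.

```python
def overall_speedup(fraction, region_speedup):
    """Amdahl's law: end-to-end speedup when a region taking `fraction`
    of the run time becomes `region_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if region_speedup <= 0.0:
        raise ValueError("region_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / region_speedup)


# Hypothetical regions: a kernel assumed to be the bottleneck (5% of the
# run time) and a data-movement phase that a trace shows is 60% of it.
assumed_hot_kernel = overall_speedup(0.05, 10.0)
measured_transfer = overall_speedup(0.60, 2.0)

print(f"10x faster kernel at 5% of run time: {assumed_hot_kernel:.3f}x overall")
print(f"2x faster transfers at 60% of run time: {measured_transfer:.3f}x overall")
```

A modest improvement to the region that dominates the run time beats a large improvement to a region that barely matters, which is why measuring first is worth the effort.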
Figure 2: rocprof-sys timeline trace example.
Please see the official rocprof-sys documentation for the latest information. Users are encouraged to submit issues, feature requests, and provide any additional feedback.
rocprof-compute#
rocprof-compute is a kernel performance analysis tool
for High-Performance Computing (HPC) and Machine-Learning (ML) workloads using AMD Instinct™ GPUs.
rocprof-compute uses rocprofiler-sdk to automate the collection of all available hardware performance counters via a single command. It also provides high-level performance analysis features, including System Speed-of-Light, Hardware Block Speed-of-Light, Memory Chart
Analysis, Roofline Analysis, Baseline Comparisons, and more, through a command-line interface (CLI), a graphical user interface (GUI), and a text user interface (TUI). These analyses help
users understand and analyze bottlenecks in their computational workloads on AMD Instinct™
GPUs. rocprof-compute has evolved from the former Omniperf research tool from AMD.
Note that rocprof-compute collects hardware counters in multiple passes by default,
and will therefore re-run the application during each pass to collect different sets of metrics.
An upcoming feature in rocprof-compute will support single-pass collection of all
necessary hardware counters using the iteration multiplexing mechanism.
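A simplified way to see why multiple passes are needed: if only a limited number of counters can be recorded at the same time, a longer request list has to be split into groups, with one application run per group. The sketch below illustrates that idea only. The counter names and the slot limit are hypothetical, real hardware imposes limits per hardware block rather than one global limit, and this is not how rocprof-compute schedules its passes internally.

```python
def plan_passes(counters, slots_per_pass):
    """Split a counter list into consecutive groups of at most
    `slots_per_pass` counters; each group needs one application run."""
    if slots_per_pass < 1:
        raise ValueError("slots_per_pass must be at least 1")
    return [
        counters[i:i + slots_per_pass]
        for i in range(0, len(counters), slots_per_pass)
    ]


# Hypothetical request: 10 placeholder counters, 4 collectable per run.
requested = [f"COUNTER_{i}" for i in range(10)]
passes = plan_passes(requested, slots_per_pass=4)

for index, group in enumerate(passes, start=1):
    print(f"pass {index}: {', '.join(group)}")
print(f"application runs required: {len(passes)}")
```

Because each pass is a separate run, counters from different passes are only comparable when the application behaves reproducibly from run to run.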
Figure 3: rocprof-compute memory chart analysis panel.
In a nutshell, rocprof-compute provides details about hardware activity for a particular GPU kernel.
For up-to-date information on available features, we highly encourage readers to view
the official rocprof-compute documentation.
We welcome community engagement on GitHub through issues,
feature requests, contributions, and feedback.
Radeon™ GPU Profiler#
The Radeon™ GPU Profiler (RGP) is a low-level performance tool that traditional gaming and visualization developers can use to optimize DirectX 12 (DX12) and Vulkan™ applications for AMD RDNA™ hardware. It provides detailed timing information on Radeon™ Graphics using custom, built-in hardware thread tracing, allowing the developer deep inspection of GPU workloads. The tool generates easy-to-understand visualizations of how your DX12 and Vulkan™ games interact with the GPU at the hardware level. Profiling a game is a quick and simple process using the Radeon™ Developer Panel together with the public display driver.
Note that the Radeon™ GPU Profiler does have support for OpenCL™ and HIP applications but
it requires running on AMD RDNA™ GPUs in the Windows environment. Running
HIP and OpenCL™ in the Windows environment is a whole blog series in itself and
outside the current recommendation for HPC applications. For HPC workloads, we
recommend programming with HIP in a Linux® environment on AMD Instinct™ GPUs and
using rocprof-compute, rocprof-sys or rocprofv3 profiling tools on them.
AMD uProf#
AMD uProf offers the following capabilities:
Performance Analysis to identify runtime performance bottlenecks of the application
System Analysis to monitor system performance metrics
Roofline analysis
Power Profiling to monitor thermal and power characteristics of the system
Energy Analysis to identify energy hot spots in the application (Windows only)
Remote Profiling to connect to remote Linux® systems (from a Windows host system), trigger collection/translation of data on the remote system and report it in local GUI
Support for AMD CDNA™ architecture on AMD Instinct™ MI100, MI200, and MI300 Series accelerators
Other third-party tools#
In the High Performance Computing space, a number of third-party profiling tools have enabled support for ROCm™ and AMD Instinct™ GPUs. These tools let users maintain a vendor-independent approach to profiling: they offer an easy-to-use, high-level suite of functionality that can potentially provide a unified profiling experience across numerous architectures. For users already familiar with these tools, they are another easy entry point into understanding the performance of their workloads on AMD hardware.
Currently available third-party profiling tools include:
Cray Performance and Analysis Tools (only for CrayOS platforms)
Superseded Tools#
As our platform grows, some tools naturally retire and make way for improved versions. This section documents those retired tools so you can easily see what’s changed and which modern tools now fill their role.
Omnitrace: replaced by rocprof-sys
Omniperf: replaced by rocprof-compute
rocprof and rocprofv2: replaced by rocprofv3
rocprofiler and roctracer libraries: replaced by the rocprofiler-sdk library
Profiling guide blog series#
To go beyond this introduction and see these tools applied to real-world workloads, check out our Performance Profiling on AMD GPUs blog series:
Part 1, Foundations: Tool overview, installation guidance, and prerequisites for both novice and advanced users.
Part 2, Basic Usage: Step-by-step walkthrough for beginners, covering hot spot identification, roofline analysis, and occupancy-driven optimization using a HIP Jacobi solver.
Part 3, Advanced Usage: Multi-GPU and multi-node profiling with rocprof-sys, MPI tracing, network profiling, and deeper kernel analysis with rocprof-compute.
Part 4, Fortran OpenMP Offload Edition (coming soon): Profiling techniques tailored for Fortran applications using OpenMP target offloading, demonstrated on the GenASiS astrophysics simulation code running on AMD MI300A APUs.
Useful resources#
The following are links to the GitHub repos and ROCm docs for the tools described above:
rocprofiler-sdk / rocprofv3
rocprof-sys
rocprof-compute
AMD uProf
Updated on 10 April 2026
Updated the blog to reflect evolution of the profiling tools’ capabilities and naming.
Note
This blog was originally uploaded on April 12, 2023