Getting to Know Your GPU: A Deep Dive into AMD SMI

17 Sep, 2024 by Matt Elliott

For system administrators and power users working with AMD hardware, performance optimization and efficient monitoring of resources are paramount. The AMD System Management Interface command-line tool, amd-smi, addresses these needs.

amd-smi is a versatile command-line utility designed to manage and monitor AMD hardware, with a primary focus on GPUs. As the future replacement for rocm-smi, amd-smi is poised to become the primary tool for AMD hardware management across a wide range of devices. For those new to hardware management or transitioning from other tools, amd-smi provides an extensive set of features to help optimize AMD hardware usage.

In this blog post, we will provide you with a practical walkthrough of amd-smi. We will show you, step-by-step, how to verify the installation of amd-smi and how to use its main features: access your AMD GPU’s information and metrics, monitor its performance and running processes in real-time, configure its hardware parameters, inspect your AMD GPU’s topology and memory, and more.

Understanding system management interfaces

System Management Interfaces (SMIs) are fundamental to modern hardware management and monitoring. These interfaces serve as APIs that provide a standardized way to interact with hardware components. SMIs offer insights into performance and status while allowing for a degree of control.

Users typically interact with SMIs through command-line tools like amd-smi or through programming libraries, rather than calling the underlying API directly. This lets them integrate hardware management into their own monitoring and automation frameworks, which supports efficient resource utilization and real-time alerts.

While capabilities might vary between vendors and hardware types, the overarching purpose of SMIs remains consistent: provide users with visibility into and control over their hardware.

Key features of amd-smi

  • Device information: Quickly retrieve detailed information about AMD GPUs

  • Performance monitoring: Real-time monitoring of GPU utilization, memory, temperature, and power consumption

  • Process information: Identify which processes are using GPUs

  • Configuration management: Adjust GPU settings like clock speeds and power limits

  • Error reporting: Monitor and report GPU errors for proactive maintenance

Getting started with amd-smi

On systems with AMD ROCm™ installed, amd-smi should already be available. Verify installation by running this command:

amd-smi version

If it’s not installed, use the system package manager to install amd-smi-lib. For example:

  • Install on Ubuntu: sudo apt install amd-smi-lib

  • Install on Red Hat Enterprise Linux (RHEL): sudo dnf install amd-smi-lib

  • Install on SUSE Linux Enterprise Server (SLES): sudo zypper install amd-smi-lib
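The check-then-install flow above can be scripted. The following is a minimal sketch for a POSIX shell; the helper name tool_status is made up for this example and is not part of amd-smi. It only reports whether a command is on the PATH and leaves the choice of package manager to you.

```shell
# tool_status NAME: print "found" if NAME is on the PATH, "missing" otherwise.
# (tool_status is a hypothetical helper name used only in this sketch.)
tool_status() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found"
    else
        echo "missing"
    fi
}

# Print the version when amd-smi is present, otherwise point at the package.
if [ "$(tool_status amd-smi)" = "found" ]; then
    amd-smi version
else
    echo "amd-smi not found; install the amd-smi-lib package with your package manager"
fi
```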

Basic usage

Here are some common commands and tips for using amd-smi.

List all GPUs

The amd-smi list command displays a list of all AMD GPUs in your system, along with basic information like their IDs, PCIe bus addresses, and UUIDs.

$ amd-smi list
GPU: 0
    BDF: 0000:05:00.0
    UUID: afff74a1-0000-1000-8054-e92b0a5d57c8
 
GPU: 1
    BDF: 0000:26:00.0
    UUID: 0aff74a1-0000-1000-805b-ce698de95724
 
GPU: 2
    BDF: 0000:46:00.0
    UUID: 97ff74a1-0000-1000-8065-fa81273af9ce
 
[output truncated]
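For scripting, it can be handy to flatten this listing into one line per device. The sketch below assumes the GPU/BDF/UUID layout shown above; gpu_rows is a hypothetical helper name, and other amd-smi versions may print additional fields that it simply ignores.

```shell
# gpu_rows: read `amd-smi list` text on stdin and print "GPU BDF UUID" per device.
# Assumes each device block lists GPU, then BDF, then UUID, as in the sample above.
gpu_rows() {
    awk '
        $1 == "GPU:"  { gpu = $2 }
        $1 == "BDF:"  { bdf = $2 }
        $1 == "UUID:" { print gpu, bdf, $2 }
    '
}

# Usage (on a system with amd-smi installed):
#   amd-smi list | gpu_rows
```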

Display detailed GPU information

The amd-smi static command provides comprehensive static information about GPUs, including hardware details, driver versions, and capabilities.

$ amd-smi static
GPU: 0
    ASIC:
        MARKET_NAME: MI300X-O
        VENDOR_ID: 0x1002
        VENDOR_NAME: Advanced Micro Devices Inc. [AMD/ATI]
        SUBVENDOR_ID: 0x1002
        DEVICE_ID: 0x74a1
        REV_ID: 0x00
        ASIC_SERIAL: 0xAF54E92B0A5D57C8
        OAM_ID: 7
    BUS:
        BDF: 0000:05:00.0
        MAX_PCIE_WIDTH: 16
        MAX_PCIE_SPEED: 32 GT/s
        PCIE_INTERFACE_VERSION: Gen 5
        SLOT_TYPE: OAM
 
[output truncated]

Display detailed GPU metrics

Use amd-smi metric to view real-time metrics such as GPU utilization, temperature, power consumption, and memory usage.

$ amd-smi metric
GPU: 0
    USAGE:
        GFX_ACTIVITY: 100 %
        UMC_ACTIVITY: 0 %
        MM_ACTIVITY: N/A
        VCN_ACTIVITY: [0 %, 0 %, 0 %, 0 %]
    POWER:
        SOCKET_POWER: 234 W
        GFX_VOLTAGE: N/A mV
        SOC_VOLTAGE: N/A mV
        MEM_VOLTAGE: N/A mV
        POWER_MANAGEMENT: ENABLED
        THROTTLE_STATUS: UNTHROTTLED
    CLOCK:
        GFX_0:
            CLK: 2102 MHz
            MIN_CLK: 500 MHz
            MAX_CLK: 2102 MHz
            CLK_LOCKED: DISABLED
            DEEP_SLEEP: DISABLED
        GFX_1:
            CLK: 2101 MHz
            MIN_CLK: 500 MHz
            MAX_CLK: 2102 MHz
            CLK_LOCKED: DISABLED
            DEEP_SLEEP: DISABLED
        GFX_2:
            CLK: 2107 MHz
            MIN_CLK: 500 MHz
            MAX_CLK: 2102 MHz
            CLK_LOCKED: DISABLED
            DEEP_SLEEP: DISABLED
[output truncated]
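When only a couple of these values matter, the nested output can be reduced to a single log-friendly line per GPU. This is a rough sketch that assumes the key names and ordering in the sample above (GFX_ACTIVITY appears before SOCKET_POWER within each GPU block); metric_summary is a hypothetical helper name.

```shell
# metric_summary: read `amd-smi metric` text on stdin and print one line per GPU
# with the graphics activity (percent) and socket power (watts).
# Assumes GFX_ACTIVITY precedes SOCKET_POWER inside each GPU block.
metric_summary() {
    awk '
        $1 == "GPU:"          { gpu = $2 }
        $1 == "GFX_ACTIVITY:" { gfx = $2 }
        $1 == "SOCKET_POWER:" { print "gpu=" gpu, "gfx_activity_pct=" gfx, "socket_power_w=" $2 }
    '
}

# Usage (on a system with amd-smi installed):
#   amd-smi metric | metric_summary
```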

Performance monitoring

The amd-smi monitor command displays utilization metrics for GPUs, memory, power, PCIe bandwidth, and more. By default, amd-smi monitor outputs 18 metrics for every GPU. Passing in specific arguments limits the types of metrics displayed.

-p, --power-usage            Monitor power usage in Watts
-t, --temperature            Monitor temperature in Celsius
-u, --gfx                    Monitor graphics utilization (%) and clock (MHz)
-m, --mem                    Monitor memory utilization (%) and clock (MHz)
-n, --encoder                Monitor encoder utilization (%) and clock (MHz)
-d, --decoder                Monitor decoder utilization (%) and clock (MHz)
-s, --throttle-status        Monitor thermal throttle status
-e, --ecc                    Monitor ECC single bit, ECC double bit, and PCIe replay error counts
-v, --vram-usage             Monitor memory usage in MB
-r, --pcie                   Monitor PCIe bandwidth in Mb/s

For example, to monitor power usage, GPU utilization, temperature, and memory utilization, run amd-smi monitor -putm.

$ amd-smi monitor -putm
GPU  POWER  GPU_TEMP  MEM_TEMP  GFX_UTIL  GFX_CLOCK  MEM_UTIL  MEM_CLOCK
  0  182 W     42 °C     41 °C      83 %   1613 MHz       0 %   1230 MHz
  1  143 W     40 °C     39 °C      13 %    358 MHz       0 %   1173 MHz
  2  117 W     41 °C     40 °C       0 %    120 MHz       0 %    900 MHz
  3  116 W     40 °C     38 °C       1 %    134 MHz       0 %    913 MHz
  4  118 W     42 °C     40 °C       0 %    120 MHz       0 %    900 MHz
  5  118 W     39 °C     38 °C       0 %    120 MHz       0 %    900 MHz
  6  116 W     41 °C     41 °C       0 %    120 MHz       0 %    900 MHz
  7  118 W     40 °C     37 °C       0 %    120 MHz       0 %    900 MHz
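A table like this is easy to post-process for simple alerts. The two helpers below are a sketch, not part of amd-smi: total_power and busy_gpus are hypothetical names, and the column positions are assumptions taken from the -putm sample above (after splitting on whitespace, field 2 is POWER and field 8 is GFX_UTIL). A different set of flags shifts the columns.

```shell
# Both helpers read the table printed by `amd-smi monitor -putm` on stdin.
# Rows are recognised by a numeric GPU index in the first field, which skips the header.

# total_power: sum the POWER column across all GPU rows.
total_power() {
    awk '$1 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0, "W" }'
}

# busy_gpus MIN: print the index of every GPU whose GFX_UTIL is above MIN percent.
busy_gpus() {
    awk -v min="$1" '$1 ~ /^[0-9]+$/ && $8 + 0 > min + 0 { print $1 }'
}

# Usage (on a system with amd-smi installed):
#   amd-smi monitor -putm | total_power
#   amd-smi monitor -putm | busy_gpus 10
```

Fed the eight rows shown above, total_power would report 1028 W and busy_gpus 10 would list GPUs 0 and 1.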

View running processes

The amd-smi process command shows details about processes running on the GPU, including their PIDs, memory usage, and GPU utilization. Running the command with sudo includes processes owned by other users.

$ sudo amd-smi process
GPU: 0
    PROCESS_INFO:
        NAME: pt_main_thread
        PID: 207590
        MEMORY_USAGE:
            GTT_MEM: 2.0 MB
            CPU_MEM: 202.0 MB
            VRAM_MEM: 7.3 GB
        MEM_USAGE: 7.5 GB
        USAGE:
            GFX: 0 ns
            ENC: 0 ns
 
GPU: 1
    PROCESS_INFO:
        NAME: pt_main_thread
        PID: 207591
        MEMORY_USAGE:
            GTT_MEM: 2.0 MB
            CPU_MEM: 202.0 MB
            VRAM_MEM: 7.4 GB
        MEM_USAGE: 7.6 GB
 
[output truncated]

Set configurable hardware parameters

The amd-smi set command can be used to change hardware parameters such as fan speed, memory and compute partitioning, and power limits. For example, the following command adjusts the power limit of a GPU:

amd-smi set -g 0 -o 650

This sets the power limit of GPU 0 to 650 watts. Remember to check the supported power range for your specific GPU model before making adjustments.

Use the amd-smi reset command to remove the custom power limit:

amd-smi reset -g 0 -o
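Because an out-of-range value should never reach the hardware, a script can guard the set call with a range check. This is a minimal sketch: cap_in_range is a hypothetical helper, and the minimum and maximum passed to it are placeholders that you would replace with the supported range for your GPU model.

```shell
# cap_in_range WATTS MIN MAX: succeed only if MIN <= WATTS <= MAX (whole watts).
# MIN and MAX are supplied by the caller; this sketch does not query the GPU for them.
cap_in_range() {
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Usage, with 0 and 750 as placeholder limits (not values reported by amd-smi):
#   cap_in_range 650 0 750 && amd-smi set -g 0 -o 650
```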

Additional capabilities

Run amd-smi --help to view the full list of available commands.

AMD-SMI Commands:
 
    version           Display version information
    list              List GPU information
    static            Gets static information about the specified GPU
    firmware (ucode)  Gets firmware information about the specified GPU
    bad-pages         Gets bad page information about the specified GPU
    metric            Gets metric/performance information about the specified GPU
    process           Lists general process information running on the specified GPU
    event             Displays event information for the given GPU
    topology          Displays topology information of the devices
    set               Set options for devices
    reset             Reset options for devices
    monitor           Monitor metrics for target devices
    xgmi              Displays xgmi information of the devices

Every command supports modifiers that output data as comma-separated values (CSV), as JavaScript Object Notation (JSON), or directly to a file.

Command Modifiers:
  --json                   Displays output in JSON format (human readable by default).
  --csv                    Displays output in CSV format (human readable by default).
  --file FILE              Saves output into a file on the provided path (stdout by default).

For example, the --csv argument passed to amd-smi process outputs process information as comma-separated values.

$ sudo amd-smi process --csv
gpu,name,pid,gtt_mem,cpu_mem,vram_mem,mem_usage,gfx,enc
0,pt_main_thread,207590,2134016,211795968,7889485824,8103415808,0,0
1,pt_main_thread,207591,2134016,211795968,7923122176,8137052160,0,0
2,pt_main_thread,207589,2134016,211795968,7889575936,8103505920,0,0
3,pt_main_thread,207588,2166784,211763200,7822258176,8036188160,0,0
4,pt_main_thread,207595,2134016,211795968,7889514496,8103444480,0,0
5,pt_main_thread,207597,2134016,211795968,7822381056,8036311040,0,0
6,pt_main_thread,207594,2134016,211795968,7923064832,8136994816,0,0
7,pt_main_thread,207593,2134016,211795968,7889465344,8103395328,0,0

The output can be piped to the column command to format the values as a table.

$ sudo amd-smi process --csv | column -t -s,
gpu  name            pid     gtt_mem  cpu_mem    vram_mem    mem_usage   gfx  enc
0    pt_main_thread  207590  2134016  211795968  7889485824  8103415808  0    0
1    pt_main_thread  207591  2134016  211795968  7923122176  8137052160  0    0
2    pt_main_thread  207589  2134016  211795968  7889575936  8103505920  0    0
3    pt_main_thread  207588  2166784  211763200  7822258176  8036188160  0    0
4    pt_main_thread  207595  2134016  211795968  7889514496  8103444480  0    0
5    pt_main_thread  207597  2134016  211795968  7822381056  8036311040  0    0
6    pt_main_thread  207594  2134016  211795968  7923064832  8136994816  0    0
7    pt_main_thread  207593  2134016  211795968  7889465344  8103395328  0    0
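The CSV values are raw byte counts, so a small awk filter can make them readable again. The sketch below looks columns up by the header names in the sample above (gpu, pid, vram_mem) rather than by position; vram_gib is a hypothetical helper name. Dividing by 2^30 reproduces the figures in the human-readable output: 7889485824 bytes comes out at about 7.3, matching the 7.3 GB shown for GPU 0 earlier.

```shell
# vram_gib: read `amd-smi process --csv` on stdin and print "GPU PID VRAM" per
# process, with VRAM converted from bytes to GiB (bytes / 2^30).
# Columns are located by header name, so extra or reordered columns are tolerated.
vram_gib() {
    awk -F, '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        { printf "%s %s %.1f GiB\n", $(col["gpu"]), $(col["pid"]), $(col["vram_mem"]) / 1073741824 }
    '
}

# Usage (on a system with amd-smi installed):
#   sudo amd-smi process --csv | vram_gib
```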

JSON output can be combined with the jq command to filter results. This example command filters the output from amd-smi static to display only the VRAM information for the first GPU in the system.

$ amd-smi static --json | jq '.[0]["vram"]'
{
  "type": "HBM",
  "vendor": "N/A",
  "size": {
    "value": 196592,
    "unit": "MB"
  }
}

Display firmware information

Run amd-smi firmware to view firmware information for all system GPUs.

$ amd-smi firmware
GPU: 0
    FW_LIST:
        FW 0:
            FW_ID: CP_MEC1
            FW_VERSION: 147
        FW 1:
            FW_ID: CP_MEC2
            FW_VERSION: 147
        FW 2:
            FW_ID: RLC
            FW_VERSION: 64
        FW 3:
            FW_ID: SDMA0
            FW_VERSION: 19
        FW 4:
            FW_ID: SDMA1
            FW_VERSION: 19
        FW 5:
            FW_ID: VCN
            FW_VERSION: 61.13.00.C
        FW 6:
            FW_ID: PSP_SOSDRV
            FW_VERSION: 36.02.4C
        FW 7:
            FW_ID: TA_RAS
            FW_VERSION: 20.00.00.0D
        FW 8:
            FW_ID: TA_XGMI
            FW_VERSION: 20.00.01.13
        FW 9:
            FW_ID: PM
            FW_VERSION: 85.110.0
 
[output truncated]

Inspect memory status

Think of memory like a book with many pages, with each page representing a location where data is stored. If a page becomes “bad,” it means that the data stored on that page can’t be read or written correctly. Bad pages can occur due to electrical surges, wear and tear, or a variety of other reasons. When a GPU detects a bad page, it marks that page as unusable to prevent errors from spreading. Bad pages can be viewed with the amd-smi bad-pages command.

$ amd-smi bad-pages
GPU: 0
    RETIRED: No bad pages found.
    PENDING: No bad pages found.
    UN_RES: No bad pages found.
 
GPU: 1
    RETIRED: No bad pages found.
    PENDING: No bad pages found.
    UN_RES: No bad pages found.
 
[output truncated]
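For proactive maintenance, this check can be turned into a simple health probe. The sketch below rests on one assumption that the sample above does not confirm: that any RETIRED, PENDING, or UN_RES line without the text "No bad pages found." means bad pages were reported. The helper name bad_page_gpus is hypothetical, and the exact format of a non-empty entry may differ on your system.

```shell
# bad_page_gpus: read `amd-smi bad-pages` text on stdin and print the index of
# each GPU that has a RETIRED, PENDING, or UN_RES line lacking the
# "No bad pages found" message. Each GPU is printed at most once.
bad_page_gpus() {
    awk '
        $1 == "GPU:" { gpu = $2 }
        $1 ~ /^(RETIRED|PENDING|UN_RES):$/ && $0 !~ /No bad pages found/ && !(gpu in seen) {
            seen[gpu] = 1
            print gpu
        }
    '
}

# Usage (on a system with amd-smi installed):
#   amd-smi bad-pages | bad_page_gpus
```

An empty result means every listed category reported no bad pages.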

Display GPU topology information

Run amd-smi topology to display topology information such as link accessibility, number of hops/relative weight between GPUs, link type, and NUMA bandwidth information.

$ amd-smi topology
ACCESS TABLE:
             0000:05:00.0 0000:26:00.0 0000:46:00.0 0000:65:00.0 0000:85:00.0 0000:a6:00.0 0000:c6:00.0 0000:e5:00.0
0000:05:00.0 ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED
0000:26:00.0 ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED
0000:46:00.0 ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED
0000:65:00.0 ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED
0000:85:00.0 ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED
0000:a6:00.0 ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED
0000:c6:00.0 ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED
0000:e5:00.0 ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED      ENABLED
 
WEIGHT TABLE:
             0000:05:00.0 0000:26:00.0 0000:46:00.0 0000:65:00.0 0000:85:00.0 0000:a6:00.0 0000:c6:00.0 0000:e5:00.0
0000:05:00.0 0            15           15           15           15           15           15           15
0000:26:00.0 15           0            15           15           15           15           15           15
0000:46:00.0 15           15           0            15           15           15           15           15
0000:65:00.0 15           15           15           0            15           15           15           15
0000:85:00.0 15           15           15           15           0            15           15           15
0000:a6:00.0 15           15           15           15           15           0            15           15
0000:c6:00.0 15           15           15           15           15           15           0            15
0000:e5:00.0 15           15           15           15           15           15           15           0
 
[output truncated]
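The matrix form is compact but awkward to filter. As a sketch, the helper below flattens the WEIGHT TABLE into one line per GPU pair so it can be sorted or searched with standard tools. It assumes the layout shown above (a "WEIGHT TABLE:" line, a header row of BDFs, one row per GPU, then a blank line); weight_pairs is a hypothetical helper name.

```shell
# weight_pairs: read `amd-smi topology` text on stdin and print
# "SOURCE DESTINATION WEIGHT" for every off-diagonal entry of the WEIGHT TABLE.
weight_pairs() {
    awk '
        /^WEIGHT TABLE:/         { in_table = 1; have_header = 0; next }
        /TABLE:/ || NF == 0      { in_table = 0 }
        in_table && !have_header { n = NF; for (i = 1; i <= n; i++) dst[i] = $i; have_header = 1; next }
        in_table                 { for (i = 1; i <= n; i++) if ($1 != dst[i]) print $1, dst[i], $(i + 1) }
    '
}

# Usage (on a system with amd-smi installed):
#   amd-smi topology | weight_pairs
```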

Monitor hardware events

Use the amd-smi event command to view event information for all GPUs in the system. After the tool launches, it continues to listen for and display GPU events until stopped. Event types include thermal throttling events, hardware resets, and memory read errors.

$ amd-smi event
EVENT LISTENING:
 
Press q and hit ENTER when you want to stop

Transitioning from rocm-smi

Users familiar with rocm-smi will find that amd-smi offers similar functionality with some enhancements. Here’s a quick comparison of some common commands:

Task                                      rocm-smi               amd-smi
List GPUs                                 rocm-smi -i            amd-smi list
Show utilization                          rocm-smi               amd-smi monitor
Show memory info                          rocm-smi --showmemuse  amd-smi monitor -m -v
Show detailed hardware info and settings  rocm-smi -a            amd-smi static

While the syntax differs slightly, amd-smi generally offers more detailed output and additional features compared to rocm-smi.
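When migrating existing scripts, the comparison above can be encoded as a small lookup. This sketch covers only the four commands in the comparison; amd_equiv is a hypothetical helper name, and anything outside that list is reported as unmapped rather than guessed.

```shell
# amd_equiv "ROCM_SMI_COMMAND": print the amd-smi command from the comparison
# above, or fail with a message on stderr when no mapping is listed.
amd_equiv() {
    case "$1" in
        "rocm-smi -i")           echo "amd-smi list" ;;
        "rocm-smi")              echo "amd-smi monitor" ;;
        "rocm-smi --showmemuse") echo "amd-smi monitor -m -v" ;;
        "rocm-smi -a")           echo "amd-smi static" ;;
        *)                       echo "no mapping listed for: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   amd_equiv "rocm-smi --showmemuse"
```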

Note: While rocm-smi will continue to receive bug fixes and maintenance updates, new features and additional hardware support will be prioritized for amd-smi.

Conclusion

In this blog post we presented a practical deep dive into amd-smi, showing you how to use and access its main features and functionalities. Whether you’re managing a large-scale computing environment or optimizing a single server, amd-smi offers the insights and control needed to maximize the potential of AMD GPUs. To learn more about amd-smi and its capabilities, visit the amd-smi tool documentation.