Understanding RCCL Bandwidth and xGMI Performance on AMD Instinct™ MI300X#

March 02, 2025 by Jayacharan Kolla, Pedram Alizadeh, Gilbert Lee.

Efficient inter-GPU communication is the backbone of high-performance AI and HPC workloads, and technologies like RCCL and xGMI play pivotal roles in providing it. However, limitations in achieving theoretical peak bandwidth have raised questions about performance bottlenecks. In this blog we explain why the theoretical maximum bandwidth cannot be reached in multi-GPU clusters, and walk you through a set of diagnostics and performance-tuning strategies that will help you optimize RCCL and xGMI bandwidth on AMD MI300X systems. We first introduce xGMI and its performance constraints, then RCCL and its bandwidth limitations, and finally cover several practical benchmarks and best practices for maximizing RCCL efficiency.

xGMI Overview and Performance Expectations#

What is xGMI?#

xGMI (External Global Memory Interconnect) is AMD's high-speed GPU-to-GPU interconnect based on Infinity Fabric™ technology. It is designed to enable efficient peer-to-peer GPU communication for multi-GPU AI and HPC workloads, and it is crucial for ensuring that large-scale AI models and distributed HPC simulations run efficiently across multiple GPUs. Each GPU in an AMD Instinct MI300X system is connected to its seven peer GPUs via xGMI links, forming a fully connected mesh with high-bandwidth, low-latency communication. For more information on the MI300X platform, refer to this datasheet.
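
If you want to confirm this topology on a node you have access to, the rocm-smi utility that ships with ROCm can report the link type and hop count between GPU pairs. A minimal check (assuming a standard ROCm installation with rocm-smi on your PATH) might look like:

# Show the GPU-to-GPU topology; on MI300X every GPU pair should be reported as a direct XGMI link
rocm-smi --showtopo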

Theoretical Performance Claims vs. Achievable#

xGMI’s raw bandwidth is rated up to 64 GB/s for each point-to-point link between GPUs. In an ideal situation, this means that each GPU in a system can communicate with every other GPU at this peak rate. However, in reality, this bandwidth is constrained by factors like CRC bit correction and protocol overhead, which reduce the usable bandwidth to approximately 48 GB/s per link. This represents about 75% of the theoretical peak bandwidth, which is a critical factor when evaluating performance in practical workloads.

Specification                                   Theoretical Max           Realized Performance
Aggregated unidirectional bandwidth per GPU     448 GB/s (7 × 64 GB/s)    315-336 GB/s (7 × 45-48 GB/s)
GPU-GPU unidirectional BW via xGMI link         64 GB/s                   45-48 GB/s
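
The aggregate figures above are simply the per-link values multiplied by the seven peer links; here is a quick back-of-the-envelope check (plain arithmetic, not output from any AMD tool):

# Theoretical vs. realized aggregate unidirectional bandwidth per GPU
python3 -c "print('theoretical: %d GB/s, realized: %d-%d GB/s' % (7*64, 7*45, 7*48))"
# theoretical: 448 GB/s, realized: 315-336 GB/s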

Why Does RCCL Bandwidth Hover Between 310-330 GB/s in the Best-Case Scenario?#

Although the realized aggregated unidirectional bandwidth per GPU is about 336 GB/s, in practical AI workloads the slowest link limits the total available bandwidth.

Example: Let's test the xGMI link performance between GPUs in a single-node MI300X machine. For this experiment we will use TransferBench, a benchmarking utility designed to measure the performance of simultaneous data transfers between user-specified devices, such as CPUs and GPUs. Here, TransferBench is used to measure the inter-GPU xGMI scale-up link bandwidth. Refer to these steps to install and build TransferBench.
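
For quick reference, a typical clone-and-build flow looks roughly like the sketch below (this assumes a ROCm installation under /opt/rocm; the TransferBench documentation linked above remains the authoritative source):

# Clone and build TransferBench with the HIP compiler (adjust the compiler path to your ROCm install)
git clone https://github.com/ROCm/TransferBench.git
cd TransferBench
CXX=/opt/rocm/bin/hipcc make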

The following TransferBench command can be used on MI300X systems to measure the achievable bandwidth between any pair of directly connected GPUs, as well as the total all-to-all aggregate bandwidth. The example below evaluates all-to-all bandwidth using fine-grained memory, a GFX unroll factor of 2, and 8 compute units (CUs) per GPU. The results provide an excellent upper-bound estimate of the bandwidth that RCCL can achieve.

# TransferBench all-to-all run
./TransferBench a2a 64M 8
TransferBench v1.60.00 (with NIC support)
===============================================================
[Common]                              (Suppress by setting HIDE_ENV=1)
ALWAYS_VALIDATE      =            0 : Validating after all iterations
BLOCK_BYTES          =          256 : Each CU gets a mulitple of 256 bytes to copy
BYTE_OFFSET          =            0 : Using byte offset of 0
CLOSEST_NIC          =         auto : Per-GPU closest NIC is set as auto
CU_MASK              =            0 : All
FILL_PATTERN         =            0 : Element i = ((i * 517) modulo 383 + 31) * (srcBufferIdx + 1)
GFX_BLOCK_SIZE       =          256 : Threadblock size of 256
GFX_SINGLE_TEAM      =            1 : Combining CUs to work across entire data array
GFX_UNROLL           =            2 : Using GFX unroll factor of 2
GFX_WAVE_ORDER       =            0 : Using GFX wave ordering of Unroll,Wavefront,CU
IP_ADDRESS_FAMILY    =            4 : IP address family is set to IPv4
IB_GID_INDEX         =           -1 : RoCE GID index is set to auto
IB_PORT_NUMBER       =            1 : IB port number is set to 1
MIN_VAR_SUBEXEC      =            1 : Using at least 1 subexecutor(s) for variable subExec tranfers
MAX_VAR_SUBEXEC      =            0 : Using up to all available subexecutors for variable subExec transfers
NIC_RELAX_ORDER      =            1 : Using relaxed ordering for NIC RDMA
NUM_ITERATIONS       =           10 : Running 10  timed iteration(s)
NUM_SUBITERATIONS    =            1 : Running 1 subiterations
NUM_WARMUPS          =            3 : Running 3 warmup iteration(s) per Test
ROCE_VERSION         =            2 : RoCE version is set to 2
SHOW_ITERATIONS      =            0 : Hiding per-iteration timing
USE_HIP_EVENTS       =            1 : Using HIP events for GFX/DMA Executor timing
USE_HSA_DMA          =            0 : Using hipMemcpyAsync for DMA execution
USE_INTERACTIVE      =            0 : Running in non-interactive mode
USE_SINGLE_STREAM    =            1 : Using single stream per GFX device
VALIDATE_DIRECT      =            0 : Validate GPU destination memory via CPU staging buffer
VALIDATE_SOURCE      =            0 : Do not perform source validation after prep

[AllToAll Related]
A2A_DIRECT           =            1 : Only using direct links
A2A_LOCAL            =            0 : Exclude local transfers
A2A_MODE             =            0 : Copy
NUM_GPU_DEVICES      =            8 : Using 8 GPUs
NUM_QUEUE_PAIRS      =            0 : Using 0 queue pairs for NIC transfers
NUM_SUB_EXEC         =            8 : Using 8 subexecutors/CUs per Transfer
USE_DMA_EXEC         =            0 : Using GFX executor
USE_FINE_GRAIN       =            1 : Using fine-grained memory
USE_REMOTE_READ      =            0 : Using SRC as executor

GPU-GFX All-To-All benchmark:
==========================
- Copying 67108864 bytes between directly connected pairs of GPUs using 8 CUs (56 Transfers)
Test 1:
 Executor: GPU 00 |  315.552 GB/s |    1.489 ms |    469762048 bytes | 326.317 GB/s (sum)
     Transfer 00  |   47.056 GB/s |    1.426 ms |     67108864 bytes | F0 -> G000:008 -> F1
     Transfer 01  |   46.826 GB/s |    1.433 ms |     67108864 bytes | F0 -> G000:008 -> F2
     Transfer 02  |   47.126 GB/s |    1.424 ms |     67108864 bytes | F0 -> G000:008 -> F3
     Transfer 03  |   46.432 GB/s |    1.445 ms |     67108864 bytes | F0 -> G000:008 -> F4
     Transfer 04  |   46.883 GB/s |    1.431 ms |     67108864 bytes | F0 -> G000:008 -> F5
     Transfer 05  |   46.784 GB/s |    1.434 ms |     67108864 bytes | F0 -> G000:008 -> F6
     Transfer 06  |   45.210 GB/s |    1.484 ms |     67108864 bytes | F0 -> G000:008 -> F7
 Executor: GPU 01 |  317.663 GB/s |    1.479 ms |    469762048 bytes | 325.040 GB/s (sum)
     Transfer 07  |   47.000 GB/s |    1.428 ms |     67108864 bytes | F1 -> G001:008 -> F0
     Transfer 08  |   46.751 GB/s |    1.435 ms |     67108864 bytes | F1 -> G001:008 -> F2
     Transfer 09  |   46.935 GB/s |    1.430 ms |     67108864 bytes | F1 -> G001:008 -> F3
     Transfer 10  |   46.436 GB/s |    1.445 ms |     67108864 bytes | F1 -> G001:008 -> F4
     Transfer 11  |   45.543 GB/s |    1.474 ms |     67108864 bytes | F1 -> G001:008 -> F5
     Transfer 12  |   46.260 GB/s |    1.451 ms |     67108864 bytes | F1 -> G001:008 -> F6
     Transfer 13  |   46.116 GB/s |    1.455 ms |     67108864 bytes | F1 -> G001:008 -> F7
 Executor: GPU 02 |  317.041 GB/s |    1.482 ms |    469762048 bytes | 325.612 GB/s (sum)
     Transfer 14  |   47.098 GB/s |    1.425 ms |     67108864 bytes | F2 -> G002:008 -> F0
     Transfer 15  |   46.785 GB/s |    1.434 ms |     67108864 bytes | F2 -> G002:008 -> F1
     Transfer 16  |   47.032 GB/s |    1.427 ms |     67108864 bytes | F2 -> G002:008 -> F3
     Transfer 17  |   46.048 GB/s |    1.457 ms |     67108864 bytes | F2 -> G002:008 -> F4
     Transfer 18  |   46.829 GB/s |    1.433 ms |     67108864 bytes | F2 -> G002:008 -> F5
     Transfer 19  |   45.451 GB/s |    1.476 ms |     67108864 bytes | F2 -> G002:008 -> F6
     Transfer 20  |   46.370 GB/s |    1.447 ms |     67108864 bytes | F2 -> G002:008 -> F7
 Executor: GPU 03 |  315.094 GB/s |    1.491 ms |    469762048 bytes | 326.649 GB/s (sum)
     Transfer 21  |   46.931 GB/s |    1.430 ms |     67108864 bytes | F3 -> G003:008 -> F0
     Transfer 22  |   47.009 GB/s |    1.428 ms |     67108864 bytes | F3 -> G003:008 -> F1
     Transfer 23  |   46.941 GB/s |    1.430 ms |     67108864 bytes | F3 -> G003:008 -> F2
     Transfer 24  |   45.142 GB/s |    1.487 ms |     67108864 bytes | F3 -> G003:008 -> F4
     Transfer 25  |   46.718 GB/s |    1.436 ms |     67108864 bytes | F3 -> G003:008 -> F5
     Transfer 26  |   46.982 GB/s |    1.428 ms |     67108864 bytes | F3 -> G003:008 -> F6
     Transfer 27  |   46.925 GB/s |    1.430 ms |     67108864 bytes | F3 -> G003:008 -> F7
 Executor: GPU 04 |  318.348 GB/s |    1.476 ms |    469762048 bytes | 325.427 GB/s (sum)
     Transfer 28  |   46.551 GB/s |    1.442 ms |     67108864 bytes | F4 -> G004:008 -> F0
     Transfer 29  |   46.333 GB/s |    1.448 ms |     67108864 bytes | F4 -> G004:008 -> F1
     Transfer 30  |   46.084 GB/s |    1.456 ms |     67108864 bytes | F4 -> G004:008 -> F2
     Transfer 31  |   45.614 GB/s |    1.471 ms |     67108864 bytes | F4 -> G004:008 -> F3
     Transfer 32  |   46.670 GB/s |    1.438 ms |     67108864 bytes | F4 -> G004:008 -> F5
     Transfer 33  |   47.033 GB/s |    1.427 ms |     67108864 bytes | F4 -> G004:008 -> F6
     Transfer 34  |   47.140 GB/s |    1.424 ms |     67108864 bytes | F4 -> G004:008 -> F7
 Executor: GPU 05 |  317.032 GB/s |    1.482 ms |    469762048 bytes | 326.659 GB/s (sum)
     Transfer 35  |   46.831 GB/s |    1.433 ms |     67108864 bytes | F5 -> G005:008 -> F0
     Transfer 36  |   45.424 GB/s |    1.477 ms |     67108864 bytes | F5 -> G005:008 -> F1
     Transfer 37  |   46.821 GB/s |    1.433 ms |     67108864 bytes | F5 -> G005:008 -> F2
     Transfer 38  |   46.768 GB/s |    1.435 ms |     67108864 bytes | F5 -> G005:008 -> F3
     Transfer 39  |   46.734 GB/s |    1.436 ms |     67108864 bytes | F5 -> G005:008 -> F4
     Transfer 40  |   47.073 GB/s |    1.426 ms |     67108864 bytes | F5 -> G005:008 -> F6
     Transfer 41  |   47.008 GB/s |    1.428 ms |     67108864 bytes | F5 -> G005:008 -> F7
 Executor: GPU 06 |  317.454 GB/s |    1.480 ms |    469762048 bytes | 325.979 GB/s (sum)
     Transfer 42  |   46.844 GB/s |    1.433 ms |     67108864 bytes | F6 -> G006:008 -> F0
     Transfer 43  |   46.278 GB/s |    1.450 ms |     67108864 bytes | F6 -> G006:008 -> F1
     Transfer 44  |   45.483 GB/s |    1.475 ms |     67108864 bytes | F6 -> G006:008 -> F2
     Transfer 45  |   46.713 GB/s |    1.437 ms |     67108864 bytes | F6 -> G006:008 -> F3
     Transfer 46  |   46.936 GB/s |    1.430 ms |     67108864 bytes | F6 -> G006:008 -> F4
     Transfer 47  |   46.862 GB/s |    1.432 ms |     67108864 bytes | F6 -> G006:008 -> F5
     Transfer 48  |   46.864 GB/s |    1.432 ms |     67108864 bytes | F6 -> G006:008 -> F7
 Executor: GPU 07 |  314.759 GB/s |    1.492 ms |    469762048 bytes | 325.207 GB/s (sum)
     Transfer 49  |   45.095 GB/s |    1.488 ms |     67108864 bytes | F7 -> G007:008 -> F0
     Transfer 50  |   46.475 GB/s |    1.444 ms |     67108864 bytes | F7 -> G007:008 -> F1
     Transfer 51  |   46.049 GB/s |    1.457 ms |     67108864 bytes | F7 -> G007:008 -> F2
     Transfer 52  |   46.713 GB/s |    1.437 ms |     67108864 bytes | F7 -> G007:008 -> F3
     Transfer 53  |   46.907 GB/s |    1.431 ms |     67108864 bytes | F7 -> G007:008 -> F4
     Transfer 54  |   47.182 GB/s |    1.422 ms |     67108864 bytes | F7 -> G007:008 -> F5
     Transfer 55  |   46.786 GB/s |    1.434 ms |     67108864 bytes | F7 -> G007:008 -> F6
 Aggregate (CPU)  | 1937.924 GB/s |    1.939 ms |   3758096384 bytes | Overhead: 0.447 ms

Summary: [67108864 bytes per Transfer]
==========================================================
SRC\DST  GPU 00     GPU 01     GPU 02     GPU 03     GPU 04     GPU 05     GPU 06     GPU 07        STotal      Actual
GPU 00      N/A     47.056     46.826     47.126     46.432     46.883     46.784     45.210       326.317     316.469
GPU 01   47.000        N/A     46.751     46.935     46.436     45.543     46.260     46.116       325.040     318.799
GPU 02   47.098     46.785        N/A     47.032     46.048     46.829     45.451     46.370       325.612     318.160
GPU 03   46.931     47.009     46.941        N/A     45.142     46.718     46.982     46.925       326.649     315.994
GPU 04   46.551     46.333     46.084     45.614        N/A     46.670     47.033     47.140       325.427     319.297
GPU 05   46.831     45.424     46.821     46.768     46.734        N/A     47.073     47.008       326.659     317.968
GPU 06   46.844     46.278     45.483     46.713     46.936     46.862        N/A     46.864       325.979     318.378
GPU 07   45.095     46.475     46.049     46.713     46.907     47.182     46.786        N/A       325.207     315.664

RTotal  326.350    325.359    324.955    326.901    324.636    326.686    326.370    325.634      2606.891     315.664     319.297

Average   bandwidth (GPU Timed):   46.552 GB/s
Aggregate bandwidth (GPU Timed): 2606.891 GB/s
Aggregate bandwidth (CPU Timed): 1937.924 GB/s

Based on the above results, we obtain the inter-GPU xGMI link bandwidth (all values in GB/s) shown below.

GPU        0        1        2        3        4        5        6        7      Total     Actual
0        N/A   47.056   46.826   47.126   46.432   46.883   46.784   45.210    326.317    316.469
1     47.000      N/A   46.751   46.935   46.436   45.543   46.260   46.116    325.040    318.799
2     47.098   46.785      N/A   47.032   46.048   46.829   45.451   46.370    325.612    318.160
3     46.931   47.009   46.941      N/A   45.142   46.718   46.982   46.925    326.649    315.994
4     46.551   46.333   46.084   45.614      N/A   46.670   47.033   47.140    325.427    319.297
5     46.831   45.424   46.821   46.768   46.734      N/A   47.073   47.008    326.659    317.968
6     46.844   46.278   45.483   46.713   46.936   46.862      N/A   46.864    325.979    318.378
7     45.095   46.475   46.049   46.713   46.907   47.182   46.786      N/A    325.207    315.664

Let's consider the interaction between GPU 0 and all other GPUs, as represented in the diagram below. The slowest link from GPU 0 is GPU 0 → GPU 7 at 45.21 GB/s, which translates to an actual aggregated bandwidth of roughly 316.47 GB/s (7 × 45.21 GB/s) for a collective operation in an AI training workload, even though the summed total is about 326.317 GB/s. Since the slowest link defines the effective bandwidth in this case, RCCL bandwidth is effectively capped at 310-330 GB/s.

Figure 1: xGMI Bandwidth between GPU 0 and the other GPUs#
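
The arithmetic behind that statement is easy to verify; here is a quick sketch using the GPU 0 row from the table above (plain Python arithmetic, not an AMD tool):

# Effective aggregate bandwidth for GPU 0: summed links vs. slowest link x 7
python3 -c "
links = [47.056, 46.826, 47.126, 46.432, 46.883, 46.784, 45.210]  # GPU 0 -> peers (GB/s)
print('summed total    : %.3f GB/s' % sum(links))                 # ~326.317 GB/s
print('slowest-link cap: %.3f GB/s' % (min(links) * len(links)))  # 7 x 45.210 ~= 316.47 GB/s
"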

Furthermore, you can also use rccl-tests to benchmark bandwidth and latency for various collective operations across different message sizes. The following snippet demonstrates running an AllReduce benchmark for message sizes ranging from 8 bytes to 16 GB, using one process per GPU and a multiplicative factor of 2 between message sizes.

# AllReduce RCCL test
mpirun -np 8 --bind-to numa -env NCCL_DEBUG=VERSION rccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 17179869184 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:448c4c7
# Using devices
#   Rank  0 Pid 2509534 on host device  0 [0000:0c:00.0] AMD Instinct MI300X
#   Rank  1 Pid 2509535 on host device  1 [0000:22:00.0] AMD Instinct MI300X
#   Rank  2 Pid 2509536 on host device  2 [0000:38:00.0] AMD Instinct MI300X
#   Rank  3 Pid 2509537 on host device  3 [0000:5c:00.0] AMD Instinct MI300X
#   Rank  4 Pid 2509538 on host device  4 [0000:9f:00.0] AMD Instinct MI300X
#   Rank  5 Pid 2509539 on host device  5 [0000:af:00.0] AMD Instinct MI300X
#   Rank  6 Pid 2509540 on host device  6 [0000:bf:00.0] AMD Instinct MI300X
#   Rank  7 Pid 2509541 on host device  7 [0000:df:00.0] AMD Instinct MI300X
RCCL version : 2.22.3-develop:f5b15f2
HIP version  : 6.3.42131-fa1d09cbd
ROCm version : 6.3.0.0-39-ce2be3b
Hostname     : host
Librccl path : /home/User/all2all/doc/rccl/build/release/librccl.so.1
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    16.61    0.00    0.00      0    16.53    0.00    0.00      0
          16             4     float     sum      -1    16.12    0.00    0.00      0    15.94    0.00    0.00      0
          32             8     float     sum      -1    16.56    0.00    0.00      0    16.68    0.00    0.00      0
          64            16     float     sum      -1    17.94    0.00    0.01      0    18.36    0.00    0.01      0
         128            32     float     sum      -1    22.94    0.01    0.01      0    22.78    0.01    0.01      0
         256            64     float     sum      -1    22.93    0.01    0.02      0    22.85    0.01    0.02      0
         512           128     float     sum      -1    22.91    0.02    0.04      0    23.20    0.02    0.04      0
        1024           256     float     sum      -1    13.11    0.08    0.14      0    12.79    0.08    0.14      0
        2048           512     float     sum      -1    13.09    0.16    0.27      0    12.96    0.16    0.28      0
        4096          1024     float     sum      -1    13.27    0.31    0.54      0    12.71    0.32    0.56      0
        8192          2048     float     sum      -1    13.30    0.62    1.08      0    12.89    0.64    1.11      0
       16384          4096     float     sum      -1    13.54    1.21    2.12      0    13.01    1.26    2.20      0
       32768          8192     float     sum      -1    13.31    2.46    4.31      0    13.10    2.50    4.38      0
       65536         16384     float     sum      -1    13.77    4.76    8.33      0    13.38    4.90    8.57      0
      131072         32768     float     sum      -1    14.14    9.27   16.22      0    13.77    9.52   16.66      0
      262144         65536     float     sum      -1    16.87   15.54   27.20      0    16.55   15.84   27.72      0
      524288        131072     float     sum      -1    24.41   21.48   37.59      0    23.88   21.95   38.42      0
     1048576        262144     float     sum      -1    27.42   38.25   66.93      0    27.13   38.65   67.64      0
     2097152        524288     float     sum      -1    34.81   60.25  105.43      0    34.84   60.20  105.35      0
     4194304       1048576     float     sum      -1    51.72   81.09  141.91      0    129.0   32.52   56.92      0
     8388608       2097152     float     sum      -1    83.68  100.24  175.43      0    86.67   96.78  169.37      0
    16777216       4194304     float     sum      -1    148.8  112.72  197.26      0    155.6  107.81  188.66      0
    33554432       8388608     float     sum      -1    244.1  137.47  240.57      0    256.0  131.07  229.38      0
    67108864      16777216     float     sum      -1    429.1  156.38  273.66      0    440.6  152.30  266.52      0
   134217728      33554432     float     sum      -1    797.4  168.32  294.55      0    805.8  166.56  291.48      0
   268435456      67108864     float     sum      -1   1537.1  174.64  305.62      0   1546.0  173.63  303.85      0
   536870912     134217728     float     sum      -1   3026.2  177.41  310.46      0   3034.2  176.94  309.64      0
  1073741824     268435456     float     sum      -1   5970.6  179.84  314.72      0   5974.5  179.72  314.51      0
  2147483648     536870912     float     sum      -1    11891  180.59  316.04      0    11895  180.54  315.94      0
  4294967296    1073741824     float     sum      -1    23680  181.37  317.40      0    23769  180.69  316.21      0
  8589934592    2147483648     float     sum      -1    47278  181.69  317.96      0    47323  181.52  317.66      0
17179869184    4294967296     float     sum      -1    94254  182.27  318.98      0    94328  182.13  318.73      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 116.669

You can further see that running with graph mode enabled (the -G 1 flag) improves latency at small message sizes, where the collective operations are latency-bound.

mpirun -np 8 --bind-to numa -env NCCL_DEBUG=VERSION rccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1 -G 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 17179869184 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:448c4c7
# Using devices
#   Rank  0 Pid 2510387 on host device  0 [0000:0c:00.0] AMD Instinct MI300X
#   Rank  1 Pid 2510388 on host device  1 [0000:22:00.0] AMD Instinct MI300X
#   Rank  2 Pid 2510389 on host device  2 [0000:38:00.0] AMD Instinct MI300X
#   Rank  3 Pid 2510390 on host device  3 [0000:5c:00.0] AMD Instinct MI300X
#   Rank  4 Pid 2510391 on host device  4 [0000:9f:00.0] AMD Instinct MI300X
#   Rank  5 Pid 2510392 on host device  5 [0000:af:00.0] AMD Instinct MI300X
#   Rank  6 Pid 2510393 on host device  6 [0000:bf:00.0] AMD Instinct MI300X
#   Rank  7 Pid 2510394 on host device  7 [0000:df:00.0] AMD Instinct MI300X
RCCL version : 2.22.3-develop:f5b15f2
HIP version  : 6.3.42131-fa1d09cbd
ROCm version : 6.3.0.0-39-ce2be3b
Hostname     : host
Librccl path : /home/User/all2all/doc/rccl/build/release/librccl.so.1
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    16.38    0.00    0.00      0    16.29    0.00    0.00      0
          16             4     float     sum      -1    15.75    0.00    0.00      0    15.79    0.00    0.00      0
          32             8     float     sum      -1     6.32    0.01    0.01      0     6.17    0.01    0.01      0
          64            16     float     sum      -1     6.00    0.01    0.02      0     6.51    0.01    0.02      0
         128            32     float     sum      -1     6.53    0.02    0.03      0     6.20    0.02    0.04      0
         256            64     float     sum      -1     6.77    0.04    0.07      0     7.18    0.04    0.06      0
         512           128     float     sum      -1     7.44    0.07    0.12      0     7.84    0.07    0.11      0
        1024           256     float     sum      -1     7.47    0.14    0.24      0     7.76    0.13    0.23      0
        2048           512     float     sum      -1     7.46    0.27    0.48      0     7.54    0.27    0.48      0
        4096          1024     float     sum      -1     7.61    0.54    0.94      0     7.61    0.54    0.94      0
        8192          2048     float     sum      -1     7.64    1.07    1.88      0     8.10    1.01    1.77      0
       16384          4096     float     sum      -1     8.71    1.88    3.29      0     8.10    2.02    3.54      0
       32768          8192     float     sum      -1     9.05    3.62    6.33      0     8.79    3.73    6.52      0
       65536         16384     float     sum      -1     9.14    7.17   12.54      0     9.14    7.17   12.54      0
      131072         32768     float     sum      -1     9.51   13.78   24.12      0     9.53   13.75   24.06      0
      262144         65536     float     sum      -1    10.26   25.54   44.69      0    10.32   25.39   44.43      0
      524288        131072     float     sum      -1    14.26   36.75   64.32      0    14.64   35.81   62.66      0
     1048576        262144     float     sum      -1    19.40   54.06   94.61      0    19.72   53.18   93.06      0
     2097152        524288     float     sum      -1    35.03   59.86  104.76      0    34.83   60.20  105.35      0
     4194304       1048576     float     sum      -1    51.83   80.93  141.62      0    52.99   79.15  138.52      0
     8388608       2097152     float     sum      -1    83.74  100.17  175.30      0    86.66   96.80  169.41      0
    16777216       4194304     float     sum      -1    149.1  112.55  196.96      0    156.3  107.32  187.81      0
    33554432       8388608     float     sum      -1    243.4  137.85  241.23      0    255.5  131.33  229.82      0
    67108864      16777216     float     sum      -1    427.9  156.83  274.45      0    439.1  152.83  267.46      0
   134217728      33554432     float     sum      -1    795.3  168.77  295.35      0    805.9  166.54  291.44      0
   268435456      67108864     float     sum      -1   1534.4  174.94  306.15      0   1546.7  173.55  303.72      0
   536870912     134217728     float     sum      -1   3024.0  177.54  310.69      0   3034.5  176.92  309.62      0
  1073741824     268435456     float     sum      -1   5961.9  180.10  315.18      0   5974.7  179.71  314.50      0
  2147483648     536870912     float     sum      -1    11880  180.77  316.34      0    11897  180.50  315.88      0
  4294967296    1073741824     float     sum      -1    23658  181.55  317.71      0    23681  181.37  317.40      0
  8589934592    2147483648     float     sum      -1    47244  181.82  318.18      0    47287  181.65  317.89      0
17179869184    4294967296     float     sum      -1    94182  182.41  319.22      0    94540  181.72  318.01      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 120.69

Best Practices to Maximize RCCL Efficiency#

  • Utilize all GPUs within a node

    • The MI300X architecture features dedicated links between GPUs, forming a fully connected topology.

    • For collective operations, the highest performance is achieved when all 8 GPUs are used, ensuring that all inter-GPU links are active.

    • When performing collective operations with only 2 or 4 GPUs, only a fraction of the available bandwidth on the node is utilized. If your workload limits you to fewer than 8 MI300X GPUs on a system, you can use the runtime variable NCCL_MIN_NCHANNELS to increase the number of channels (CUs) used, thereby improving RCCL performance (see the concrete example after this list).

      • Use the following environment variable to increase the number of channels used by RCCL in single-node E2E workloads, which can potentially improve performance: export NCCL_MIN_NCHANNELS=112

  • Disable NUMA auto-balancing

    • NUMA (Non-Uniform Memory Access) auto-balancing is designed to improve memory access efficiency on multi-socket systems by dynamically moving memory pages between NUMA nodes to balance memory load across the available processors. While this is beneficial for some workloads, it can introduce performance overhead in GPU-focused collective operations. Disabling NUMA auto-balancing is required to reduce performance variability and achieve better results.

    • How to check if NUMA auto-balancing is disabled?

      • Run the following command to check its status

        cat /proc/sys/kernel/numa_balancing
        
        • If the output is 0, NUMA auto-balancing is already disabled.

        • If the output is 1, NUMA auto-balancing is enabled and should be turned off.

      • To disable it, run the following command.

        sudo sysctl kernel.numa_balancing=0
        
      • Note: Please make sure daemons like numad are also disabled. (A sketch for persisting these settings across reboots follows after this list.)

  • Use a Single Process per GPU

    • RCCL delivers the best performance for collectives when it is configured in one-process-per-GPU mode. With a one-process-per-multiple-GPUs configuration, you can run into kernel-launch latency issues because ROCm serializes kernel launches to multiple GPUs from a single process, which hurts performance. The examples below are specific to rccl-tests only.

      • How to enable Single Process per GPU mode:

      mpirun -np #number_of_GPUs ./build/all_reduce_perf -b minbytes -e maxbytes -f multiplier -g 1
      
      • If one-process-per-GPU mode is not feasible, a single-process, multi-threaded approach provides better performance than a single-threaded, single-process approach:

      mpirun -np #number_of_GPUs ./build/all_reduce_perf -b minbytes -e maxbytes -f multiplier -g 1 -t threads_per_process
      
      • Enabling graph mode can further improve performance by reducing kernel-launch overhead:

      mpirun -np #number_of_GPUs ./build/all_reduce_perf -b minbytes -e maxbytes -f multiplier -g gpus_per_thread -t threads_per_process -G number_of_kernel_graph_launch
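
Putting these items together, a concrete invocation might look like the following. This is a sketch based on the rccl-tests command used earlier in this blog; the 4-GPU case, the channel count of 112, and the message-size range are illustrative and should be tuned for your workload:

# Hypothetical example: AllReduce on a 4-GPU subset with extra RCCL channels, one process per GPU, graph mode enabled
mpirun -np 4 --bind-to numa -env NCCL_MIN_NCHANNELS=112 rccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1 -G 1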
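
For the NUMA item, note that the sysctl change shown above does not survive a reboot. One common way to persist it and to check for the numad daemon (a sketch assuming a systemd-based distribution with a sysctl.d directory) is:

# Persist kernel.numa_balancing=0 across reboots
echo 'kernel.numa_balancing = 0' | sudo tee /etc/sysctl.d/99-disable-numa-balancing.conf
sudo sysctl --system                   # reload settings from all sysctl configuration files
# Check whether numad is running, and disable it if present
systemctl is-active numad
# sudo systemctl disable --now numad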
      


Summary#

In this blog we discussed xGMI bandwidth and RCCL performance on the MI300X, giving key insights into inter-GPU communication efficiency and highlighting both theoretical expectations and real-world performance considerations. While xGMI provides a high-bandwidth, low-latency interconnect, practical limitations such as protocol overhead, bit correction, and slowest-link constraints impact the overall bandwidth utilization in AI and HPC workloads.

We also walked through benchmarking with TransferBench and rccl-tests to understand xGMI bandwidth and RCCL performance. We observed that the realized aggregate unidirectional bandwidth per GPU reaches about 336 GB/s, while practical collective operations yield values between 310 and 330 GB/s, largely dictated by the slowest link in the communication topology. By applying the configurations described in the best practices above, you can further improve bandwidth utilization and maximize efficiency. These optimizations help developers fully harness the MI300X platform, driving faster AI training and high-performance computing workloads with greater scalability and efficiency.