Understanding RCCL Bandwidth and xGMI Performance on AMD Instinct™ MI300X#

Efficient inter-GPU communication is the backbone of high-performance AI and HPC workloads, where technologies like RCCL and xGMI play pivotal roles. However, the gap between theoretical peak bandwidth and what is achievable in practice has raised questions about performance bottlenecks. In this blog we explain why multi-GPU systems do not reach the theoretical maximum bandwidth, and walk you through diagnostics and performance-tuning strategies that help you optimize RCCL and xGMI bandwidth on AMD MI300X systems. We first introduce xGMI and its performance constraints, then RCCL and its bandwidth limitations, and finally cover practical benchmarks and best practices for maximizing RCCL efficiency.
xGMI Overview and Performance Expectations#
What is xGMI?#
xGMI (External Global Memory Interconnect) is AMD’s high-speed GPU-to-GPU interconnect based on Infinity Fabric™ technology. It is designed to enable efficient peer-to-peer GPU communication for multi-GPU AI and HPC workloads. This interconnect is crucial for ensuring that large-scale AI models and distributed HPC simulations run efficiently across multiple GPUs. Each GPU in an AMD Instinct MI300X system is connected to its seven peer GPUs via xGMI links, forming a fully connected mesh with high-bandwidth, low-latency communication. For more information about the MI300X platform, refer to this datasheet.
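You can verify this fully connected topology on your own system with ROCm’s rocm-smi utility. The sketch below assumes rocm-smi is installed with your ROCm stack; it prints the link type between every GPU pair, which should report XGMI for all peers on an MI300X node.
# Show the inter-GPU link types and topology (all GPU pairs should report XGMI)
rocm-smi --showtopo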
Theoretical Performance Claims vs. Achievable#
xGMI’s raw bandwidth is rated up to 64 GB/s for each point-to-point link between GPUs. In an ideal situation, this means that each GPU in a system can communicate with every other GPU at this peak rate. However, in reality, this bandwidth is constrained by factors like CRC bit correction and protocol overhead, which reduce the usable bandwidth to approximately 48 GB/s per link. This represents about 75% of the theoretical peak bandwidth, which is a critical factor when evaluating performance in practical workloads.
Specification | Theoretical Max | Realized Performance
---|---|---
Aggregated unidirectional bandwidth per GPU | 448 GB/s (7 × 64 GB/s) | 315 - 336 GB/s (7 × 45 - 48 GB/s)
GPU-GPU unidirectional bandwidth via xGMI link | 64 GB/s | 45 - 48 GB/s
Why Does RCCL Bandwidth Hover Between 310-330 GB/s in the Best-Case Scenario?#
Although the achievable aggregated unidirectional bandwidth per GPU is about 336 GB/s, in practical AI workloads the slowest link limits the total usable bandwidth.
Example: Let’s test the xGMI link performance between GPUs in a single-node MI300X machine. For this experiment we will use TransferBench, a benchmarking utility designed to measure the performance of simultaneous data transfers between user-specified devices, such as CPUs and GPUs. Here, we use TransferBench to measure the inter-GPU xGMI scale-up link bandwidth. Refer to these steps to install and build TransferBench.
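If you are building TransferBench from source, a minimal build sketch looks like the following, assuming a working ROCm installation and the upstream ROCm/TransferBench repository; consult the project README for the exact, up-to-date steps (for example, it may require pointing CMake at the hipcc compiler).
# Clone and build TransferBench (sketch; see the TransferBench README for details)
git clone https://github.com/ROCm/TransferBench.git
cd TransferBench
mkdir build && cd build
cmake .. && make -j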
The following TransferBench command can be used on MI300X systems to measure the achievable bandwidth between any pair of directly connected GPUs, as well as the total all-to-all aggregate bandwidth. The example below evaluates all-to-all bandwidth using fine-grained memory, an unroll factor of 2, and 8 Compute Units (CUs) per GPU. The results provide an excellent upper-bound estimate of the bandwidth that RCCL can achieve.
# TransferBench all-to-all run
./TransferBench a2a 64M 8
TransferBench v1.60.00 (with NIC support)
===============================================================
[Common] (Suppress by setting HIDE_ENV=1)
ALWAYS_VALIDATE = 0 : Validating after all iterations
BLOCK_BYTES = 256 : Each CU gets a mulitple of 256 bytes to copy
BYTE_OFFSET = 0 : Using byte offset of 0
CLOSEST_NIC = auto : Per-GPU closest NIC is set as auto
CU_MASK = 0 : All
FILL_PATTERN = 0 : Element i = ((i * 517) modulo 383 + 31) * (srcBufferIdx + 1)
GFX_BLOCK_SIZE = 256 : Threadblock size of 256
GFX_SINGLE_TEAM = 1 : Combining CUs to work across entire data array
GFX_UNROLL = 2 : Using GFX unroll factor of 2
GFX_WAVE_ORDER = 0 : Using GFX wave ordering of Unroll,Wavefront,CU
IP_ADDRESS_FAMILY = 4 : IP address family is set to IPv4
IB_GID_INDEX = -1 : RoCE GID index is set to auto
IB_PORT_NUMBER = 1 : IB port number is set to 1
MIN_VAR_SUBEXEC = 1 : Using at least 1 subexecutor(s) for variable subExec tranfers
MAX_VAR_SUBEXEC = 0 : Using up to all available subexecutors for variable subExec transfers
NIC_RELAX_ORDER = 1 : Using relaxed ordering for NIC RDMA
NUM_ITERATIONS = 10 : Running 10 timed iteration(s)
NUM_SUBITERATIONS = 1 : Running 1 subiterations
NUM_WARMUPS = 3 : Running 3 warmup iteration(s) per Test
ROCE_VERSION = 2 : RoCE version is set to 2
SHOW_ITERATIONS = 0 : Hiding per-iteration timing
USE_HIP_EVENTS = 1 : Using HIP events for GFX/DMA Executor timing
USE_HSA_DMA = 0 : Using hipMemcpyAsync for DMA execution
USE_INTERACTIVE = 0 : Running in non-interactive mode
USE_SINGLE_STREAM = 1 : Using single stream per GFX device
VALIDATE_DIRECT = 0 : Validate GPU destination memory via CPU staging buffer
VALIDATE_SOURCE = 0 : Do not perform source validation after prep
[AllToAll Related]
A2A_DIRECT = 1 : Only using direct links
A2A_LOCAL = 0 : Exclude local transfers
A2A_MODE = 0 : Copy
NUM_GPU_DEVICES = 8 : Using 8 GPUs
NUM_QUEUE_PAIRS = 0 : Using 0 queue pairs for NIC transfers
NUM_SUB_EXEC = 8 : Using 8 subexecutors/CUs per Transfer
USE_DMA_EXEC = 0 : Using GFX executor
USE_FINE_GRAIN = 1 : Using fine-grained memory
USE_REMOTE_READ = 0 : Using SRC as executor
GPU-GFX All-To-All benchmark:
==========================
- Copying 67108864 bytes between directly connected pairs of GPUs using 8 CUs (56 Transfers)
Test 1:
Executor: GPU 00 | 315.552 GB/s | 1.489 ms | 469762048 bytes | 326.317 GB/s (sum)
Transfer 00 | 47.056 GB/s | 1.426 ms | 67108864 bytes | F0 -> G000:008 -> F1
Transfer 01 | 46.826 GB/s | 1.433 ms | 67108864 bytes | F0 -> G000:008 -> F2
Transfer 02 | 47.126 GB/s | 1.424 ms | 67108864 bytes | F0 -> G000:008 -> F3
Transfer 03 | 46.432 GB/s | 1.445 ms | 67108864 bytes | F0 -> G000:008 -> F4
Transfer 04 | 46.883 GB/s | 1.431 ms | 67108864 bytes | F0 -> G000:008 -> F5
Transfer 05 | 46.784 GB/s | 1.434 ms | 67108864 bytes | F0 -> G000:008 -> F6
Transfer 06 | 45.210 GB/s | 1.484 ms | 67108864 bytes | F0 -> G000:008 -> F7
Executor: GPU 01 | 317.663 GB/s | 1.479 ms | 469762048 bytes | 325.040 GB/s (sum)
Transfer 07 | 47.000 GB/s | 1.428 ms | 67108864 bytes | F1 -> G001:008 -> F0
Transfer 08 | 46.751 GB/s | 1.435 ms | 67108864 bytes | F1 -> G001:008 -> F2
Transfer 09 | 46.935 GB/s | 1.430 ms | 67108864 bytes | F1 -> G001:008 -> F3
Transfer 10 | 46.436 GB/s | 1.445 ms | 67108864 bytes | F1 -> G001:008 -> F4
Transfer 11 | 45.543 GB/s | 1.474 ms | 67108864 bytes | F1 -> G001:008 -> F5
Transfer 12 | 46.260 GB/s | 1.451 ms | 67108864 bytes | F1 -> G001:008 -> F6
Transfer 13 | 46.116 GB/s | 1.455 ms | 67108864 bytes | F1 -> G001:008 -> F7
Executor: GPU 02 | 317.041 GB/s | 1.482 ms | 469762048 bytes | 325.612 GB/s (sum)
Transfer 14 | 47.098 GB/s | 1.425 ms | 67108864 bytes | F2 -> G002:008 -> F0
Transfer 15 | 46.785 GB/s | 1.434 ms | 67108864 bytes | F2 -> G002:008 -> F1
Transfer 16 | 47.032 GB/s | 1.427 ms | 67108864 bytes | F2 -> G002:008 -> F3
Transfer 17 | 46.048 GB/s | 1.457 ms | 67108864 bytes | F2 -> G002:008 -> F4
Transfer 18 | 46.829 GB/s | 1.433 ms | 67108864 bytes | F2 -> G002:008 -> F5
Transfer 19 | 45.451 GB/s | 1.476 ms | 67108864 bytes | F2 -> G002:008 -> F6
Transfer 20 | 46.370 GB/s | 1.447 ms | 67108864 bytes | F2 -> G002:008 -> F7
Executor: GPU 03 | 315.094 GB/s | 1.491 ms | 469762048 bytes | 326.649 GB/s (sum)
Transfer 21 | 46.931 GB/s | 1.430 ms | 67108864 bytes | F3 -> G003:008 -> F0
Transfer 22 | 47.009 GB/s | 1.428 ms | 67108864 bytes | F3 -> G003:008 -> F1
Transfer 23 | 46.941 GB/s | 1.430 ms | 67108864 bytes | F3 -> G003:008 -> F2
Transfer 24 | 45.142 GB/s | 1.487 ms | 67108864 bytes | F3 -> G003:008 -> F4
Transfer 25 | 46.718 GB/s | 1.436 ms | 67108864 bytes | F3 -> G003:008 -> F5
Transfer 26 | 46.982 GB/s | 1.428 ms | 67108864 bytes | F3 -> G003:008 -> F6
Transfer 27 | 46.925 GB/s | 1.430 ms | 67108864 bytes | F3 -> G003:008 -> F7
Executor: GPU 04 | 318.348 GB/s | 1.476 ms | 469762048 bytes | 325.427 GB/s (sum)
Transfer 28 | 46.551 GB/s | 1.442 ms | 67108864 bytes | F4 -> G004:008 -> F0
Transfer 29 | 46.333 GB/s | 1.448 ms | 67108864 bytes | F4 -> G004:008 -> F1
Transfer 30 | 46.084 GB/s | 1.456 ms | 67108864 bytes | F4 -> G004:008 -> F2
Transfer 31 | 45.614 GB/s | 1.471 ms | 67108864 bytes | F4 -> G004:008 -> F3
Transfer 32 | 46.670 GB/s | 1.438 ms | 67108864 bytes | F4 -> G004:008 -> F5
Transfer 33 | 47.033 GB/s | 1.427 ms | 67108864 bytes | F4 -> G004:008 -> F6
Transfer 34 | 47.140 GB/s | 1.424 ms | 67108864 bytes | F4 -> G004:008 -> F7
Executor: GPU 05 | 317.032 GB/s | 1.482 ms | 469762048 bytes | 326.659 GB/s (sum)
Transfer 35 | 46.831 GB/s | 1.433 ms | 67108864 bytes | F5 -> G005:008 -> F0
Transfer 36 | 45.424 GB/s | 1.477 ms | 67108864 bytes | F5 -> G005:008 -> F1
Transfer 37 | 46.821 GB/s | 1.433 ms | 67108864 bytes | F5 -> G005:008 -> F2
Transfer 38 | 46.768 GB/s | 1.435 ms | 67108864 bytes | F5 -> G005:008 -> F3
Transfer 39 | 46.734 GB/s | 1.436 ms | 67108864 bytes | F5 -> G005:008 -> F4
Transfer 40 | 47.073 GB/s | 1.426 ms | 67108864 bytes | F5 -> G005:008 -> F6
Transfer 41 | 47.008 GB/s | 1.428 ms | 67108864 bytes | F5 -> G005:008 -> F7
Executor: GPU 06 | 317.454 GB/s | 1.480 ms | 469762048 bytes | 325.979 GB/s (sum)
Transfer 42 | 46.844 GB/s | 1.433 ms | 67108864 bytes | F6 -> G006:008 -> F0
Transfer 43 | 46.278 GB/s | 1.450 ms | 67108864 bytes | F6 -> G006:008 -> F1
Transfer 44 | 45.483 GB/s | 1.475 ms | 67108864 bytes | F6 -> G006:008 -> F2
Transfer 45 | 46.713 GB/s | 1.437 ms | 67108864 bytes | F6 -> G006:008 -> F3
Transfer 46 | 46.936 GB/s | 1.430 ms | 67108864 bytes | F6 -> G006:008 -> F4
Transfer 47 | 46.862 GB/s | 1.432 ms | 67108864 bytes | F6 -> G006:008 -> F5
Transfer 48 | 46.864 GB/s | 1.432 ms | 67108864 bytes | F6 -> G006:008 -> F7
Executor: GPU 07 | 314.759 GB/s | 1.492 ms | 469762048 bytes | 325.207 GB/s (sum)
Transfer 49 | 45.095 GB/s | 1.488 ms | 67108864 bytes | F7 -> G007:008 -> F0
Transfer 50 | 46.475 GB/s | 1.444 ms | 67108864 bytes | F7 -> G007:008 -> F1
Transfer 51 | 46.049 GB/s | 1.457 ms | 67108864 bytes | F7 -> G007:008 -> F2
Transfer 52 | 46.713 GB/s | 1.437 ms | 67108864 bytes | F7 -> G007:008 -> F3
Transfer 53 | 46.907 GB/s | 1.431 ms | 67108864 bytes | F7 -> G007:008 -> F4
Transfer 54 | 47.182 GB/s | 1.422 ms | 67108864 bytes | F7 -> G007:008 -> F5
Transfer 55 | 46.786 GB/s | 1.434 ms | 67108864 bytes | F7 -> G007:008 -> F6
Aggregate (CPU) | 1937.924 GB/s | 1.939 ms | 3758096384 bytes | Overhead: 0.447 ms
Summary: [67108864 bytes per Transfer]
==========================================================
SRC\DST GPU 00 GPU 01 GPU 02 GPU 03 GPU 04 GPU 05 GPU 06 GPU 07 STotal Actual
GPU 00 N/A 47.056 46.826 47.126 46.432 46.883 46.784 45.210 326.317 316.469
GPU 01 47.000 N/A 46.751 46.935 46.436 45.543 46.260 46.116 325.040 318.799
GPU 02 47.098 46.785 N/A 47.032 46.048 46.829 45.451 46.370 325.612 318.160
GPU 03 46.931 47.009 46.941 N/A 45.142 46.718 46.982 46.925 326.649 315.994
GPU 04 46.551 46.333 46.084 45.614 N/A 46.670 47.033 47.140 325.427 319.297
GPU 05 46.831 45.424 46.821 46.768 46.734 N/A 47.073 47.008 326.659 317.968
GPU 06 46.844 46.278 45.483 46.713 46.936 46.862 N/A 46.864 325.979 318.378
GPU 07 45.095 46.475 46.049 46.713 46.907 47.182 46.786 N/A 325.207 315.664
RTotal 326.350 325.359 324.955 326.901 324.636 326.686 326.370 325.634 2606.891 315.664 319.297
Average bandwidth (GPU Timed): 46.552 GB/s
Aggregate bandwidth (GPU Timed): 2606.891 GB/s
Aggregate bandwidth (CPU Timed): 1937.924 GB/s
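The per-transfer lines in this output can also be post-processed to find the slowest link directly. A small sketch, assuming the benchmark output is saved to a file named a2a.log (a name chosen here for illustration):
# Save the TransferBench output, then extract the slowest per-link bandwidth (GB/s)
./TransferBench a2a 64M 8 | tee a2a.log
grep ' Transfer ' a2a.log | awk '{print $4}' | sort -n | head -1   # prints 45.095 for the run above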
Based on the above results, we obtain the inter-GPU xGMI link bandwidth (in GB/s) shown below:
SRC \ DST | GPU 0 | GPU 1 | GPU 2 | GPU 3 | GPU 4 | GPU 5 | GPU 6 | GPU 7 | Total | Actual
---|---|---|---|---|---|---|---|---|---|---
GPU 0 | N/A | 47.056 | 46.826 | 47.126 | 46.432 | 46.883 | 46.784 | 45.210 | 326.317 | 316.469
GPU 1 | 47.000 | N/A | 46.751 | 46.935 | 46.436 | 45.543 | 46.260 | 46.116 | 325.040 | 318.799
GPU 2 | 47.098 | 46.785 | N/A | 47.032 | 46.048 | 46.829 | 45.451 | 46.370 | 325.612 | 318.160
GPU 3 | 46.931 | 47.009 | 46.941 | N/A | 45.142 | 46.718 | 46.982 | 46.925 | 326.649 | 315.994
GPU 4 | 46.551 | 46.333 | 46.084 | 45.614 | N/A | 46.670 | 47.033 | 47.140 | 325.427 | 319.297
GPU 5 | 46.831 | 45.424 | 46.821 | 46.768 | 46.734 | N/A | 47.073 | 47.008 | 326.659 | 317.968
GPU 6 | 46.844 | 46.278 | 45.483 | 46.713 | 46.936 | 46.862 | N/A | 46.864 | 325.979 | 318.378
GPU 7 | 45.095 | 46.475 | 46.049 | 46.713 | 46.907 | 47.182 | 46.786 | N/A | 325.207 | 315.664
Let’s consider the interaction between GPU 0 and all other GPUs, as represented in the diagram below. The slowest link from GPU 0 is GPU 0 → GPU 7 at 45.21 GB/s. For a collective operation in an AI training workload, this translates to an actual aggregated bandwidth of ~316.47 GB/s (7 × 45.21 GB/s), even though the sum of GPU 0’s individual link bandwidths is ~326.32 GB/s. Because the slowest link determines the effective bandwidth, RCCL bandwidth is effectively capped at 310-330 GB/s.

Figure 1: xGMI bandwidth between GPU 0 and the other GPUs#
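The same calculation can be reproduced directly from the bandwidth matrix above. A minimal sketch, with GPU 0’s per-link values from the table hard-coded for illustration:
# Effective aggregated bandwidth for GPU 0 = 7 x slowest link (values taken from the table above)
echo "47.056 46.826 47.126 46.432 46.883 46.784 45.210" |
  awk '{min=$1; sum=0; for (i=1; i<=NF; i++) {sum+=$i; if ($i<min) min=$i}; printf "sum of links: %.2f GB/s, 7 x slowest link: %.2f GB/s\n", sum, 7*min}'
This prints a link sum of ~326.32 GB/s but an effective aggregate of ~316.47 GB/s, matching the Actual column reported by TransferBench.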
Furthermore, you can use rccl-tests to benchmark bandwidth and latency for various collective operations across different message sizes. The following snippet demonstrates running an AllReduce benchmark for message sizes ranging from 8 bytes to 16 GB, using one process per GPU, with a multiplicative factor of 2 between message sizes.
#AllReduce RCCL Test
mpirun -np 8 --bind-to numa -env NCCL_DEBUG=VERSION rccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 17179869184 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:448c4c7
# Using devices
# Rank 0 Pid 2509534 on host device 0 [0000:0c:00.0] AMD Instinct MI300X
# Rank 1 Pid 2509535 on host device 1 [0000:22:00.0] AMD Instinct MI300X
# Rank 2 Pid 2509536 on host device 2 [0000:38:00.0] AMD Instinct MI300X
# Rank 3 Pid 2509537 on host device 3 [0000:5c:00.0] AMD Instinct MI300X
# Rank 4 Pid 2509538 on host device 4 [0000:9f:00.0] AMD Instinct MI300X
# Rank 5 Pid 2509539 on host device 5 [0000:af:00.0] AMD Instinct MI300X
# Rank 6 Pid 2509540 on host device 6 [0000:bf:00.0] AMD Instinct MI300X
# Rank 7 Pid 2509541 on host device 7 [0000:df:00.0] AMD Instinct MI300X
RCCL version : 2.22.3-develop:f5b15f2
HIP version : 6.3.42131-fa1d09cbd
ROCm version : 6.3.0.0-39-ce2be3b
Hostname : host
Librccl path : /home/User/all2all/doc/rccl/build/release/librccl.so.1
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 16.61 0.00 0.00 0 16.53 0.00 0.00 0
16 4 float sum -1 16.12 0.00 0.00 0 15.94 0.00 0.00 0
32 8 float sum -1 16.56 0.00 0.00 0 16.68 0.00 0.00 0
64 16 float sum -1 17.94 0.00 0.01 0 18.36 0.00 0.01 0
128 32 float sum -1 22.94 0.01 0.01 0 22.78 0.01 0.01 0
256 64 float sum -1 22.93 0.01 0.02 0 22.85 0.01 0.02 0
512 128 float sum -1 22.91 0.02 0.04 0 23.20 0.02 0.04 0
1024 256 float sum -1 13.11 0.08 0.14 0 12.79 0.08 0.14 0
2048 512 float sum -1 13.09 0.16 0.27 0 12.96 0.16 0.28 0
4096 1024 float sum -1 13.27 0.31 0.54 0 12.71 0.32 0.56 0
8192 2048 float sum -1 13.30 0.62 1.08 0 12.89 0.64 1.11 0
16384 4096 float sum -1 13.54 1.21 2.12 0 13.01 1.26 2.20 0
32768 8192 float sum -1 13.31 2.46 4.31 0 13.10 2.50 4.38 0
65536 16384 float sum -1 13.77 4.76 8.33 0 13.38 4.90 8.57 0
131072 32768 float sum -1 14.14 9.27 16.22 0 13.77 9.52 16.66 0
262144 65536 float sum -1 16.87 15.54 27.20 0 16.55 15.84 27.72 0
524288 131072 float sum -1 24.41 21.48 37.59 0 23.88 21.95 38.42 0
1048576 262144 float sum -1 27.42 38.25 66.93 0 27.13 38.65 67.64 0
2097152 524288 float sum -1 34.81 60.25 105.43 0 34.84 60.20 105.35 0
4194304 1048576 float sum -1 51.72 81.09 141.91 0 129.0 32.52 56.92 0
8388608 2097152 float sum -1 83.68 100.24 175.43 0 86.67 96.78 169.37 0
16777216 4194304 float sum -1 148.8 112.72 197.26 0 155.6 107.81 188.66 0
33554432 8388608 float sum -1 244.1 137.47 240.57 0 256.0 131.07 229.38 0
67108864 16777216 float sum -1 429.1 156.38 273.66 0 440.6 152.30 266.52 0
134217728 33554432 float sum -1 797.4 168.32 294.55 0 805.8 166.56 291.48 0
268435456 67108864 float sum -1 1537.1 174.64 305.62 0 1546.0 173.63 303.85 0
536870912 134217728 float sum -1 3026.2 177.41 310.46 0 3034.2 176.94 309.64 0
1073741824 268435456 float sum -1 5970.6 179.84 314.72 0 5974.5 179.72 314.51 0
2147483648 536870912 float sum -1 11891 180.59 316.04 0 11895 180.54 315.94 0
4294967296 1073741824 float sum -1 23680 181.37 317.40 0 23769 180.69 316.21 0
8589934592 2147483648 float sum -1 47278 181.69 317.96 0 47323 181.52 317.66 0
17179869184 4294967296 float sum -1 94254 182.27 318.98 0 94328 182.13 318.73 0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth : 116.669
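A note on reading these numbers: for AllReduce, rccl-tests (like nccl-tests) derives the reported bus bandwidth from the algorithm bandwidth as busbw = algbw * 2 * (n - 1) / n, where n is the number of ranks. With n = 8 the factor is 1.75. A quick check against the largest message size above:
# For AllReduce, busbw = algbw * 2*(n-1)/n; check the 16 GiB row (algbw 182.27 GB/s, n = 8)
echo "182.27 8" | awk '{printf "busbw = %.2f GB/s\n", $1 * 2 * ($2 - 1) / $2}'
This yields ~318.97 GB/s, matching the reported 318.98 GB/s up to rounding of the printed algbw.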
You can further see that running with graph mode enabled (the -G 1 flag) improves latency for small message sizes, where the collective operations are latency-bound.
mpirun -np 8 --bind-to numa -env NCCL_DEBUG=VERSION rccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1 -G 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 17179869184 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:448c4c7
# Using devices
# Rank 0 Pid 2510387 on host device 0 [0000:0c:00.0] AMD Instinct MI300X
# Rank 1 Pid 2510388 on host device 1 [0000:22:00.0] AMD Instinct MI300X
# Rank 2 Pid 2510389 on host device 2 [0000:38:00.0] AMD Instinct MI300X
# Rank 3 Pid 2510390 on host device 3 [0000:5c:00.0] AMD Instinct MI300X
# Rank 4 Pid 2510391 on host device 4 [0000:9f:00.0] AMD Instinct MI300X
# Rank 5 Pid 2510392 on host device 5 [0000:af:00.0] AMD Instinct MI300X
# Rank 6 Pid 2510393 on host device 6 [0000:bf:00.0] AMD Instinct MI300X
# Rank 7 Pid 2510394 on host device 7 [0000:df:00.0] AMD Instinct MI300X
RCCL version : 2.22.3-develop:f5b15f2
HIP version : 6.3.42131-fa1d09cbd
ROCm version : 6.3.0.0-39-ce2be3b
Hostname : host
Librccl path : /home/User/all2all/doc/rccl/build/release/librccl.so.1
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 16.38 0.00 0.00 0 16.29 0.00 0.00 0
16 4 float sum -1 15.75 0.00 0.00 0 15.79 0.00 0.00 0
32 8 float sum -1 6.32 0.01 0.01 0 6.17 0.01 0.01 0
64 16 float sum -1 6.00 0.01 0.02 0 6.51 0.01 0.02 0
128 32 float sum -1 6.53 0.02 0.03 0 6.20 0.02 0.04 0
256 64 float sum -1 6.77 0.04 0.07 0 7.18 0.04 0.06 0
512 128 float sum -1 7.44 0.07 0.12 0 7.84 0.07 0.11 0
1024 256 float sum -1 7.47 0.14 0.24 0 7.76 0.13 0.23 0
2048 512 float sum -1 7.46 0.27 0.48 0 7.54 0.27 0.48 0
4096 1024 float sum -1 7.61 0.54 0.94 0 7.61 0.54 0.94 0
8192 2048 float sum -1 7.64 1.07 1.88 0 8.10 1.01 1.77 0
16384 4096 float sum -1 8.71 1.88 3.29 0 8.10 2.02 3.54 0
32768 8192 float sum -1 9.05 3.62 6.33 0 8.79 3.73 6.52 0
65536 16384 float sum -1 9.14 7.17 12.54 0 9.14 7.17 12.54 0
131072 32768 float sum -1 9.51 13.78 24.12 0 9.53 13.75 24.06 0
262144 65536 float sum -1 10.26 25.54 44.69 0 10.32 25.39 44.43 0
524288 131072 float sum -1 14.26 36.75 64.32 0 14.64 35.81 62.66 0
1048576 262144 float sum -1 19.40 54.06 94.61 0 19.72 53.18 93.06 0
2097152 524288 float sum -1 35.03 59.86 104.76 0 34.83 60.20 105.35 0
4194304 1048576 float sum -1 51.83 80.93 141.62 0 52.99 79.15 138.52 0
8388608 2097152 float sum -1 83.74 100.17 175.30 0 86.66 96.80 169.41 0
16777216 4194304 float sum -1 149.1 112.55 196.96 0 156.3 107.32 187.81 0
33554432 8388608 float sum -1 243.4 137.85 241.23 0 255.5 131.33 229.82 0
67108864 16777216 float sum -1 427.9 156.83 274.45 0 439.1 152.83 267.46 0
134217728 33554432 float sum -1 795.3 168.77 295.35 0 805.9 166.54 291.44 0
268435456 67108864 float sum -1 1534.4 174.94 306.15 0 1546.7 173.55 303.72 0
536870912 134217728 float sum -1 3024.0 177.54 310.69 0 3034.5 176.92 309.62 0
1073741824 268435456 float sum -1 5961.9 180.10 315.18 0 5974.7 179.71 314.50 0
2147483648 536870912 float sum -1 11880 180.77 316.34 0 11897 180.50 315.88 0
4294967296 1073741824 float sum -1 23658 181.55 317.71 0 23681 181.37 317.40 0
8589934592 2147483648 float sum -1 47244 181.82 318.18 0 47287 181.65 317.89 0
17179869184 4294967296 float sum -1 94182 182.41 319.22 0 94540 181.72 318.01 0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth : 120.69
Best Practices to Maximize RCCL Efficiency#
Utilization of all GPUs within a node
The MI300X architecture features dedicated links between GPUs, forming a fully connected topology.
For collective operations, the highest performance is achieved when all 8 GPUs are used, ensuring all inter-GPU links are active.
When performing collective operations with only 2 or 4 GPUs, only a fraction of the node’s available bandwidth is utilized. If your workload limits you to fewer than 8 MI300X GPUs on a system, you can use the run-time variable
NCCL_MIN_NCHANNELS
to increase the number of channels (CUs) used, thereby improving RCCL performance. Use the following environment variable to increase the number of channels used by RCCL in single-node end-to-end workloads and potentially improve performance:
export NCCL_MIN_NCHANNELS=112
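For instance, a sketch of a 4-GPU AllReduce run combined with this setting, following the same rccl-tests invocation style used earlier in this post (adjust paths and message-size ranges to your setup):
# Hypothetical 4-GPU AllReduce run with additional RCCL channels (single node)
mpirun -np 4 --bind-to numa -env NCCL_MIN_NCHANNELS=112 rccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1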
Disable NUMA Auto balancing
NUMA (Non-Uniform Memory Access) auto-balancing is designed to improve memory access efficiency on multi-socket systems. It dynamically moves memory pages between NUMA nodes to balance memory load across available processors. While this is beneficial for some workloads, it can introduce performance overhead in GPU collective operations. Disabling NUMA auto-balancing reduces performance variability and helps achieve more consistent results.
How to check whether NUMA auto-balancing is disabled:
Run the following command to check its status:
cat /proc/sys/kernel/numa_balancing
If the output is 0, NUMA auto-balancing is already disabled.
If the output is 1, NUMA auto-balancing is enabled and should be turned off.
To disable it run the following command.
sudo sysctl kernel.numa_balancing=0
Note: Please make sure daemons like
numad
are also disabled.
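The sysctl change above does not persist across reboots, and numad can be checked through systemd. A sketch of both steps, assuming a systemd-based distribution (skip the numad commands if the service is not installed):
# Persist the NUMA auto-balancing setting across reboots
echo 'kernel.numa_balancing=0' | sudo tee /etc/sysctl.d/99-disable-numa-balancing.conf
sudo sysctl --system

# Check for, stop, and disable the numad daemon if it is present
systemctl status numad
sudo systemctl disable --now numad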
Use a Single Process per GPU
RCCL delivers the best performance for collectives when it is configured in one-process-per-GPU mode. With a one-process-per-multiple-GPUs configuration, you can run into kernel-launch latency issues, because ROCm serializes kernel launches on multiple GPUs from one process, which hurts performance. The examples below are specific to rccl-tests.
How to enable Single Process per GPU mode:
mpirun -np #number_of_GPUs ./build/all_reduce_perf -b minbytes -e maxbytes -f multiplier -g 1
If one-process-per-GPU mode is not feasible, a single-process, multi-threaded approach provides better performance than a single-process, single-threaded one:
mpirun -np #number_of_GPUs ./build/all_reduce_perf -b minbytes -e maxbytes -f multiplier -g 1 -t threads_per_process
Enabling graph mode can further improve performance by reducing kernel-launch overhead:
mpirun -np #number_of_GPUs ./build/all_reduce_perf -b minbytes -e maxbytes -f multiplier -g gpus_per_thread -t threads_per_process -G number_of_kernel_graph_launch
Summary#
In this blog, we discussed xGMI bandwidth and RCCL performance on the MI300X, offering key insights into inter-GPU communication efficiency and highlighting both theoretical expectations and real-world performance considerations. While xGMI provides a high-bandwidth, low-latency interconnect, practical limitations such as protocol overhead, bit correction, and slowest-link constraints reduce the bandwidth that AI and HPC workloads can actually use.
We also walked through benchmarking with TransferBench and rccl-tests to understand xGMI bandwidth and RCCL performance. We observed that the achievable aggregated bandwidth per GPU reaches about 336 GB/s, while practical execution yields values between 310-330 GB/s, largely dictated by the slowest link in the communication topology. By applying the best practices described above, users can further enhance bandwidth utilization and maximize efficiency. These optimizations empower developers to fully harness the MI300X platform, driving faster AI training and high-performance computing workloads with greater scalability and efficiency.