# GEMM Tuning within hipBLASLt – Part 2

This post continues from Part 1 where we introduced GEMM tuning concepts in hipBLASLt and explored the basics of solution search. In Part 2, we focus on offline tuning with the hipblaslt-bench tool. This workflow allows developers to benchmark candidate GEMM kernels for specific problem shapes, capture the best-performing solutions, and reuse them at runtime without rebuilding or modifying the hipBLASLt library.
## Using hipBLASLt Offline Tuning with hipblaslt-bench

The `hipblaslt-bench` tool enables developers to find the best-performing GEMM kernel for specific problem sizes. The output, known as the solution index, can be reused in future GEMM calls via the `HIPBLASLT_TUNING_OVERRIDE_FILE` mechanism.
**Note:** These solution indices are not guaranteed to remain valid across ROCm versions. You must re-run tuning whenever you upgrade.
## Workflow Overview
1. Enable logging to capture GEMM shapes:

   ```shell
   export HIPBLASLT_LOG_MASK=32
   ```

2. Run your GEMM operation or application. The log will emit a `hipblaslt-bench` command like:

   ```shell
   hipblaslt-bench --api_method c -m 1024 -n 512 -k 1024 \
       --lda 1024 --ldb 1024 --ldc 1024 --alpha 1.0 --beta 1.0 \
       --transA N --transB N --batch_count 1 \
       --a_type f16_r --b_type f16_r --c_type f16_r --d_type f16_r \
       --scale_type f32_r --bias_type f32_r \
       --compute_type f32_r --algo_method index \
       --solution_index <<<INDEX>>>
   ```
3. Set the following environment variable, then run the `hipblaslt-bench` command emitted for your GEMM operation or application. This enables tuning mode and saves the best solution index:

   ```shell
   export HIPBLASLT_TUNING_FILE=tuning.txt
   ```

   After the benchmark run, this generates a `tuning.txt` file containing the tuned solution index.

4. To apply the tuned result at runtime, unset the tuning variable and set the override variable:

   ```shell
   unset HIPBLASLT_TUNING_FILE
   export HIPBLASLT_TUNING_OVERRIDE_FILE=tuning.txt
   ```

   This allows hipBLASLt to override the default solution with the one stored in `tuning.txt`. If `--algo_method heuristic` was used during benchmarking, the runtime overrides the default heuristic result with the pre-selected solution index found in the file. This also affects runtime behavior when using the C API `hipblasLtMatmulAlgoGetHeuristic` or the C++ API `algoGetHeuristic`: these functions return the tuned solution if a matching entry exists in the override file.
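The log-scraping part of step 2 can be automated with a small helper. The sketch below assumes the emitted command appears on a single log line containing `hipblaslt-bench`; the log prefix shown in the demonstration is synthetic, and the real output format may differ across ROCm versions, so adjust the pattern to what your log actually contains.

```shell
# Sketch: extract the first hipblaslt-bench invocation from a captured GEMM
# log so it can be re-run under HIPBLASLT_TUNING_FILE. Assumes one command
# per log line; the grep pattern may need adjusting for your ROCm version.
extract_bench_cmd() {
    grep -m1 -o 'hipblaslt-bench.*' "$1"
}

# Demonstration with a synthetic log line (a real log comes from running
# your application with HIPBLASLT_LOG_MASK=32 set):
cat > /tmp/gemm_log.txt <<'EOF'
[hipblaslt] hipblaslt-bench --api_method c -m 1024 -n 512 -k 1024 --transA N --transB N
EOF
extract_bench_cmd /tmp/gemm_log.txt
```

Once extracted, the command can be re-run verbatim (for example with `eval`) while `HIPBLASLT_TUNING_FILE` is set, so the best solution index is recorded without retyping the shape parameters by hand.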
## Example Summary

```shell
# Step 1: Enable logging
export HIPBLASLT_LOG_MASK=32

# Step 2: Run the benchmarked GEMM
./my_gemm_app

# Step 3: Set tuning mode and re-run the emitted hipblaslt-bench command
export HIPBLASLT_TUNING_FILE=tuning.txt

# Step 4: At runtime, override the default logic
unset HIPBLASLT_TUNING_FILE
export HIPBLASLT_TUNING_OVERRIDE_FILE=tuning.txt
```

Once enabled, your GEMM calls use custom-tuned kernels without any change to the library binaries.
## Advantages & Limitations

| Feature | Description |
|---|---|
| Easy deployment | No need to rebuild the library; just load a tuning file at runtime |
| Re-tuning required after upgrade | Solution indices may change between ROCm versions |
## Summary

The `hipblaslt-bench` offline tuning approach is optimal when you want runtime flexibility without modifying or recompiling the library. It supports easy deployment of tuned kernels for stable GEMM workloads and lets you update your tuning results independently of the library release. This makes it a practical choice when you need quick performance gains with minimal setup effort.
However, for maximum long-term performance and consistency, especially if you're managing multiple library versions or hardware generations, Part 1's `find_exact.py` workflow may offer more control.