# GEMM Tuning within hipBLASLt – Part 2

This post continues from Part 1 where we introduced GEMM tuning concepts in hipBLASLt and explored the basics of solution search. In Part 2, we focus on offline tuning with the hipblaslt-bench tool. This workflow allows developers to benchmark candidate GEMM kernels for specific problem shapes, capture the best-performing solutions, and reuse them at runtime without rebuilding or modifying the hipBLASLt library.
## Using hipBLASLt Offline Tuning with hipblaslt-bench

The `hipblaslt-bench` tool enables developers to find the best-performing GEMM kernel for specific problem sizes. The output, known as the solution index, can be reused in future GEMM calls via the `HIPBLASLT_TUNING_OVERRIDE_FILE` mechanism.
**Note:** These solution indices are not guaranteed to remain valid across ROCm versions. You must re-run tuning whenever you upgrade.
## Workflow Overview
1. Enable logging to capture GEMM shapes:

   ```shell
   export HIPBLASLT_LOG_MASK=32
   ```

2. Run your GEMM operation or application. The log will emit a `hipblaslt-bench` command like:

   ```shell
   hipblaslt-bench --api_method c -m 1024 -n 512 -k 1024 \
       --lda 1024 --ldb 1024 --ldc 1024 --alpha 1.0 --beta 1.0 \
       --transA N --transB N --batch_count 1 \
       --a_type f16_r --b_type f16_r --c_type f16_r --d_type f16_r \
       --scale_type f32_r --bias_type f32_r \
       --compute_type f32_r --algo_method index \
       --solution_index <<<INDEX>>>
   ```
3. Set the following environment variable, then run the `hipblaslt-bench` command emitted for your GEMM operation or application. This enables tuning mode and saves the best solution index:

   ```shell
   export HIPBLASLT_TUNING_FILE=tuning.txt
   ```

   After the benchmark run, this generates a `tuning.txt` file containing the tuned solution index.

4. To apply the tuned result at runtime, unset the tuning variable and set the override variable:

   ```shell
   unset HIPBLASLT_TUNING_FILE
   export HIPBLASLT_TUNING_OVERRIDE_FILE=tuning.txt
   ```

   This allows hipBLASLt to override the default solution with the one stored in `tuning.txt`. If `--algo_method heuristic` was used during benchmarking, the runtime overrides the default heuristic result with the pre-selected solution index found in the file. This also affects runtime behavior when using the C API `hipblasLtMatmulAlgoGetHeuristic` or the C++ API `algoGetHeuristic`: these functions return the tuned solution if a matching entry exists in the override file.
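The log-scraping part of step 2 can be automated with a small helper. The sketch below assumes the emitted command appears on a single log line containing `hipblaslt-bench`; the log prefix shown in the demonstration is synthetic, and the real output format may differ across ROCm versions, so adjust the pattern to what your log actually contains.

```shell
# Sketch: extract the first hipblaslt-bench invocation from a captured GEMM
# log so it can be re-run under HIPBLASLT_TUNING_FILE. Assumes one command
# per log line; the grep pattern may need adjusting for your ROCm version.
extract_bench_cmd() {
    grep -m1 -o 'hipblaslt-bench.*' "$1"
}

# Demonstration with a synthetic log line (a real log comes from running
# your application with HIPBLASLT_LOG_MASK=32 set):
cat > /tmp/gemm_log.txt <<'EOF'
[hipblaslt] hipblaslt-bench --api_method c -m 1024 -n 512 -k 1024 --transA N --transB N
EOF
extract_bench_cmd /tmp/gemm_log.txt
```

Once extracted, the command can be re-run verbatim (for example with `eval`) while `HIPBLASLT_TUNING_FILE` is set, so the best solution index is recorded without retyping the shape parameters by hand.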
## Example Summary

```shell
# Step 1: Enable logging
export HIPBLASLT_LOG_MASK=32

# Step 2: Run the benchmarked GEMM
./my_gemm_app

# Step 3: Set tuning mode and re-run the emitted hipblaslt-bench command
export HIPBLASLT_TUNING_FILE=tuning.txt

# Step 4: At runtime, override the default logic
unset HIPBLASLT_TUNING_FILE
export HIPBLASLT_TUNING_OVERRIDE_FILE=tuning.txt
```

Once enabled, your GEMM calls use custom-tuned kernels without any change to the library binaries.
## Advantages & Limitations

| Feature | Description |
|---|---|
| Easy deployment | No need to rebuild the library; just load a tuning file at runtime |
| Re-tuning required after upgrade | Solution indices may change between ROCm versions |
## Summary

The `hipblaslt-bench` offline tuning approach is optimal when you want runtime flexibility without modifying or recompiling the library. It supports easy deployment of tuned kernels for stable GEMM workloads and lets you update your tuning results independently of the library release. This makes it a practical choice when you need quick performance gains with minimal setup effort.
However, for maximum long-term performance and consistency, especially if you're managing multiple library versions or hardware generations, Part 1's `find_exact.py` workflow may offer more control.