AMD GPU Programming From Beginner to Expert (Part 1) - TensorDescriptor in Composable Kernel (CK)#

March 25, 2026 by Hang Yang.
6 min read. | 1535 total words.

Writing efficient GPU kernels requires more than knowing the API—it demands a deep understanding of the underlying concepts, from GPU architecture to low-level programming patterns. This blog series demystifies GPU kernel programming on AMD GPUs by breaking down common kernels into their fundamental building blocks. Rather than treating GPU programming as a black box, each blog focuses on a specific concept, starting from first principles and building up to complete implementations with simple, insightful example code. In this blog, you will learn one of the most fundamental concepts in Composable Kernel (CK): the TensorDescriptor—a powerful abstraction for managing multi-dimensional data layouts and transformations. By the end of this series, you will be able to not only understand existing GPU kernels but also design and optimize your own.

Tensor#

Conceptual Understanding#

Logically speaking, a Tensor can be understood as a mapping from logical coordinates to physical memory addresses. For example, if we define a K-dimensional Tensor \(T(d_1, d_2, d_3, \ldots, d_K)\), we can read and write elements within this Tensor as follows:

\[T[a_1, a_2, \ldots, a_K] = T[b_1, b_2, \ldots, b_K]\]

Assuming the Tensor’s data pointer is P, and the stride for each dimension is \([s_1, s_2, \ldots, s_K]\), then in terms of physical storage, the above data access translates to:

\[P\left[\sum_{i=1}^{K} a_i s_i\right] = P\left[\sum_{i=1}^{K} b_i s_i\right]\]

This demonstrates the fundamental mapping: logical multi-dimensional coordinates are converted to a single linear memory offset by taking the dot product of the coordinate vector with the stride vector. CK (Composable Kernel) needs to implement this mapping relationship efficiently.
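The dot-product mapping can be sketched in a few lines of plain C++ (a hypothetical helper of our own, not a CK API):

```cpp
#include <array>
#include <cstddef>

// Compute the linear offset of a K-dimensional coordinate given
// per-dimension strides, i.e. the dot product sum_i a_i * s_i above.
template <std::size_t K>
std::size_t linear_offset(const std::array<std::size_t, K>& coord,
                          const std::array<std::size_t, K>& strides) {
    std::size_t offset = 0;
    for (std::size_t i = 0; i < K; ++i)
        offset += coord[i] * strides[i];
    return offset;
}
```

For a row-major (4, 8) tensor the strides are (8, 1), so coordinate (2, 3) maps to 2 * 8 + 3 = 19.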

TensorDescriptor#

Overview#

CK uses TensorDescriptor to define Tensors. The source code definition is as follows:

template <typename Transforms,
          typename LowerDimensionIdss,
          typename UpperDimensionIdss,
          typename VisibleDimensionIds,
          typename ElementSpaceSize>
struct TensorDescriptor

The struct definition introduces a few potentially confusing concepts, which we will explain next. Let’s start with Transforms.

The Concept of Transform#

To understand TensorDescriptor, we need to introduce a core concept: Transform. In CK, a Transform is defined as a struct type. TensorDescriptor uses a tree structure composed of multi-level coordinates and multiple Transforms to represent a Tensor, as shown in Figure 1.

Each Transform defines a method called CalculateLowerIndex, which maps upper-level coordinates to lower-level coordinates. At the bottom level of the TensorDescriptor hierarchy, we have a one-dimensional coordinate that directly corresponds to physical memory storage. Through a series of Transforms, we can construct any desired target coordinate system from this base.
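To make the idea concrete, here is a minimal sketch of a Transform in the spirit of CK’s UnMerge, with a simplified CalculateLowerIndex signature (CK’s real version writes into a multi-index; this is our own illustration, not CK code):

```cpp
#include <cstddef>

// Sketch of an UnMerge-like Transform: it maps an upper 2-D coordinate
// (i, j) over upper lengths (len0, len1) down to the single lower index
// i * len1 + j.
struct UnmergeSketch {
    std::size_t len0, len1; // upper dimension lengths

    // Analogous to CalculateLowerIndex in CK: upper coords -> lower coord.
    std::size_t CalculateLowerIndex(std::size_t i, std::size_t j) const {
        return i * len1 + j;
    }
};
```

With upper lengths (4, 64), the upper coordinate (1, 3) maps to the lower index 1 * 64 + 3 = 67, matching the CK example later in this post.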

Example: Building a 3D Tensor from a 2D Base#

Let’s consider an example: we start with an (M, K) Tensor, then split the first dimension into two dimensions, resulting in a final (M1, M2, K) Tensor. If we represent this using CK’s TensorDescriptor, the data structure looks like this:

amd_gpu_programming_guide_1

Figure 1. TensorDescriptor tree structure

Note: If we define a 2D tensor in CK, an instance of Transform named Embed will be implicitly inserted, as shown in Figure 1.

In the diagram:

  • Circle nodes represent dimensions/coordinates.

  • Square boxes represent Transforms.

Coordinate Mapping Process#

How do we map high-dimensional coordinates to physical addresses? Following the structure shown above, we execute each Transform’s CalculateLowerIndex method from top to bottom. For example, to map coordinates (a1, a2, a3) to a physical address, the process is illustrated in Figure 2, below:

amd_gpu_programming_guide_2

Figure 2: Coordinate transformation process

Dimension Indexing System#

If we assign a global index to each dimension (all the circles in the figure) shown in Figure 1, we obtain Figure 3:

amd_gpu_programming_guide_3

Figure 3: Global dimension numbering

We define the following terminology:

  • Upper dimension id: The dimension id(s) above each Transform.

  • Lower dimension id: The dimension id(s) below each Transform.

  • Visible dimension id: The dimension id(s) at the top level of the tree structure (what the user directly interacts with).

From Figure 3, we can extract four tuples that fully describe the TensorDescriptor:

Transforms = [Embed, UnMerge, PassThrough]
LowerDimensionIdss = [
    [0], [1], [2]
]
UpperDimensionIdss = [
    [1, 2],
    [3, 4],
    [5]
]
VisibleDimensionIds = [3, 4, 5]

For a given Transform Transforms[i], its upper dimension ids are UpperDimensionIdss[i] and its lower dimension ids are LowerDimensionIdss[i].

These tuples encode:

  • Which transforms are applied.

  • How dimensions are connected between levels.

  • Which dimensions are exposed to the user.
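These four tuples can be written down as plain data (our own runtime sketch; in CK they are compile-time types) to check that each Transform pairs one entry of LowerDimensionIdss with one entry of UpperDimensionIdss:

```cpp
#include <string>
#include <vector>

// Plain-data sketch of the four tuples from Figure 3: transform i
// connects lower[i] (dims below it) to upper[i] (dims above it).
struct DescriptorSketch {
    std::vector<std::string> transforms{"Embed", "UnMerge", "PassThrough"};
    std::vector<std::vector<int>> lower{{0}, {1}, {2}};
    std::vector<std::vector<int>> upper{{1, 2}, {3, 4}, {5}};
    std::vector<int> visible{3, 4, 5};
};
```

Note that the three vectors of transforms, lower ids, and upper ids have the same length by construction: index i fully describes one box in the tree.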

Code Example#

Here’s a complete example demonstrating how to create and use TensorDescriptors:

#include "ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle.hpp"
#include "ck/library/utility/literals.hpp"
#include "ck/utility/functional3.hpp"
#include "ck/utility/static_buffer.hpp"
#include "ck/utility/tuple.hpp"
#include "ck/tensor_description/tensor_descriptor_helper.hpp"
#include "ck/tensor_operation/gpu/grid/gridwise_gemm_xdlops_bwd_weight.hpp"

int main() {
    index_t M = 256, K = 128;

    // Instantiate the Transform UnMerge to split the dimension M into (M1, M2).
    // There are multiple ways to instantiate an UnMerge Transform:
    // auto unmerge = make_unmerge_transform(make_tuple(4, 64));
    // auto up_lengths = make_tuple(4, 64);
    // Tuple<Number<4>, Number<64>> up_lengths;
    Tuple<int, int> up_lengths{4, 64};
    UnMerge<decltype(up_lengths), false> unmerge{up_lengths};

    // Try calling the CalculateLowerIndex method of UnMerge.
    // This maps coordinates (1, 3) to a single linear index.
    auto lower_idx = make_multi_index(0);
    unmerge.CalculateLowerIndex(lower_idx, make_multi_index(1, 3));
    printf("Unmerge lower_idx = %d\n", lower_idx[Number<0>{}]);

    // For each layer of dimensions, unhandled dimensions have to be passed
    // through an identity transform named PassThrough.
    // There are multiple ways to instantiate a PassThrough Transform:
    // auto passthrough = make_pass_through_transform(K);
    PassThrough<int> passthrough{K};

    // Create a naive tensor descriptor (implicitly includes an Embed Transform).
    // This creates a 2D tensor with row-major layout
    // (stride K for rows, stride 1 for columns).
    auto tensor_desc = make_naive_tensor_descriptor(make_tuple(M, K), make_tuple(K, 1));
    printf(
        "tensor_desc.shape = %d, %d\n",
        tensor_desc.GetLength(Number<0>{}),
        tensor_desc.GetLength(Number<1>{})
    );

    // When performing transforms, use logical dimension ids, not global dimension ids.
    // This transforms the (256, 128) tensor to (4, 64, 128):
    // - Dimension 0 (size 256) is unmerged into dimensions (4, 64)
    // - Dimension 1 (size 128) passes through unchanged
    auto transformed_tensor_desc = transform_tensor_descriptor(
        tensor_desc,
        make_tuple(unmerge, passthrough),
        make_tuple(Sequence<0>{}, Sequence<1>{}),   // Lower dimension ids
        make_tuple(Sequence<0, 1>{}, Sequence<2>{}) // Upper dimension ids
    );
    printf(
        "transformed_tensor_desc.shape = %d, %d, %d\n",
        transformed_tensor_desc.GetLength(Number<0>{}),
        transformed_tensor_desc.GetLength(Number<1>{}),
        transformed_tensor_desc.GetLength(Number<2>{})
    );

    // Calculate the physical offset for coordinate (1, 3, 2).
    auto coord = make_multi_index(1, 3, 2);
    printf("physical offset = %d\n", transformed_tensor_desc.CalculateOffset(coord));

    // Get the tensor coordinate, which includes hidden intermediate dimension values.
    auto tensor_coord = make_tensor_coordinate(transformed_tensor_desc, coord);
    auto hidden_idx = tensor_coord.GetHiddenIndex();
    printf("hidden_idx = ");
    static_for<0, hidden_idx.Size(), 1>{}([&](auto i){
        printf("%d, ", hidden_idx[i]);
    });
    return 0;
}

After compiling and running, the output is:

Unmerge lower_idx = 67
tensor_desc.shape = 256, 128
transformed_tensor_desc.shape = 4, 64, 128
physical offset = 8578
hidden_idx = 8578, 67, 2, 1, 3, 2,

Understanding the Output#

  • Unmerge lower_idx = 67: When we unmerge coordinates (1, 3) with dimensions (4, 64), we get 1 * 64 + 3 = 67

  • tensor_desc.shape = 256, 128: The original 2D tensor shape

  • transformed_tensor_desc.shape = 4, 64, 128: The transformed 3D tensor shape generated by splitting the first dimension (256) into two dimensions (4, 64)

  • physical offset = 8578: The linear memory offset for coordinate (1, 3, 2), calculated as (1 * 64 + 3) * 128 + 2 = 67 * 128 + 2 = 8578

  • hidden_idx: Contains values at all levels of the tree structure from Figure 3
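The entire chain can be replayed by hand (a plain C++ sketch of the arithmetic, not CK code): the visible coordinate (1, 3, 2) flows down the tree through UnMerge and PassThrough, then Embed, producing every hidden value printed above.

```cpp
// Visible coordinate (global dims 3, 4, 5) from the example above.
constexpr int a1 = 1, a2 = 3, a3 = 2;
// UnMerge with upper lengths (4, 64): (1, 3) -> 67   (global dim 1)
constexpr int m = a1 * 64 + a2;
// PassThrough: 2 -> 2                                (global dim 2)
constexpr int k = a3;
// Embed with strides (128, 1): 67 * 128 + 2 = 8578   (global dim 0)
constexpr int offset = m * 128 + k * 1;
```

Reading the hidden index bottom-up gives exactly the printed sequence: 8578 (dim 0), 67 (dim 1), 2 (dim 2), then the visible coordinate 1, 3, 2 (dims 3, 4, 5).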

Chaining Transforms#

We can continue to transform the tensor_desc. For example, let’s merge the second and third dimensions into a single dimension:

// #include ...

int main() {

    // ...

    // Merge dimensions with sizes 64 and 128
    auto low_dims = make_tuple(64, 128);
    Merge_v4_no_carry<decltype(low_dims)> merge{low_dims};

    // Transform: keep dimension 0 (size 4), merge dimensions 1 and 2 (64 * 128 = 8192)
    auto new_transformed_tensor_desc = transform_tensor_descriptor(
        transformed_tensor_desc,
        make_tuple(make_pass_through_transform(4), merge),
        make_tuple(Sequence<0>{}, Sequence<1, 2>{}), // Lower dimension ids
        make_tuple(Sequence<0>{}, Sequence<1>{})     // Upper dimension ids
    );
    printf(
        "new_transformed_tensor_desc.shape = %d, %d\n",
        new_transformed_tensor_desc.GetLength(Number<0>{}),
        new_transformed_tensor_desc.GetLength(Number<1>{})
    );
}

The output is:

new_transformed_tensor_desc.shape = 4, 8192

This demonstrates the composability of transforms: we can chain multiple transformation operations to achieve complex tensor layout manipulations. The final shape (4, 8192) shows that we have successfully merged the (64, 128) dimensions into a single dimension of size 8192.

Key Takeaways#

  1. TensorDescriptor provides a flexible way to represent complex tensor layouts through hierarchical transformations

  2. Transforms are composable operations that map between coordinate spaces

  3. Common transforms include:

    • Embed: Maps multi-dimensional coordinates to linear memory

    • Unmerge: Splits one dimension into multiple dimensions

    • Merge: Combines multiple dimensions into one

    • PassThrough: Preserves a dimension unchanged

  4. The tree structure allows CK to efficiently compute physical memory offsets from logical coordinates

  5. Transforms can be chained to build complex layouts from simpler ones

This design enables CK to handle various tensor layouts (row-major, column-major, tiled, etc.) in a unified and composable manner, which is essential for optimizing GPU kernel performance.

Example of Matrix Transpose#

Overview#

Matrix transpose is a fundamental operation in linear algebra where we swap rows and columns of a matrix. Given an input matrix A of shape (M, K), the transpose operation produces an output matrix \(A^T\) of shape (K, M), where \(A^T[i,j] = A[j,i]\).
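Before looking at the GPU code, it helps to have a reference implementation on hand for verification. A minimal row-major CPU version of the definition \(A^T[i,j] = A[j,i]\) (our own sketch; these names are not from CK) looks like this:

```cpp
#include <vector>
#include <cstddef>

// CPU reference transpose: `src` is an M x K row-major matrix,
// `dst` is the K x M row-major result, dst[j][i] = src[i][j].
void transpose_ref(const std::vector<float>& src, std::vector<float>& dst,
                   std::size_t M, std::size_t K) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < K; ++j)
            dst[j * M + i] = src[i * K + j];
}
```

Comparing the GPU output against this loop element by element is the simplest correctness check for the kernel below.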

In this example, we demonstrate an efficient GPU implementation using CK. The key idea is to leverage parallelism at multiple levels:

  • Each GPU thread processes a 4×4 sub-matrix

  • Threads are organized into blocks of 8×8 (64 threads per block)

  • Each block therefore processes a 32×32 tile of the input matrix

This approach is efficient because:

  1. Vectorized memory access: We use vector loads/stores for coalesced global memory access

  2. Register-level transpose: The 4×4 transpose happens entirely in registers (VGPRs)

  3. Minimal synchronization: No shared memory or thread synchronization needed

The data processing of each thread is shown in Figure 4 below:

amd_gpu_programming_guide_4

Figure 4: Data processing of each thread

The AMD MI308X has 80 CUs (please refer to the official AMD documentation for GPU architecture details). Each block is scheduled onto a single CU, so for demonstration purposes we choose M and K so that there are exactly 80 blocks, fully utilizing the 80 CUs: M = 80 * 32 = 2560 and K = 32.
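The tile decomposition just described can be captured in a tiny helper (our own illustration, not CK code): with 8×8 threads per block and a 4×4 sub-matrix per thread, thread (tx, ty) in block bx owns the tile whose origin is row (bx * 8 + tx) * 4, column ty * 4.

```cpp
// Per-thread tile origin in the (M, K) input matrix, mirroring the
// index arithmetic used by the kernel below.
struct TileOrigin { int row, col; };

TileOrigin tile_origin(int bx, int tx, int ty) {
    return { (bx * 8 + tx) * 4, // 32 rows per block, 4 rows per thread
             ty * 4 };          // 4 columns per thread
}
```

The last thread (tx = ty = 7) of the last block (bx = 79) starts at row 2556, column 28, so the 80 blocks exactly tile the 2560 × 32 matrix.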

This matrix transpose example demonstrates several key CK concepts:

  1. Tensor descriptors: Clean abstraction for multi-dimensional data with arbitrary strides

  2. Dynamic buffers: Type-safe GPU memory access with coordinate transformation

  3. Compile-time loops: static_for enables loop unrolling and optimization

  4. Vector types: Efficient vectorized memory operations

  5. Register-level computation: Maximizing throughput by keeping data in registers

Host Code Walkthrough#

The host code sets up the data, launches the kernel, and verifies the results. It uses a few CK host utilities, such as HostTensorDescriptor and DeviceMem, whose roles are self-explanatory in the code snippet below:

void matrix_transpose() {
    // =================================================
    // STEP 1: Define Matrix Dimensions
    // =================================================
    // We'll transpose a 2560×32 matrix to a 32×2560 matrix
    // M = number of rows, K = number of columns
    index_t M = 2560;
    index_t K = 32;

    // =================================================
    // STEP 2: Create and Initialize Host (CPU) Input Tensor
    // =================================================
    // Create a host tensor descriptor with:
    //   - Shape: {M, K} = {2560, 32}
    //   - Strides: {K, 1} = {32, 1} (row-major layout)
    HostTensorDescriptor host_desc{{M, K}, {K, 1}};

    // Allocate the host tensor using the descriptor
    Tensor<float> host_tensor(host_desc);

    // Fill the tensor with random integer values between -5 and 5
    // This helps with debugging (integer values are easier to verify)
    ck::utils::FillUniformDistributionIntegerValue<float>{-5.f, 5.f}(host_tensor);

    // =================================================
    // STEP 3: Allocate Device (GPU) Memory and Copy Input Data
    // =================================================
    // Allocate GPU memory for the input matrix
    // Size = number of elements × size of each element
    DeviceMem device_buf(sizeof(float) * host_tensor.mDesc.GetElementSpaceSize());

    // Copy data from host to device (CPU → GPU)
    device_buf.ToDevice(host_tensor.mData.data());

    // =================================================
    // STEP 4: Create and Allocate Host/Device Output Tensor
    // =================================================
    // Create descriptor for the TRANSPOSED output matrix
    // Note: dimensions are swapped from {M, K} to {K, M}
    //   - Shape: {K, M} = {32, 2560}
    //   - Strides: {M, 1} = {2560, 1} (row-major layout)
    HostTensorDescriptor ret_desc{{K, M}, {M, 1}};

    // Allocate the host output tensor
    Tensor<float> ret_tensor(ret_desc);

    // Allocate GPU memory for the output matrix
    DeviceMem ret_buf(sizeof(float) * ret_tensor.mDesc.GetElementSpaceSize());

    // Initialize the GPU output buffer from the host output tensor
    // (its contents will be overwritten by the kernel)
    ret_buf.ToDevice(ret_tensor.mData.data());

    // =================================================
    // STEP 5: Configure Kernel Launch Parameters
    // =================================================
    // Block dimension: 8×8 threads = 64 threads per block
    // Each thread processes a 4×4 sub-matrix
    // Therefore, each block processes a 32×32 tile (8×4 = 32 in each dimension)
    dim3 blockDim{8, 8};

    // =================================================
    // STEP 6: Launch the Kernel
    // =================================================
    // Grid dimension calculation:
    //   - Each block handles 32 rows (8 threads × 4 rows per thread)
    //   - Total rows: M = 2560
    //   - Grid size: M / 32 = 2560 / 32 = 80 blocks
    //   - Equivalently: M / 8 / 4 = 2560 / 8 / 4 = 80
    //
    // Launch configuration summary:
    //   - Grid: 80 blocks
    //   - Block: 8×8 = 64 threads
    //   - Total threads: 80 × 64 = 5,120 threads
    //   - Work per thread: 4×4 = 16 elements
    //   - Total elements: 5,120 × 16 = 81,920 = 2560 × 32 ✓
    matrix_transpose_kernel<<<M / 8 / 4, blockDim, 0, 0>>>(
        (float*)device_buf.GetDeviceBuffer(),   // Input: M×K matrix
        (float*)ret_buf.GetDeviceBuffer(),      // Output: K×M matrix
        M, K                                    // Dimensions
    );

    // =================================================
    // STEP 7: Copy Results Back and Verify
    // =================================================
    // Copy the transposed matrix from GPU back to CPU
    ret_buf.FromDevice(ret_tensor.mData.data());

    // Print a 6×6 sample of the input matrix
    printf("host_tensor: \n");
    for (int i = 0; i < 6; i++) {
        for (int j = 0; j < 6; j++) {
            // Access: row i, column j in row-major layout (stride = K)
            printf("%f, ", host_tensor.mData[i * K + j]);
        }
        printf("\n");
    }

    // Print a 6×6 sample of the output (transposed) matrix
    printf("ret_tensor: \n");
    for (int i = 0; i < 6; i++) {
        for (int j = 0; j < 6; j++) {
            // Access: row i, column j in row-major layout (stride = M)
            printf("%f, ", ret_tensor.mData[i * M + j]);
        }
        printf("\n");
    }

    printf("matrix_transpose done\n");
}

Kernel Implementation Walkthrough#

Here’s the complete kernel with detailed inline comments explaining each step:

__global__ void matrix_transpose_kernel(float * src, float * dst, index_t M, index_t K) {
    // =================================================
    // STEP 1: Tensor Descriptor Setup
    // =================================================
    // Create a row-major tensor descriptor for the input matrix
    // - Shape: (M, K)
    // - Strides: (K, 1) means row-major layout
    // - Element at (i, j) is at offset i * K + j
    auto tensor_desc = make_naive_tensor_descriptor(make_tuple(M, K), make_tuple(K, 1));

    // Wrap the raw pointer with a dynamic buffer abstraction
    // This provides type-safe access with coordinate transformation
    auto buf = make_dynamic_buffer<AddressSpaceEnum::Global>(
            src, tensor_desc.GetElementSpaceSize());

    // Create descriptor for the output (transposed) matrix
    // - Shape: (K, M) - dimensions are swapped
    // - Strides: (M, 1) - still row-major layout
    // - Element at (i, j) is at offset i * M + j
    auto ret_desc = make_naive_tensor_descriptor(make_tuple(K, M), make_tuple(M, 1));

    // Wrap the output buffer
    auto ret_buf = make_dynamic_buffer<AddressSpaceEnum::Global>(
            dst, ret_desc.GetElementSpaceSize());

    // =================================================
    // STEP 2: Compute Thread's Data Region
    // =================================================
    // Each thread processes a 4×4 sub-matrix
    // x: row offset (starting position in the row dimension)
    // y: column offset (starting position in the column dimension)
    //
    // Breaking down the x calculation:
    //   - blockIdx.x * 8 * 4: the block's row offset (each block spans 32 rows)
    //   - threadIdx.x * 4: the thread's row offset within the block (0, 4, 8, ..., 28)
    //
    // Breaking down the y calculation:
    //   - threadIdx.y * 4: the thread's column offset within the block (0, 4, 8, ..., 28)
    int x = (blockIdx.x * 8 + threadIdx.x) * 4, y = threadIdx.y * 4;

    // =================================================
    // STEP 3: Allocate Thread-Local Storage (VGPRs)
    // =================================================
    // Allocate 16 floats in registers to hold our 4×4 sub-matrix
    // This is stored in Vector General Purpose Registers (VGPRs) on the GPU
    vector_type<float, 16> thread_local_buf;

    // Create a view of the buffer as 4 vectors of 4 elements each (d4_t)
    // This enables vectorized memory access patterns
    // a[0] = 4 floats, a[1] = 4 floats, a[2] = 4 floats, a[3] = 4 floats
    auto& a = thread_local_buf.AsType<vector_type<float, 16>::d4_t>();

    // =================================================
    // STEP 4: Read Input Data (4×4 Sub-Matrix)
    // =================================================
    // Read 4 rows from the input matrix, each containing 4 consecutive elements
    // This is a compile-time loop that will be fully unrolled
    //
    // Memory access pattern:
    //   i=0: Read row x,   columns [y, y+1, y+2, y+3] → store in a[0]
    //   i=1: Read row x+1, columns [y, y+1, y+2, y+3] → store in a[1]
    //   i=2: Read row x+2, columns [y, y+1, y+2, y+3] → store in a[2]
    //   i=3: Read row x+3, columns [y, y+1, y+2, y+3] → store in a[3]
    //
    // The Get<d4_t> performs a vectorized read of 4 consecutive floats,
    // which is efficient for coalesced memory access (threads in a warp
    // access consecutive memory locations)
    static_for<0, 4, 1>{}([&](auto i){
        a(Number<i>{}) = buf.Get<vector_type<float, 16>::d4_t>(
            tensor_desc.CalculateOffset(Tuple<int, int>{x + i, y}), true);
    });

    // =================================================
    // STEP 5: In-Register Transpose
    // =================================================
    // Now we transpose the 4×4 matrix stored in registers
    // Create a view of the buffer as 16 individual floats (d1_t)
    auto& b = thread_local_buf.AsType<vector_type<float, 16>::d1_t>();

    // Perform in-place transpose by swapping elements across the diagonal
    // Algorithm: swap b[i*4 + j] ↔ b[j*4 + i] for all i > j
    //
    // Visual representation of swaps:
    //   [ 0  1  2  3]       [ 0  4  8 12]
    //   [ 4  5  6  7]  -->  [ 1  5  9 13]
    //   [ 8  9 10 11]       [ 2  6 10 14]
    //   [12 13 14 15]       [ 3  7 11 15]
    //
    // Swap sequence:
    //   i=1, j=0: Swap b[4] ↔ b[1]
    //   i=2, j=0: Swap b[8] ↔ b[2]
    //   i=2, j=1: Swap b[9] ↔ b[6]
    //   i=3, j=0: Swap b[12] ↔ b[3]
    //   i=3, j=1: Swap b[13] ↔ b[7]
    //   i=3, j=2: Swap b[14] ↔ b[11]
    static_for<0, 4, 1>{}([&](auto i){
        static_for<0, i, 1>{}([&](auto j){
            auto tmp = b(Number<i * 4 + j>{});
            b(Number<i * 4 + j>{}) = b(Number<j * 4 + i>{});
            b(Number<j * 4 + i>{}) = tmp;
        });
    });

    // =================================================
    // STEP 6: Write Output Data
    // =================================================
    // Write the transposed 4×4 sub-matrix to the output matrix
    //
    // CRITICAL: Note the coordinate swap!
    //   - Input read position: (x+i, y) - we read rows x through x+3,
    //     each starting at column y
    //   - Output write position: (y+i, x) - we write to the TRANSPOSED location
    //
    // This means:
    //   - Input element at (x+i, y+j) is written to output at (y+j, x+i)
    //   - What were columns in the input become rows in the output
    //
    // Memory write pattern (in the output's (K, M) coordinates):
    //   i=0: Write row y,   columns [x, x+1, x+2, x+3] from a[0]
    //   i=1: Write row y+1, columns [x, x+1, x+2, x+3] from a[1]
    //   i=2: Write row y+2, columns [x, x+1, x+2, x+3] from a[2]
    //   i=3: Write row y+3, columns [x, x+1, x+2, x+3] from a[3]
    //
    // The Set<d4_t> performs a vectorized write of 4 consecutive floats
    static_for<0, 4, 1>{}([&](auto i){
        ret_buf.Set<vector_type<float, 16>::d4_t>(
            ret_desc.CalculateOffset(Tuple<int, int>{y + i, x}), true, a(Number<i>{}));
    });
}

Notice the vector_type<float, 16>: it allocates a buffer of 16 floats in registers. The thread_local_buf.AsType method lets you view this buffer with different element shapes, and static_for performs compile-time loop unrolling.
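The in-register swap pattern from STEP 5 can be stated in plain C++ without CK's vector_type and static_for (a sketch; the kernel's compile-time unrolled version is equivalent):

```cpp
#include <utility>

// In-place 4×4 transpose: swap b[i*4 + j] with b[j*4 + i] for every
// pair below the diagonal, exactly the six swaps listed in the kernel.
inline void transpose4x4(float b[16]) {
    for (int i = 1; i < 4; ++i)
        for (int j = 0; j < i; ++j)
            std::swap(b[i * 4 + j], b[j * 4 + i]);
}
```

Because each element is swapped at most once, the loop bounds (j < i) are what make the transpose safely in-place.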

Performance Test#

For comparison, we implement matrix transpose in PyTorch and test the performance of both implementations with rocprofv3:

import torch

a = torch.randn([2560, 32], dtype=torch.float32).cuda()
b = a.transpose(0, 1).contiguous()

The PyTorch transpose kernel takes 8.4 μs, while our CK implementation takes 5.82 μs, a 44.3% throughput improvement (8.4 / 5.82 ≈ 1.44).

Summary#

In this blog, you learned the fundamentals of AMD GPU kernel programming using Composable Kernel (CK). Specifically, you explored:

  • TensorDescriptor: How CK uses a tree of hierarchical transforms to map logical multi-dimensional coordinates to physical memory addresses, providing a flexible and composable abstraction for complex tensor layouts.

  • Core transforms: The roles of Embed, Unmerge, Merge, and PassThrough transforms, and how they can be chained to build arbitrarily complex data layouts from simple building blocks.

  • Practical GPU kernel development: A complete matrix transpose implementation that leverages vectorized memory access, register-level computation, and compile-time loop unrolling for efficient execution on AMD GPUs.

The 4×4 per-thread transpose approach demonstrated here strikes a good balance between parallelism, memory efficiency, and register usage, making it an excellent template for similar GPU kernels.

In the next blog in this series, we will break down GEMM (General Matrix Multiply) into basic parts and understand them one by one, building on the TensorDescriptor foundation covered here. Stay tuned to continue your journey from beginner to expert in AMD GPU programming.

Disclaimers#

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.