C++17 parallel algorithms and HIPSTDPAR#

The C++17 standard added the concept of parallel algorithms to the pre-existing C++ Standard Library. The parallel versions of algorithms like std::transform maintain the same signature as the regular serial versions, except for the addition of an extra parameter specifying the execution policy to use. This flexibility allows users who are already using the C++ Standard Library algorithms to take advantage of multi-core architectures by introducing only minimal changes to their code.
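As a minimal illustration (not taken from the TSP solver discussed below), the only difference between the serial and the parallel invocation of std::transform is the execution policy argument:

#include <algorithm>
#include <execution>
#include <vector>

void scale(std::vector<double>& v)
{
  // serial version
  std::transform(v.begin(), v.end(), v.begin(),
                 [](double x) { return 2.0 * x; });

  // parallel version: same signature, plus an execution policy as the first argument
  std::transform(std::execution::par, v.begin(), v.end(), v.begin(),
                 [](double x) { return 2.0 * x; });
}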

Starting with ROCm 6.1, the parallel algorithms seamlessly offload to AMD accelerators via HIPSTDPAR, as long as the user is willing to add an extra compiler flag or two.

Whilst the functionality introduced by HIPSTDPAR is available for all AMD GPUs (including consumer cards), this blog post focuses on the AMD CDNA2™ and CDNA3™ architectures (MI200 and MI300 series cards, respectively) using ROCm 6.1. As a code example, we focus on the Travelling Salesman Problem (TSP) solver available here.

The travelling salesman problem#

The travelling salesman problem tries to answer the following question: “Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?”. This problem is particularly hard to solve (NP-hard) due to its exponential complexity; adding an extra city to the list causes an exponential growth in the number of combinations to check. Solving this problem by just enumerating all possible combinations and checking each one of them is computationally prohibitive for problems with more than 17 or 18 cities. For real-world applications, advanced methods are used (cutting planes and branch and bound techniques) but for the purposes of this blog we focus on an embarrassingly parallel implementation of the brute-force approach.

The TSP solver we look at relies on the following function to check the various permutations of cities and pick the one with the lowest cost/distance. Here is a detailed implementation that does not make use of any parallelism:

template<int N>
route_cost find_best_route(int const* distances)
{
  return std::transform_reduce(
    counting_iterator(0),
    counting_iterator(factorial(N)),
    route_cost(),
    [](route_cost x, route_cost y) { return x.cost < y.cost ? x : y; },
    [=](int64_t i) {
      int cost = 0;

      route_iterator<N> it(i);

      // first city visited
      int from = it.first();

      // visit all the other cities in the chosen route
      // and compute the cost
      while (!it.done())
      {
        int to = it.next();
        cost += distances[to + N*from];
        from = to;
      }

      // update best_route -> reduction
      return route_cost(i, cost);
    });
}

The std::transform_reduce algorithm performs two operations:

  1. a transformation (equivalent to a map operation) implemented by the lambda function passed as the final argument;

  2. a reduction operation, expressed by the lambda function passed as the fourth argument.

The function above runs through all the elements from 0 to N!, each of which encodes a particular permutation of the cities, computes the cost of the corresponding path, and returns an instance of the route_cost class that holds the id of that path and the cost associated with it. At the end, a reduction is performed by comparing the costs of the various paths and selecting the one with the lowest cost.
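For completeness, a route_cost type consistent with the snippet above might look like the following. This is an illustrative sketch: member names other than cost, and the factorial helper, are assumptions, and the actual definitions live in the solver's sources.

#include <cstdint>
#include <limits>

struct route_cost
{
  int64_t route = -1;                              // id of the permutation (name assumed)
  int     cost  = std::numeric_limits<int>::max(); // default value loses every comparison,
                                                   // making route_cost() a valid reduction identity
  route_cost() = default;
  route_cost(int64_t r, int c) : route(r), cost(c) {}
};

// N! as a 64-bit value, used to bound the iteration space (assumed helper)
constexpr int64_t factorial(int n)
{
  return n <= 1 ? 1 : n * factorial(n - 1);
}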

On an AMD Zen4 processor, this serial code takes about 11.52 seconds to compute the best path for a TSP instance involving twelve cities. The same code takes about 156 seconds for a TSP instance involving thirteen cities. This is a normal consequence of the exponential growth of the search space imposed by the TSP.

Execution policies and HIPSTDPAR#

Since each of the N! paths is independent, computing their individual costs is an embarrassingly parallel operation. C++17 allows developers to easily parallelize the previous code by just passing an execution policy as the first argument of the algorithm invocation. The C++17 standard defines three possible execution policies:

  • std::execution::sequenced_policy and the corresponding policy object to pass as argument std::execution::seq

  • std::execution::parallel_policy and the corresponding policy object to pass as argument std::execution::par

  • std::execution::parallel_unsequenced_policy and the corresponding policy object to pass as argument std::execution::par_unseq

Execution policies allow the user to convey information to the implementation about the invariants that user code shall enforce / maintain, thus allowing the latter to possibly adopt more favourable / performant execution.
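To make the three options concrete, here is a minimal sketch (not part of the TSP solver) invoking the same reduction under each policy; only the first argument changes:

#include <execution>
#include <numeric>
#include <vector>

double sum_seq(const std::vector<double>& v)
{
  // sequenced: all operations happen on the calling thread
  return std::reduce(std::execution::seq, v.begin(), v.end(), 0.0);
}

double sum_par(const std::vector<double>& v)
{
  // parallel: the implementation may spread the work across threads
  return std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
}

double sum_par_unseq(const std::vector<double>& v)
{
  // parallel unsequenced: may also vectorise and, with HIPSTDPAR, offload to the GPU
  return std::reduce(std::execution::par_unseq, v.begin(), v.end(), 0.0);
}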

std::execution::sequenced_policy#

The sequenced policy constrains the implementation to perform all operations on the thread that invoked the algorithm, inhibiting possible parallel execution. All operations are indeterminately sequenced within the caller thread, which implies subsequent invocations of the same algorithm, within the same thread, can have their operations sequenced differently.

std::execution::parallel_policy#

The parallel policy allows the implementation to employ parallel execution. Operations may be performed on the thread that invoked the algorithm or on threads created by the standard library implementation. All operations are indeterminately sequenced within a thread, for all threads used to perform the computation described by the algorithm invocation. Furthermore, there are no ordering guarantees provided for the element access function invocations themselves. Compared to the sequenced policy, additional constraints are imposed on the various components used by the algorithm. In particular, operations on iterators, values, and callable objects, as well as their transitive closure, must be data race free.
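As an illustration of the data-race-free requirement, consider the following sketch (unrelated to the TSP solver): the shared counter must be a std::atomic, otherwise the concurrent increments performed by the callable would constitute a data race:

#include <algorithm>
#include <atomic>
#include <execution>
#include <vector>

long count_even(const std::vector<int>& v)
{
  std::atomic<long> n{0};
  // without std::atomic, the unsynchronised increments from multiple threads would be a data race
  std::for_each(std::execution::par, v.begin(), v.end(),
                [&n](int x) { if (x % 2 == 0) ++n; });
  return n;
}

In practice, a dedicated reduction algorithm such as std::count_if invoked with the same policy side-steps the shared state altogether.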

In the previous example, it is possible to parallelize the find_best_route function by passing the std::execution::par policy as the first argument, as follows:

return std::transform_reduce(
  std::execution::par, // THE SIMPLE CHANGE
  counting_iterator(0),
  counting_iterator(factorial(N)),
  route_cost(),
  [](route_cost x, route_cost y) { return x.cost < y.cost ? x : y; },
  [=](int64_t i)

By making this simple change, the code will now run on all available CPU cores. On the CPU portion of a MI300A, equipped with 48 Zen4 logical cores, solving an instance of TSP with 12 cities takes about 0.34 seconds. This parallel run is almost 34x faster compared to the 11.52 seconds needed by the serial version! For an instance of TSP with thirteen cities the parallel version takes about 5 seconds. Finally, for a bigger problem involving fourteen cities, the 48 Zen4 logical cores take about 77 seconds.

std::execution::parallel_unsequenced_policy#

This is the most restrictive policy in terms of the requirements that user-provided code must meet. An algorithm invoked with this policy may perform the steps in unordered and unsequenced ways with respect to one another. This means that the various operations can be interleaved with each other on the same thread. Also, any given operation may start on one thread and end on a different one. When specifying the parallel unsequenced policy, the user guarantees that no operations that entail calling a function that synchronizes-with another function are employed. In practice, this means that user code does not do any memory allocation / deallocation, only relies on lock-free specializations of std::atomic, and does not rely on synchronization primitives such as std::mutex.
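To make the constraint concrete, the following sketch (again, not from the TSP solver) violates the parallel unsequenced requirements, because locking a std::mutex synchronizes-with the corresponding unlock:

#include <algorithm>
#include <execution>
#include <mutex>
#include <vector>

std::mutex m;
int hits = 0;

void not_allowed(const std::vector<int>& v)
{
  // invalid under par_unseq: the lock / unlock pair is a synchronisation operation
  std::for_each(std::execution::par_unseq, v.begin(), v.end(), [](int x) {
    std::lock_guard<std::mutex> guard(m);
    if (x == 42) ++hits;
  });
}

By contrast, a lock-free std::atomic<int> counter would satisfy the policy's requirements.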

This policy is currently the only one that can be chosen to offload parallelism to AMD accelerators. To trigger the GPU offload of all parallel algorithms invoked with the parallel unsequenced policy, the --hipstdpar flag must be passed at compile time. Furthermore, for GPU targets other than the current default (gfx906), the user must also pass --offload-arch= specifying which GPU is being used.
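As a concrete example, assuming the parallel algorithms version of the solver lives in a single tsp.cpp (file name used for illustration) and we target the GPU portion of an MI300A (gfx942), the compiler invocation might look as follows:

clang++ --hipstdpar --offload-arch=gfx942 tsp.cpp -o tsp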

On MI300A, by simply switching policy and recompiling with the aforementioned flags, the execution time for an instance of TSP with thirteen cities goes down to 0.5 seconds. When using fourteen cities, the use of the GPU portion of MI300A brings down the execution time to 4.8 seconds from the 77 seconds needed by the parallel version running on 48 Zen4 logical cores. And because everybody loves a good table, let us conclude this section by summarising the progression from sequenced execution on the CPU to parallel unsequenced execution offloaded to the accelerator:

14-city TSP        Timing (s)
-----------------  ----------
seq                2337
par                77
par_unseq on CPU   75
par_unseq on GPU   4.8

TeaLeaf#

A more complex example showing the use and performance of HIPSTDPAR is TeaLeaf. The code is a C++ implementation of the TeaLeaf heat conduction mini-app from the University of Bristol, UK. Multiple implementations illustrate various parallel programming paradigms, including HIP and parallelised standard algorithms. This allows us to make a fair performance comparison between an optimized, HIP-based implementation and a HIPSTDPAR one. For the purpose of this test, we selected the tea_bm_5.in benchmark, comprising a 2D grid of 4000x4000 cells and 10 time steps.

For the HIPSTDPAR version, running on an MI300A, the following output is obtained:

Timestep 10
CG:                    3679 iterations
Wallclock:             40.884s
Avg. time per cell:    2.555271e-06
Error:                 9.805532e-31

Checking results...
Expected 9.546235158221428e+01
Actual   9.546235158231138e+01
This run PASSED (Difference is within 0.00000000%)

As for the HIP version, it performs as follows:

Timestep 10
CG:                    3679 iterations
Wallclock:             34.286s
Avg. time per cell:    2.142853e-06
Error:                 9.962546e-31

Checking results...
Expected 9.546235158221428e+01
Actual   9.546235158231144e+01
This run PASSED (Difference is within 0.00000000%)

The performance difference between the two versions stems from the overhead associated with handling the initial page-in of non-resident memory. To “even things out”, the HIP version can be adjusted to use hipMallocManaged() as well, instead of hipMalloc(). This particular configuration is already available in the HIP version of TeaLeaf and it can be enabled by passing a simple flag at compile time. The following is the output for the HIP version of TeaLeaf when using hipMallocManaged() and XNACK for all GPU allocations.

Timestep 10
CG:                    3679 iterations
Wallclock:             39.573s
Avg. time per cell:    2.473331e-06
Error:                 9.962546e-31

Checking results...
Expected 9.546235158221428e+01
Actual   9.546235158231144e+01
This run PASSED (Difference is within 0.00000000%)

As expected, the performance of the HIP version when introducing hipMallocManaged() is comparable with the one observed for the HIPSTDPAR version. In closing, we will note that ongoing work is expected to reduce the overhead, thus bringing the offloaded version closer to the HIP one.
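For readers unfamiliar with the two allocation paths, the difference boils down to which HIP allocator backs the grid data. The sketch below is illustrative only, not TeaLeaf's actual allocation code:

#include <cstddef>
#include <hip/hip_runtime.h>

// Illustrative only: TeaLeaf's real allocation code differs.
double* allocate_field(std::size_t n, bool managed)
{
  void* p = nullptr;
  if (managed)
    // migratable allocation, accessible from CPU and GPU; relies on XNACK-handled page faults
    (void)hipMallocManaged(&p, n * sizeof(double));
  else
    // device-resident allocation, as used by the default HIP build
    (void)hipMalloc(&p, n * sizeof(double));
  return static_cast<double*>(p);
}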

Nuts and bolts of HIPSTDPAR#

The ability to offload C++ Standard Parallel algorithm execution to the GPU relies on the interaction between the LLVM compiler, HIPSTDPAR, and rocThrust. Starting from ROCm 6.1, when the --hipstdpar flag is passed, the LLVM compiler used to compile regular HIP code forwards invocations of standard algorithms that take the parallel_unsequenced_policy execution policy to the HIPSTDPAR header-only library. The header-only library is in charge of mapping the parallel algorithms of the C++ Standard Library onto the equivalent rocThrust algorithm invocations. This very simple design allows for a low overhead implementation of the offloading of parallel standard algorithms. A natural question to ask at this point is: “computation is nice but what about the memory it operates on?”. By default, HIPSTDPAR assumes that the underlying system is HMM (Heterogeneous Memory Management)-enabled, and that page migration is possible via the handling of retry-able page faults implemented atop XNACK (e.g., export HSA_XNACK=1). This particular mode is referred to as HMM Mode.
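Conceptually, and glossing over many details, the forwarding performed by the header-only library amounts to something like the sketch below; the actual HIPSTDPAR implementation is more involved, so treat this purely as an illustration of the idea:

#include <thrust/execution_policy.h>
#include <thrust/transform_reduce.h>

// Illustration only: a par_unseq std::transform_reduce is, in spirit,
// re-expressed as the equivalent rocThrust algorithm running on the device.
template<typename It, typename T, typename Reduce, typename Transform>
T forwarded_transform_reduce(It first, It last, T init, Reduce reduce, Transform transform)
{
  return thrust::transform_reduce(thrust::device, first, last, transform, init, reduce);
}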

When HMM support and XNACK handling are both available, code offloaded to the GPU (implemented via rocThrust) triggers the page migration mechanism and data automatically migrates from host to device. On MI300A, although physical migration is neither needed nor useful, handling page faults via XNACK is still necessary. For more details about page migration please refer to the following blog post.

On systems without HMM / XNACK we can still use HIPSTDPAR by passing an extra compilation flag: --hipstdpar-interpose-alloc. This flag instructs the compiler to replace all dynamic memory allocations with compatible managed memory allocations implemented in the HIPSTDPAR header-only library. For example, if the application being compiled, or one of its transitive inclusions, allocates free store memory via operator new, that call will be replaced with a call to __hipstdpar_operator_new. By looking at the implementation of that function in the HIPSTDPAR library we see that the actual allocation is performed via the hipMallocManaged() function. By doing so on a non-HMM-enabled system, host memory is pinned and directly accessible by the GPU without requiring any page-fault driven migration to the GPU memory. This particular mode is referred to as Interposition Mode.
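To give a flavour of what interposition means in practice, the replacement allocation function behaves, in spirit, like the sketch below (a simplification for illustration; the actual __hipstdpar_operator_new lives in the HIPSTDPAR headers):

#include <cstddef>
#include <new>
#include <hip/hip_runtime.h>

// Simplified illustration: satisfy the allocation from managed memory,
// so that the GPU can access it without HMM-driven page migration.
void* interposed_operator_new(std::size_t n)
{
  void* p = nullptr;
  if (hipMallocManaged(&p, n) != hipSuccess)
    throw std::bad_alloc{};
  return p;
}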

Restrictions#

For both HMM and Interposition modes, the following restrictions apply:

  1. Pointers to functions, and all associated features, such as dynamic polymorphism, cannot be used (directly or transitively) by the user provided callable passed to an algorithm invocation;

  2. Global / namespace scope / static / thread storage duration variables cannot be used (directly or transitively) in name by the user provided callable;

    • When executing in HMM Mode they can be used in address e.g.:

      namespace { int foo = 42; }
      
      // uses foo by name inside the callable: not allowed in either mode
      bool never(const vector<int>& v) {
        return any_of(execution::par_unseq, cbegin(v), cend(v), [](auto&& x) {
          return x == foo;
        });
      }
      
      // captures the address of foo instead: valid, but only in HMM Mode
      bool only_in_hmm_mode(const vector<int>& v) {
        return any_of(execution::par_unseq, cbegin(v), cend(v),
                      [p = &foo](auto&& x) { return x == *p; });
      }
      
  3. Only algorithms that are invoked with the parallel_unsequenced_policy are candidates for offload;

  4. Only algorithms that are invoked with iterator arguments that model random_access_iterator are candidates for offload;

  5. Exceptions cannot be used by the user provided callable;

  6. Dynamic memory allocation (e.g. operator new) cannot be used by the user provided callable;

  7. Selective offload is not possible i.e. it is not possible to indicate that only some algorithms invoked with the parallel_unsequenced_policy are to be executed on the accelerator.

In addition to the above, using Interposition Mode imposes the following additional restrictions:

  1. All code that is expected to interoperate has to be recompiled with the --hipstdpar-interpose-alloc flag i.e. it is not safe to compose libraries that have been independently compiled;

  2. automatic storage duration (i.e. stack allocated) variables cannot be used (directly or transitively) by the user provided callable e.g.

    // captures the address of the stack-allocated parameter n: not allowed in Interposition Mode
    bool never(const vector<int>& v, int n) {
      return any_of(execution::par_unseq, cbegin(v), cend(v),
                    [p = &n](auto&& x) { return x == *p; });
    }
    

But why?#

After what has been something of a whirlwind tour, it is not unreasonable to ask “but how does this benefit me, the C++ developer?”. The goal of HIPSTDPAR is to allow any C++ developer that is employing standard algorithms to leverage GPU acceleration with no cognitive overload. The application developer can remain firmly planted in the Standard C++ world, without having to step into the brave new world of GPU specific languages such as HIP or SYCL. Fortunately for us, our particular example allows for some limited, quantitative insight into just how close we got to this goal. The TSP solver's author has implemented it via multiple programming interfaces, which means that we can use the cloc tool to count the lines of code needed by each tsp.cpp implementation:

Programming Interface             LoC
--------------------------------  ---
Kokkos                            145
OpenACC                           142
OpenMP                            116
Standard C++ Serial               112
Standard C++ Parallel Algorithms  107
SYCL                              169

It is apparent that using compiler flag driven offload, as enabled by HIPSTDPAR, saves a considerable amount of typing - up to 57% versus SYCL, for example. This enables a more natural journey towards GPU accelerated execution. As a result, the programmer can focus on the algorithm / problem solving, at least initially, and discover generic algorithmic optimisations that are profitable for the GPU, without having to dive head-first into GPU “arcana”.

TL;DR, just tell me how to go fast#

Initially, HIPSTDPAR is officially supported on Linux, with Windows support forthcoming at a future date. Starting from an environment that has been set up for ROCm, using the package manager to install the hipstdpar package will, in general, bring in all the required goodness. Additionally, at the time of writing, a dependency on TBB exists, as a consequence of standard library implementation details (see e.g. Note 3). Therefore, it is necessary to install the system’s TBB package (e.g. libtbb-dev on Ubuntu). Armed thusly, and assuming that we have a main.cpp file which uses some standard algorithms to solve a given problem, the compiler driver invocation:

clang++ --hipstdpar main.cpp -o main

transparently offloads all algorithm invocations that use the std::execution::parallel_unsequenced_policy execution policy, if we are targeting a GPU compatible with the gfx906 ISA (i.e. Vega20). Otherwise, we also have to specify the target for offload:

clang++ --hipstdpar --offload-arch=gfx90a main.cpp -o main
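On HMM-capable systems, remember that HMM Mode relies on XNACK-backed page faults (see above); if XNACK is not enabled by default, exporting the corresponding environment variable when launching the binary suffices:

HSA_XNACK=1 ./main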

Conclusion#

In this post, we provided a high level overview of the ROCm support for offloading C++ Standard Parallel Algorithms, aiming to show how existing C++ developers can leverage GPU acceleration without having to adopt any new, GPU specific, language (e.g., HIP) or directives (e.g., OpenMP).

We believe that this standard, extremely accessible, way of exploiting hardware parallelism will be particularly beneficial for applications targeting MI300A accelerators, where the CPU and the GPU share the same pool of HBM. Although not demonstrated today, the combination of the APU architecture and HIPSTDPAR can enable fine-grained cooperation between CPU and GPU, which become true peers, accessible via a uniform programming interface.

For an in-depth look at the compiler side of HIPSTDPAR support, the interested reader should peruse the associated AMD-LLVM documentation.

The authors would like to thank Bob Robey and Justin Chang for their helpful reviews. If you have any questions please reach out to us on GitHub Discussions.