What is the optimum OpenCL 2 kernel to sum floats?

问题

C++ 17 introduced a number of new algorithms to support parallel execution, in particular std::reduce is a parallel version of std::accumulate which permits non-deterministic behaviour for non-commutative operations, such as floating point addition. I want to implement a reduce algorithm using OpenCL 2.

Intel have an example here which uses OpenCL 2 work group kernel functions to implement a std::exclusive_scan OpenCL 2 kernel. Below is kernel to sum floats, based on Intel's exclusive_scan example:

kernel void sum_float (global float* sum, global float* values)
{
  float sum_val = 0.0f;

  for (size_t i = 0u; i < get_num_groups(0); ++i)
  {
    size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
    float value = work_group_reduce_add(values[index]);
    sum_val += work_group_broadcast(value, 0u);
  }

  sum[0] = sum_val;
}

The kernel above works (or seems to!). However, exclusive_scan required the work_group_broadcast function to pass the last value of one work group to the next, whereas this kernel only requires the result of work_group_reduce_add to be added to sum_val, so an atomic add is more appropriate.

OpenCL 2 provides an atomic_int which supports atomic_fetch_add. An integer version of the kernel above using atomic_int is:

kernel void sum_int (global int* sum, global int* values)
{
  atomic_int sum_val;
  atomic_init(&sum_val, 0);

  for (size_t i = 0u; i < get_num_groups(0); ++i)
  {
    size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
    int value = work_group_reduce_add(values[index]);
    atomic_fetch_add(&sum_val, value);
  }

  sum[0] = atomic_load(&sum_val);
}

OpenCL 2 also provides an atomic_float but it doesn't support atomic_fetch_add.

What is the best way to implement an OpenCL2 kernel to sum floats?

回答1:

kernel void sum_float (global float* sum, global float* values)
{
  float sum_val = 0.0f;

  for (size_t i = 0u; i < get_num_groups(0); ++i)
  {
    size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
    float value = work_group_reduce_add(values[index]);
    sum_val += work_group_broadcast(value, 0u);
  }

  sum[0] = sum_val;
}

this has a race condition to write data to sum's zero-indexed element, all workgroups are doing same computation which makes this O(N*N) instead of O(N) and takes more than 1100 milliseconds to complete a 1M-element array sum.

For same 1-M element array, this(global=1M, local=256)

kernel void sum_float2 (global float* sum, global float* values)
{
      float sum_partial = work_group_reduce_add(values[get_global_id(0)]);
      if(get_local_id(0)==0)
        sum[get_group_id(0)] = sum_partial; 
}

followed by this (global=4k, local=256)

kernel void sum_float3 (global float* sum, global float* values)
{
  float sum_partial = work_group_reduce_add(sum[get_global_id(0)]);
  if(get_local_id(0)==0)
    values[get_group_id(0)] = sum_partial; 
}

does the same thing in a few miliseconds except a third step. First one gets each group sums into their group-id related item and second kernel sums those into 16 values and these 16 values can easily summed by CPU(microseconds or less)(as third step).

Program works like this:

values: 1.0 1.0 .... 1.0 1.0 
sum_float2
sum: 256.0 256.0 256.0
sum_float3
values: 65536.0 65536.0 .... 16 items total to be summed by cpu

if you need to use atomics, you should do it as sparsely as possible. Easiest example can be using local atomics to sum many values by each group and then doing last step using a single global atomic function per group to add all. I don't have a C++ setup ready for OpenCL for now, but I guess OpenCL 2.0 atomics are better when you are using multiple devices with same memory resource(probably streaming mode or in SVM) and/or a CPU using C++17 functions. If you don't have multiple devices computing on same area at same time, then I suppose that these new atomics can only be a micro-optimization on top of already working OpenCL 1.2 atomics. I didn't use these new atomics so take all these as a grain of salt.

来源：https://stackoverflow.com/questions/46861492/what-is-the-optimum-opencl-2-kernel-to-sum-floats

标签

c++

opencl

gpgpu

c++17

sycl