How to accumulate vectors in OpenCL?

问题

I have a set of operations running in a loop.

for(int i = 0; i < row; i++)
{
    sum += arr1[0] - arr2[0]
    sum += arr1[0] - arr2[0]
    sum += arr1[0] - arr2[0]
    sum += arr1[0] - arr2[0]

    arr1 += offset1;
    arr2 += offset2;
}

Now I'm trying to vectorize the operations like this

for(int i = 0; i < row; i++)
{
    convert_int4(vload4(0, arr1) - vload4(0, arr2));

    arr1 += offset1;
    arr2 += offset2;
}

But how do I accumulate the resulting vector in the scalar sum without using a loop?

I'm using OpenCL 2.0.

回答1:

The operation is called "reduction" and there seems to be some information on it here.

In OpenCL special functions seem to be implemented, one being work_group_reduce() that might aid you: link.

And a presentation including some code: link.

回答2:

For float2,float4 and similar, easiest version could be dot product. (conversion from int to float could be expensive)

float4 v1=(float4 )(1,2,3,4);
float4 v2=(float4 )(5,6,7,8);

float sum=dot(v1-v2,(float4)(1,1,1,1));

this is equal to

(v1.x-v2.x)*1 + (v1.y-v2.y)*1+(v1.z-v2.z)*1+(v1.w-v2.w)*1

and if there is any hardware support for it, leaving it to compiler's mercy should be okay. For larger vectors and especially arrays, J.H.Bonarius's answer is the way to go. Only CPU has such vertical sum operations as I know, GPU doesn't have this but for the sake of portability, dot product and work_group_reduce are easiest ways to achieve readability and even performance.

Dot product has extra multiplications so it may not be good always.

回答3:

I have found a solution which seems to be the closest way I could have expected to solve my problem.

uint sum = 0;
uint4 S;

for(int i = 0; i < row; i++)
{
    S += convert_uint4(vload4(0, arr1) - vload4(0, arr2));

    arr1 += offset1;
    arr2 += offset2;
}

S.s01 = S.s01 + S.s23;
sum = S.s0 + S.s1;

OpenCL 2.0 provides this functionality with vectors where the elements of the vectors can successively be replaced with the addition operation as shown above. This can support up to a vector of size 16. Larger operations can be split into factors of smaller operations. For example, for adding the absolute values of differences between two vectors of size 32, we can do the following:

uint sum = 0;
uint16 S0, S1;

for(int i = 0; i < row; i++)
{
    S0 += convert_uint16(abs(vload16(0, arr1) - vload16(0, arr2)));
    S1 += convert_uint16(abs(vload16(1, arr1) - vload16(1, arr2)));

    arr1 += offset1;
    arr2 += offset2;
}

S0 = S0 + S1;
S0.s01234567 = S0.s01234567 + S0.s89abcdef;
S0.s0123 = S0.s0123 + S0.s4567;
S0.s01 = S0.s01 + S0.s23;
sum = S0.s0 + S0.s1;

来源：https://stackoverflow.com/questions/42156691/how-to-accumulate-vectors-in-opencl

标签

opencl

opencl-c