Calculate the sum of values in an array using RenderScript

Question

Hi, I am a newbie trying to code in RenderScript. I would like to know how I can compute the sum of the elements of an array using RenderScript. Is there a way I can pass the output back into the script for sequential addition? My problem statement is: Vector Sum

Description: Calculate the sum of values in an array.

Input: Integer array

Output: Integer

Any help would be much appreciated!

Answer 1:

I'm afraid this is a bit more complex than it seems, but I'll do my best to explain here a possible route that you can take to implement this.

What you are asking for is better known as the parallel reduction algorithm, which can implement either an array sum as in your case, or any other commutative + associative operator which, when applied iteratively over an array, will "reduce" it to a single number. Other examples are finding the maximum or minimum values of a large array. In CUDA and OpenCL, there is a well known pattern for this computation that is capable of making the best possible use of parallel threads, and if you google "CUDA reduction" for example, you'll get tons of useful info on this algorithm.

The way this is implemented is by repeatedly reducing the array in half, over and over again, until you end up with a single element. Each time you reduce it, each new element is the sum of two previous elements.

So for example, you start with a 16-element array. You run the algorithm once, and you end up with an 8-element array -- where each of these 8 elements is the sum of two numbers from the original array.

You run it again, and you end up with 4 elements -- where each of these is the sum of two numbers from the previous step. And so on...

You keep doing this, until you end up with only one number -- your total sum.
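To make the halving steps concrete, here is a minimal plain-Java sketch of the same idea, run sequentially on the CPU. It is only an illustration, not RenderScript code: the class and method names (PairwiseReduceSketch, reduceSum) are made up for this example, and it assumes the array length is a power of 2.

```java
final class PairwiseReduceSketch {
    // Sums a power-of-two-length array by repeated halving.
    // Hypothetical helper for illustration; not part of any RenderScript API.
    static int reduceSum(int[] input) {
        if (input.length == 0) {
            return 0;
        }
        if (Integer.bitCount(input.length) != 1) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        // Work on a copy so the caller's array is left untouched.
        int[] data = java.util.Arrays.copyOf(input, input.length);
        for (int stride = data.length / 2; stride > 0; stride /= 2) {
            // On a GPU, each iteration of this inner loop would be one thread.
            for (int x = 0; x < stride; x++) {
                data[x] += data[x + stride];
            }
        }
        return data[0];
    }
}
```

With the 16-element array 1, 2, ..., 16, the passes use strides 8, 4, 2 and 1, and reduceSum returns 136.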

An inefficient way of implementing this in RenderScript would be:

Java:

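int[] ints; // Your data is held here; its length is assumed to be a power of 2.

// Copy the input into an Allocation that the script can address directly.
Allocation data = Allocation.createSized(rs, Element.I32(rs), ints.length, Allocation.USAGE_SCRIPT);
data.copy1DRangeFrom(0, ints.length, ints);

ScriptC_Reduce script = new ScriptC_Reduce(rs);
script.bind_data(data);

// Each pass folds the upper half of the live range onto the lower half.
for (int stride = ints.length / 2; stride > 0; stride /= 2) {
    script.set_stride(stride);
    // The in/out Allocations only define the launch size; the kernel ignores their contents.
    script.forEach_root(data, data);
}

data.copyTo(ints);
int totalsum = ints[0];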

RenderScript:

#pragma version(1)
#pragma rs java_package_name(...[your package here]...)

int stride;
int * data;

void root(const int32_t *v_in, int32_t *v_out, uint32_t x) {
    if (x < stride) data[x] += data[x + stride];
}

If you've worked with RS before, you may notice a couple strange things:

Note that "v_in" and "v_out" in the RS kernel are not used at all, because they are restricted to reading and writing the data element corresponding to the current thread index, whereas the reduce algorithm needs to access data elements at other positions. Hence, there is an int array pointer "data", which is bound from Java to an Allocation with the same name, and that is what the kernel works on directly.
The kernel is called multiple times from a loop in Java, instead of doing that loop inside the kernel. This is because at each iteration, ALL the data from the previous step must already be in its expected position; otherwise, "data[x + stride]" will be out of sync. In RS, a kernel call locks, meaning nothing else is executed until the kernel has finished processing the entire data. This is similar to what __syncthreads() would do inside a CUDA kernel, if you're familiar with that.

I mentioned above, however, that this is an inefficient implementation. But it should point you in the right direction. To make it more efficient, you might need to split the data into smaller chunks to be computed separately, because as given here, this algorithm launches ints.length threads at every iteration step, and on very large arrays that will result in a lot of steps and a lot of idle threads at each step.

Furthermore, this assumes the length of your array is exactly a power of 2, so that multiple halvings will result in exactly one element. For other size arrays, you may need to 0-pad your array. And here again, when working on very large arrays, 0-padding will require a lot of wasted memory.

So to fix these issues, you may want to split your array into multiple chunks of, say, 64 elements each. That way, if your array length is not an exact multiple, padding the "last" chunk up to 64 will not require that much memory. Also, you will need fewer iteration steps (and fewer idle threads) to reduce 64 elements. Of course, 64 is a magic number I just made up. Try other powers of 2; you might see better results with other chunk sizes such as 16 or 32. I suspect performance vs. chunk size will be very hardware-dependent.
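As a rough illustration of the chunking idea, here is a plain-Java sketch that runs on the CPU. The names (ChunkedReduceSketch, chunkedSum, CHUNK) are invented for this example, and the last step, adding up the per-chunk partial sums with an ordinary loop, is my own assumption about how the pieces could be combined; the partial sums could just as well be reduced again in the same way.

```java
final class ChunkedReduceSketch {
    // Same made-up chunk size as in the text; any power of 2 works here.
    static final int CHUNK = 64;

    // Hypothetical helper: reduces each chunk by halving, then adds the partial sums.
    static int chunkedSum(int[] input) {
        int chunks = (input.length + CHUNK - 1) / CHUNK; // round up
        int[] partial = new int[chunks];
        int[] scratch = new int[CHUNK];
        for (int c = 0; c < chunks; c++) {
            int begin = c * CHUNK;
            int count = Math.min(CHUNK, input.length - begin);
            // Zero-fill first, so only the last chunk ever carries padding.
            java.util.Arrays.fill(scratch, 0);
            System.arraycopy(input, begin, scratch, 0, count);
            // Halving passes over one chunk: strides 32, 16, 8, 4, 2, 1.
            for (int stride = CHUNK / 2; stride > 0; stride /= 2) {
                for (int x = 0; x < stride; x++) {
                    scratch[x] += scratch[x + stride];
                }
            }
            partial[c] = scratch[0];
        }
        // Assumption: combine the partial sums sequentially.
        int total = 0;
        for (int p : partial) {
            total += p;
        }
        return total;
    }
}
```

Only the final chunk is padded, so the wasted space is at most CHUNK - 1 elements regardless of the array length.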

EDIT: This assumes that RenderScript can make use of a GPU driver on the device it's running on, so that it can actually launch a larger number of parallel threads. Otherwise, a CPU-only execution of a kernel like this would probably be even slower than processing the array linearly.

Answer 2:

Don't. Unless you have something fancier than 1 addition. Don't. The code will not be faster until you have at least 4 million integers in that array.

RenderScript: Entries:1 Total: 3 Time: 0.067ms
Simple Loop : Entries:1 Total: 3 Time: 0.001ms
RenderScript: Entries:2 Total: 97 Time: 0.614ms
Simple Loop : Entries:2 Total: 97 Time: 0.001ms
RenderScript: Entries:4 Total: 227 Time: 0.28ms
Simple Loop : Entries:4 Total: 227 Time: 0.002ms
RenderScript: Entries:8 Total: 320 Time: 0.445ms
Simple Loop : Entries:8 Total: 320 Time: 0.002ms
RenderScript: Entries:16 Total: 700 Time: 0.486ms
Simple Loop : Entries:16 Total: 700 Time: 0.002ms
RenderScript: Entries:32 Total: 1807 Time: 0.595ms
Simple Loop : Entries:32 Total: 1807 Time: 0.002ms
RenderScript: Entries:64 Total: 3218 Time: 0.624ms
Simple Loop : Entries:64 Total: 3218 Time: 0.002ms
RenderScript: Entries:128 Total: 6230 Time: 0.737ms
Simple Loop : Entries:128 Total: 6230 Time: 0.003ms
RenderScript: Entries:256 Total: 12968 Time: 0.769ms
Simple Loop : Entries:256 Total: 12968 Time: 0.005ms
RenderScript: Entries:512 Total: 26253 Time: 0.895ms
Simple Loop : Entries:512 Total: 26253 Time: 0.01ms
RenderScript: Entries:1024 Total: 52345 Time: 0.987001ms
Simple Loop : Entries:1024 Total: 52345 Time: 0.017ms
RenderScript: Entries:2048 Total: 100223 Time: 1.715ms
Simple Loop : Entries:2048 Total: 100223 Time: 0.034ms
RenderScript: Entries:4096 Total: 200375 Time: 1.213ms
Simple Loop : Entries:4096 Total: 200375 Time: 0.065ms
RenderScript: Entries:8192 Total: 403713 Time: 1.196ms
Simple Loop : Entries:8192 Total: 403713 Time: 0.163001ms
RenderScript: Entries:16384 Total: 812411 Time: 1.929ms
Simple Loop : Entries:16384 Total: 812411 Time: 0.41ms
RenderScript: Entries:32768 Total: 1620542 Time: 1.822ms
Simple Loop : Entries:32768 Total: 1620542 Time: 0.617ms
RenderScript: Entries:65536 Total: 3250733 Time: 5.955ms
Simple Loop : Entries:65536 Total: 3250733 Time: 1.384ms
RenderScript: Entries:131072 Total: 6478866 Time: 2.622ms
Simple Loop : Entries:131072 Total: 6478866 Time: 2.008ms
RenderScript: Entries:262144 Total: 12980832 Time: 3.979999ms
Simple Loop : Entries:262144 Total: 12980832 Time: 4.377001ms
RenderScript: Entries:524288 Total: 25956676 Time: 10.163ms
Simple Loop : Entries:524288 Total: 25956676 Time: 8.326ms
RenderScript: Entries:1048576 Total: 51897168 Time: 12.723001ms
Simple Loop : Entries:1048576 Total: 51897168 Time: 15.871999ms
RenderScript: Entries:2097152 Total: 103867356 Time: 32.229001ms
Simple Loop : Entries:2097152 Total: 103867356 Time: 31.367ms
RenderScript: Entries:4194304 Total: 207646704 Time: 61.628999ms
Simple Loop : Entries:4194304 Total: 207646704 Time: 63.378ms
RenderScript: Entries:8388608 Total: 415058480 Time: 103.734999ms
Simple Loop : Entries:8388608 Total: 415058480 Time: 140.088ms

These timings are tilted in RenderScript's favor throughout. They assume that all the allocations are done outside the main loop, and that the data array does not have to be copied back out (I simply called rs.finish() to ensure the RenderScript had finished).

#pragma version(1)
#pragma rs java_package_name(com.photoembroidery.tat.olsennoise)

int stride;
int * data;

void root(const int32_t *v_in, int32_t *v_out, uint32_t x) {
    data[x] += data[x + stride];
}

Note the launch options. The first pass trims the array down to a power of 2: it takes the entries that lie beyond the largest power of 2 not exceeding the array length and adds them into the last entries just below that boundary. After that, every pass simply halves a power-of-2 range.

//int[] array = //array of your data//;

        ScriptC_reduce script = new ScriptC_reduce(mRS);
        Allocation data = Allocation.createSized(mRS, Element.I32(mRS), array.length, Allocation.USAGE_SCRIPT);
        data.copy1DRangeFrom(0, array.length, array);
        script.bind_data(data);
        Script.LaunchOptions launchOptions = new Script.LaunchOptions();

        // Smallest power of 2 strictly greater than array.length.
        int smallest2ExpBiggerThanLength = 1;
        for (int length = array.length; length != 0; length >>= 1, smallest2ExpBiggerThanLength <<= 1);

        // First pass: fold the entries at [end, array.length) into [start, end).
        int end = smallest2ExpBiggerThanLength / 2;
        int start = smallest2ExpBiggerThanLength - array.length;
        if (start == end) {
            // The length is already a power of 2, so begin with a plain halving pass.
            start = 0;
            end = end/2;
        }

        while (end > 0) {
            launchOptions.setX(start, end);
            script.set_stride(end - start);
            // Launch only over [start, end), so the kernel needs no bounds check.
            script.forEach_root(data, data, launchOptions);
            end = end >> 1;
            start = 0;
        }
        data.copyTo(array);
        int total = array[0];
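If the start/end arithmetic is hard to follow, this plain-Java sketch runs the same passes on the CPU, with an ordinary inner loop standing in for each kernel launch. It is only a simulation for checking the index math; the class and method names (RangeReduceSketch, reduceAnyLength) are made up here and nothing in it touches RenderScript.

```java
final class RangeReduceSketch {
    // Hypothetical helper: sums an array of any length with the same launch ranges as above.
    static int reduceAnyLength(int[] input) {
        if (input.length == 0) {
            return 0;
        }
        int[] data = input.clone();
        int n = data.length;

        // Smallest power of 2 strictly greater than n.
        int pow2AboveN = 1;
        for (int length = n; length != 0; length >>= 1) {
            pow2AboveN <<= 1;
        }

        int end = pow2AboveN / 2;   // largest power of 2 that is <= n
        int start = pow2AboveN - n; // first index that receives a leftover entry
        if (start == end) {
            // n is already a power of 2: begin with a plain halving pass.
            start = 0;
            end /= 2;
        }

        while (end > 0) {
            int stride = end - start;
            // Stand-in for one launch restricted to x in [start, end).
            for (int x = start; x < end; x++) {
                data[x] += data[x + stride];
            }
            end >>= 1;
            start = 0;
        }
        return data[0];
    }
}
```

For a 5-element array the passes are: x in [3, 4) with stride 1, then x in [0, 2) with stride 2, then x in [0, 1) with stride 1. Every read lands at index end or above and every write lands below end, so the entries handled within one pass never overlap.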

The biggest inefficiency in the other answer is that it does not use launch options. You are far better off restricting the launch range from the get-go than checking inside the kernel whether the index is in range; with the check you lose about 4x in speed. A simple loop is going to be faster and universal, and it avoids running into RenderScript bugs. You need to be doing something harder than one addition per element to make this worthwhile.

Answer 3:

We actually are working on supporting "reduce" as a new kernel type for the next Android release. This would allow you to run an associative operation (like addition) on the cells of an Allocation, and get a single reduced result back. Code for this already exists in AOSP, but we are trying to make it more flexible/general. The current form already allows you to specify a 2 input -> 1 output kernel that can be applied across all cells.

In the meantime, you can approximate a reduce kernel by running just sequentially in an invokable and using rsGetElementAt_*() for walking the cells. It would be substantially faster than Java, where you are constantly paying for needless bounds checks in this case (and other overhead).

Source: https://stackoverflow.com/questions/21734275/calculate-the-sum-of-values-in-an-array-using-renderscript

Tags

arrays

renderscript