RenderScript speedup 10x when forcing default CPU implementation


Question


I have implemented a CNN in RenderScript, described in a previous question which spawned this one. Basically, when running

adb shell setprop debug.rs.default-CPU-driver 1

there is a 10x speedup on both the NVIDIA Shield and the Nexus 7. The average computation time drops from around 50 ms to 5 ms, and the test app goes from around 50 fps to 130 or more. There are two convolution algorithms:

(1) moving kernel
(2) im2col and GEMM from ScriptIntrinsicBLAS.

Both see a similar speedup. The question is: why is this happening, and can this effect be triggered from code in a predictable way? And is detailed information about this available somewhere?
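
For reference, the im2col + GEMM path boils down to a single SGEMM call on ScriptIntrinsicBLAS. A minimal sketch follows; the allocation names and shapes (weights, colBuf, output) are placeholders, not my actual layer code:

// weights, colBuf and output are placeholder 2D float (Element.F32) Allocations
// created once at init time: weights is filters x (kH*kW*inChannels),
// colBuf is (kH*kW*inChannels) x outPixels, output is filters x outPixels.
ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
blas.SGEMM(ScriptIntrinsicBLAS.NO_TRANSPOSE, ScriptIntrinsicBLAS.NO_TRANSPOSE,
           1.0f, weights, colBuf,   // output = 1.0 * weights * colBuf + 0.0 * output
           0.0f, output);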

Edit:

As per the suggestions below, I verified the use of finish() and copyTo(); here is a breakdown of the timing code. The speedup reported is measured AFTER the call to copyTo(), but without finish(). Uncommenting finish() adds about 1 ms to the measured time.

double forwardTime = 0;
long t = System.currentTimeMillis();
//double t = SystemClock.elapsedRealtime(); // makes no difference
for (Layer a : layers) {
    blob = a.forward(blob);
}
mRS.finish();   // adds about 1ms to measured time 

blob.copyTo(outbuf);
forwardTime = System.currentTimeMillis() - t;

Maybe this is unrelated, but on the NVIDIA Shield I get an error message at startup, which disappears when running with adb shell setprop debug.rs.default-CPU-driver 1:

E/Renderscript: rsAssert failed: 0, in vendor/nvidia/tegra/compute/rs/driver/nv/rsdNvBcc.cpp

I'm setting compileSdkVersion, minSdkVersion and targetSdkVersion to 23 right now, with buildToolsVersion "23.0.2". The tablets are auto-updated to the very latest Android version. I'm not sure about the minimum target I need to set while still having ScriptIntrinsicBLAS available.

I'm using #pragma rs_fp_relaxed in all scripts, and the Allocations all use default flags.
This question has a similar situation, but it turned out the OP there was creating new Script objects every computation round. I do nothing of the sort; all Scripts and Allocations are created at init time.


Answer 1:


The original post has mRS.finish() commented out. I am wondering whether that is the case here.

To benchmark RenderScript properly, we should wait for pending asynchronous operations to complete. There are generally two ways to do that:

  1. Use RenderScript.finish(). This works well when using debug.rs.default-CPU-driver 1, and it also works with most GPU drivers. However, certain GPU drivers treat it as a no-op.
  2. Use Allocation.copyTo() or other similar APIs to access the data of an Allocation, preferably the final output Allocation. This is really a workaround, but it works on all devices. Just be aware that the copyTo() operation itself may take some time; make sure you take that into account (see the sketch after this list).
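
A minimal sketch of such a benchmark, reusing the layers/blob/outbuf/mRS names from the question and timing the copy separately so its cost is visible:

long t0 = System.currentTimeMillis();
for (Layer a : layers) {
    blob = a.forward(blob);      // queues asynchronous RenderScript work
}
mRS.finish();                    // (1) wait for pending kernels; a no-op on some GPU drivers
long computeMs = System.currentTimeMillis() - t0;

long t1 = System.currentTimeMillis();
blob.copyTo(outbuf);             // (2) reading the result forces completion on every device
long copyMs = System.currentTimeMillis() - t1;

Log.d("CNNBench", "compute: " + computeMs + " ms, copyTo: " + copyMs + " ms");

If finish() really is a no-op on a given driver, the remaining work simply shows up in copyMs, so the sum of the two numbers is the figure to compare across devices.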

5 ms here seems suspicious; it might be real, depending on the actual algorithm, but it is worth double-checking whether it still holds once you add finish() or copyTo().




Answer 2:


That's very strange indeed. The fact that you're getting the same result on both devices and with two very different implementations of the conv layers suggests there is still something else going on with the benchmarking or timing itself, rather than a genuine difference between CPU and GPU execution, as things are rarely that clear-cut.

I would suggest verifying that the output from copyTo() is always the same. Set up a logcat dump of, say, the first (and last!) 10 values in the float array that comes back from each layer's output allocation, to make sure all implementations and execution modes are truly processing the data properly and equally at each layer.
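
A minimal sketch of that kind of check; the helper name and its arguments are hypothetical, and it assumes android.renderscript.Allocation and android.util.Log are imported:

// Dump the first and last few values of a float output Allocation to logcat.
void dumpAllocation(String tag, Allocation layerOut, int count) {
    float[] vals = new float[count];
    layerOut.copyTo(vals);           // blocks until the layer's result is available
    int n = Math.min(10, count);
    StringBuilder head = new StringBuilder(), tail = new StringBuilder();
    for (int i = 0; i < n; i++) {
        head.append(vals[i]).append(' ');
        tail.append(vals[count - n + i]).append(' ');
    }
    Log.d(tag, "first " + n + ": " + head + "| last " + n + ": " + tail);
}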

Depending on your setup, it's also possible that the data-copying overhead I mentioned before is overpowering the computation time itself, and what you're seeing is just an unfortunate effect of that, since copying data from one place or another can take more or less time. Try increasing the conv kernel sizes or count (with dummy/random values, just for testing's sake) to make the computation much heavier and thereby shift the balance between compute time and data-loading time, and see how that affects your results.

If all else fails, it could just be that the GPU really is taking longer for some reason, though it can be hard to track down why. Some things to check: What data type and size are you using for the data? How are you loading/writing the data into the allocations? Are you already using #pragma rs_fp_relaxed to relax your float precision? What flags are you setting for the allocation usage (such as Allocation.USAGE_SCRIPT | Allocation.USAGE_GRAPHICS_TEXTURE)?
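
For comparison, a minimal sketch of an explicit Allocation setup; the width/height values are placeholders and the usage flag shown is just the plain script-only default:

// Placeholder layer buffer: 2D float Allocation used only from scripts.
Type.Builder tb = new Type.Builder(mRS, Element.F32(mRS)).setX(width).setY(height);
Allocation layerOut = Allocation.createTyped(mRS, tb.create(),
        Allocation.USAGE_SCRIPT);   // add further usage flags only if you actually need them

Extra usage flags can change where and how the driver places the buffer, which in turn can affect copy cost.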

And as for your last question, detailed RS documentation on specific optimization matters is still very scarce unfortunately... I think just asking here on SO is still one of the best resources available for now :)



Source: https://stackoverflow.com/questions/37228427/renderscript-speedup-10x-when-forcing-default-cpu-implementation
