RenderScript speedup 10x when forcing default CPU implementation


The original post has mRS.finish() commented out. I am wondering if that is the case here as well.

To benchmark RenderScript properly, we should wait for pending asynchronous operations to complete. There are generally two ways to do that:

  1. Use RenderScript.finish(). This works well when using debug.rs.default-CPU-driver 1, and it also works with most GPU drivers. However, certain GPU drivers treat it as a no-op.
  2. Use Allocation.copyTo() or a similar API to access the data of an Allocation, preferably the final output Allocation. This is really a trick, but it works on all devices. Just be aware that the copyTo operation itself takes some time, so make sure you account for it (see the sketch after this list).
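As a rough illustration of both approaches, here is a minimal timing sketch. The kernel name (forEach_convolve), the reflected script class (ScriptC_conv), and the allocation names are placeholders, not from the original post, and it assumes an F32 output allocation:

```java
import android.renderscript.Allocation;
import android.renderscript.RenderScript;
import android.util.Log;

// Times one kernel launch and blocks on finish()/copyTo() so that pending
// asynchronous work is included in the measurement.
static void benchmarkKernel(RenderScript rs, ScriptC_conv script,
                            Allocation inAlloc, Allocation outAlloc) {
    long start = System.nanoTime();

    script.forEach_convolve(inAlloc, outAlloc);  // hypothetical kernel name

    // Option 1: wait until all pending RS operations have completed.
    rs.finish();

    // Option 2: reading the result back also forces completion (and adds the
    // copy cost to the measurement, which may be what you want to measure).
    float[] result = new float[outAlloc.getBytesSize() / 4];  // assumes F32 data
    outAlloc.copyTo(result);

    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    Log.d("RSBench", "convolve + sync: " + elapsedMs + " ms");
}
```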

The 5 ms here seems suspicious, although it might be real depending on the actual algorithm. It is worth double-checking whether it still holds once you add finish() or copyTo().

That's very strange indeed. The fact that you're getting the same result across both devices and with two very different implementations of the conv layers suggests there is still something else going on with the benchmarking or timing itself, rather than differences with CPU/GPU execution, as things are rarely that conclusive.

I would suggest verifying that the output from each copyTo() is always the same. Set up a logcat dump of, say, the first (and last!) 10 values in the float array that comes back from each layer's output allocation, to make sure all implementations and execution modes are truly processing the data properly and equally at each layer.
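Something along these lines could do that dump; the tag and method name are just placeholders, and it again assumes F32 output data:

```java
import android.renderscript.Allocation;
import android.util.Log;

// Logs the first and last 10 values of a layer's output Allocation so that
// CPU and GPU runs can be compared for correctness, not just speed.
static void logLayerOutput(String layerName, Allocation outAlloc) {
    float[] out = new float[outAlloc.getBytesSize() / 4];  // assumes F32 data
    outAlloc.copyTo(out);

    StringBuilder sb = new StringBuilder(layerName).append(" head:");
    for (int i = 0; i < Math.min(10, out.length); i++) {
        sb.append(' ').append(out[i]);
    }
    sb.append(" | tail:");
    for (int i = Math.max(0, out.length - 10); i < out.length; i++) {
        sb.append(' ').append(out[i]);
    }
    Log.d("RSCheck", sb.toString());
}
```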

Depending on your setup, it is also possible that the data-copying overhead I mentioned before is overpowering the computation time itself, and what you're seeing is just an unfortunate effect of that, since copying data from one place or another can take more or less time. Try increasing the conv kernel sizes or count (with dummy/random values, just for testing's sake) to make the computation much more complex and thereby shift the compute vs. data-loading balance, and see how that affects your results.
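One quick way to do that, purely for testing, is to fill an oversized weight allocation with random values. The sizes, names, and the single-dimensional layout here are arbitrary assumptions, not taken from your code:

```java
import java.util.Random;
import android.renderscript.Allocation;
import android.renderscript.Element;
import android.renderscript.RenderScript;

// Builds a deliberately oversized convolution-weight Allocation filled with
// random values, so computation dominates data transfer during testing.
static Allocation makeDummyWeights(RenderScript rs, int kernelSize, int count) {
    int n = kernelSize * kernelSize * count;
    float[] weights = new float[n];
    Random rnd = new Random(42);  // fixed seed keeps runs comparable
    for (int i = 0; i < n; i++) {
        weights[i] = rnd.nextFloat() - 0.5f;
    }
    Allocation weightAlloc = Allocation.createSized(
            rs, Element.F32(rs), n, Allocation.USAGE_SCRIPT);
    weightAlloc.copyFrom(weights);
    return weightAlloc;
}
```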

If all else fails, it could just be that the GPU really is taking longer for some reason, though it can be hard to track down why. Some things to check: What data type and size are you using for the data? How are you loading/writing the data to the allocations? Are you already using #pragma rs_fp_relaxed to relax your float precision? What usage flags are you setting on the allocations (such as Allocation.USAGE_SCRIPT | Allocation.USAGE_GRAPHICS_TEXTURE)?
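For reference, a minimal sketch of creating an input allocation with explicit usage flags might look like the following; it assumes float4 2-D data, which may not match your layers:

```java
import android.renderscript.Allocation;
import android.renderscript.Element;
import android.renderscript.RenderScript;
import android.renderscript.Type;

// Creates a float4 2-D input Allocation with explicit usage flags.
// (#pragma rs_fp_relaxed would go at the top of the .rs script itself.)
static Allocation createInputAlloc(RenderScript rs, int width, int height) {
    Type type = new Type.Builder(rs, Element.F32_4(rs))
            .setX(width)
            .setY(height)
            .create();
    // USAGE_SCRIPT alone is usually enough; add USAGE_GRAPHICS_TEXTURE only
    // if the data is actually sampled as a texture.
    return Allocation.createTyped(rs, type,
            Allocation.USAGE_SCRIPT | Allocation.USAGE_GRAPHICS_TEXTURE);
}
```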

And as for your last question, detailed RS documentation on specific optimization matters is still very scarce unfortunately... I think just asking here on SO is still one of the best resources available for now :)
