I have implemented a small CNN in RenderScript and want to profile the performance on different hardware. On my Nexus 7 the times make sense, but on the NVIDIA Shield they d
Maybe a little bit off topic: but for CNN, if you can structure your algorithm using matrix-matrix multiplication as basic computing blocks you can actually use RenderScript IntrinsicBLAS, especially BNNM and SGEMM.
Pros:
Cons:
It's worth trying if BLAS fits into your algorithm. And it is easy to use:
import android.support.v8.renderscript.*;
// if you are not using support lib:
// import android.renderscript.*;
private void runBNNM(int m, int n, int k, byte[] a_byte, byte[] b_byte, int c_offset, RenderScript mRS) {
Allocation A, B, C;
Type.Builder builder = new Type.Builder(mRS, Element.U8(mRS));
Type a_type = builder.setX(k).setY(m).create();
Type b_type = builder.setX(k).setY(n).create();
Type c_type = builder.setX(n).setY(m).create();
// If you are reusing the input Allocations, just create and cache them somewhere else.
A = Allocation.createTyped(mRS, a_type);
B = Allocation.createTyped(mRS, b_type);
C = Allocation.createTyped(mRS, c_type);
A.copyFrom(a_byte);
B.copyFrom(b_byte);
ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
// Computes: C = A * B.Transpose
int a_offset = 0;
int b_offset = 0;
int c_offset = 0;
int c_multiplier = 1;
blas.BNNM(A, a_offset, B, b_offset, C, c_offset, c_multiplier);
}
SGEMM is similar:
ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
// Construct the Allocations: A, B, C somewhere and make sure the dimensions match.
// Computes: C = 1.0f * A * B + 0.0f * C
float alpha = 1.0f;
float beta = 0.0f;
blas.SGEMM(ScriptIntrinsicBLAS.NO_TRANSPOSE, ScriptIntrinsicBLAS.NO_TRANSPOSE,
alpha, A, B, beta, C);
I've implemented CNNs in RenderScript myself, and as you explain, it does require chaining multiple processes and calling forEach_*()
various times for each layer if you implement them each as a different kernel. As such, I can assure you that the forEach call returning does not really guarantee that the process has completed. In theory, this will only schedule the kernel and all queued up requests will actually run whenever the system determines it's best to, especially if they get processed in the tablet's GPU.
Usually, the only way to make absolutely sure you have some kind of control over a kernel truly running is by explicitly reading the output of the RS kernel in between layers, such as by using .copyTo()
on the output allocation object of that kernel. This "forces" any queued up RS jobs that have not run yet (on which that layer's output allocation is dependent), to execute at that time. Granted, that may introduce data transfer overheads and your timing will not be fully accurate -- in fact, the execution time of the full network will quite surely be lower than the sum of the individual layers if timed in this manner. But as far as I know, it's the only reliable way to time individual kernels in a chain and it will give you some feedback to find out where bottlenecks are, and to better guide your optimization, if that's what you're after.