I have implemented a small CNN in RenderScript and want to profile the performance on different hardware. On my Nexus 7 the times make sense, but on the NVIDIA Shield they don't.
Maybe a little bit off topic, but for a CNN, if you can structure your algorithm around matrix-matrix multiplication as the basic computing block, you can actually use RenderScript IntrinsicBLAS (ScriptIntrinsicBLAS), especially BNNM and SGEMM (a sketch of the conv-to-GEMM mapping follows the list below).
Pros:

- The intrinsics are implemented inside the RenderScript driver, so each SoC vendor can ship its own optimized (possibly GPU- or DSP-accelerated) implementation; you don't write any kernels yourself.
- BNNM is an 8-bit quantized GEMM, which is usually enough for CNN inference and much cheaper than float.

Cons:

- ScriptIntrinsicBLAS was only added in API 23 (Android 6.0); on older devices you have to go through the support library import shown below.
- It only covers the BLAS-shaped work; everything else in the network (im2col, activations, pooling) is still up to you, and the actual speed depends on the quality of the device's RenderScript driver.
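How a convolution layer becomes a matrix multiply: the usual trick is im2col, unrolling every input patch into a row so the whole layer turns into one big GEMM that can be handed to the SGEMM call shown below. Here is a rough sketch of the shape bookkeeping, with my own naming; nothing in it is RenderScript API:

// Hypothetical helper (my own, not part of RenderScript): the GEMM dimensions
// that an im2col'd convolution layer turns into.
//   patches:  (outH * outW) x (inChannels * kH * kW)
//   filters:  (inChannels * kH * kW) x outChannels
//   output C: (outH * outW) x outChannels   ->   C = patches * filters
static int[] convAsGemmDims(int inH, int inW, int inChannels,
                            int kH, int kW, int outChannels,
                            int stride, int pad) {
    int outH = (inH + 2 * pad - kH) / stride + 1;
    int outW = (inW + 2 * pad - kW) / stride + 1;
    int m = outH * outW;           // rows of C
    int k = inChannels * kH * kW;  // shared inner dimension
    int n = outChannels;           // columns of C
    return new int[] { m, n, k };
}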
It's worth trying if BLAS fits into your algorithm. And it is easy to use:
import android.support.v8.renderscript.*;
// If you are not using the support library:
// import android.renderscript.*;
private void runBNNM(int m, int n, int k, byte[] a_byte, byte[] b_byte, int c_offset, RenderScript mRS) {
    // BNNM works on 8-bit (U8) matrices and computes C = A * B.Transpose.
    Type.Builder builder = new Type.Builder(mRS, Element.U8(mRS));
    Type a_type = builder.setX(k).setY(m).create();  // A is m x k
    Type b_type = builder.setX(k).setY(n).create();  // B is n x k (it gets transposed)
    Type c_type = builder.setX(n).setY(m).create();  // C is m x n

    // If you are reusing the input Allocations, just create and cache them somewhere else.
    Allocation A = Allocation.createTyped(mRS, a_type);
    Allocation B = Allocation.createTyped(mRS, b_type);
    Allocation C = Allocation.createTyped(mRS, c_type);
    A.copyFrom(a_byte);
    B.copyFrom(b_byte);

    ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);

    // Quantization parameters for the U8 inputs and the U8 result.
    int a_offset = 0;
    int b_offset = 0;
    int c_multiplier = 1;

    // Computes: C = A * B.Transpose
    blas.BNNM(A, a_offset, B, b_offset, C, c_offset, c_multiplier);
}
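If you need the result back on the Java side, something like this at the end of runBNNM() works; the c_byte buffer is my own addition, sized to match c_type:

// Block until the asynchronous BNNM kernel has finished, then read back the U8 result.
byte[] c_byte = new byte[m * n];
mRS.finish();
C.copyTo(c_byte);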
SGEMM is similar:
// Construct the Allocations A, B, C somewhere (they must use Element.F32) and make sure
// the dimensions match: for C = A * B, A is m x k, B is k x n, C is m x n.
ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);

// Computes: C = 1.0f * A * B + 0.0f * C
float alpha = 1.0f;
float beta = 0.0f;
blas.SGEMM(ScriptIntrinsicBLAS.NO_TRANSPOSE, ScriptIntrinsicBLAS.NO_TRANSPOSE,
           alpha, A, B, beta, C);
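For completeness, here is one way those F32 Allocations could be built, assuming the same row-major convention as the BNNM example above (X = columns, Y = rows); the m, n, k names and the float[] buffers are mine, not part of the snippet above:

// Sketch only: build F32 Allocations for C = A * B with A (m x k), B (k x n), C (m x n).
Type.Builder f32 = new Type.Builder(mRS, Element.F32(mRS));
Type a_type = f32.setX(k).setY(m).create();
Type b_type = f32.setX(n).setY(k).create();
Type c_type = f32.setX(n).setY(m).create();

Allocation A = Allocation.createTyped(mRS, a_type);
Allocation B = Allocation.createTyped(mRS, b_type);
Allocation C = Allocation.createTyped(mRS, c_type);
A.copyFrom(a_float);  // float[] of length m * k, row-major
B.copyFrom(b_float);  // float[] of length k * n, row-major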