How to do correct timing of Android RenderScript code on Nvidia Shield

长情又很酷 2021-01-12 10:50

I have implemented a small CNN in RenderScript and want to profile the performance on different hardware. On my Nexus 7 the times make sense, but on the NVIDIA Shield they d

2 answers
  •  北恋 (original poster)
    2021-01-12 11:15

    This may be slightly off topic, but for a CNN: if you can structure your algorithm around matrix-matrix multiplication as the basic computing block, you can use the RenderScript ScriptIntrinsicBLAS, in particular BNNM and SGEMM.
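
    One common way to express a convolution as a matrix multiplication is to unroll every input patch into a row of a matrix ("im2col"). The following is only a minimal plain-Java sketch under simplifying assumptions that are not from the original post: a single-channel image, a square kernel, stride 1 and no padding. The class and method names are hypothetical.

    ```java
    final class Im2ColSketch {
        /**
         * Unrolls every kernel x kernel patch of a single-channel, row-major
         * height x width image (stride 1, no padding) into one row of a
         * row-major matrix with (outH * outW) rows and (kernel * kernel) columns.
         */
        static float[] im2col(float[] image, int height, int width, int kernel) {
            int outH = height - kernel + 1;
            int outW = width - kernel + 1;
            float[] patches = new float[outH * outW * kernel * kernel];
            int dst = 0;
            for (int y = 0; y < outH; y++) {
                for (int x = 0; x < outW; x++) {
                    // Copy the patch whose top-left corner is (x, y).
                    for (int ky = 0; ky < kernel; ky++) {
                        for (int kx = 0; kx < kernel; kx++) {
                            patches[dst++] = image[(y + ky) * width + (x + kx)];
                        }
                    }
                }
            }
            return patches;
        }
    }
    ```

    If the filters are stored one per row (n rows of kernel * kernel values), the convolution output is the patch matrix multiplied by the transposed filter matrix, which has the same C = A * B.Transpose shape that BNNM computes.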

    Pros:

    1. High-performance implementation of 8-bit matrix multiplication (BNNM), available in the N Preview.
    2. Backward compatible down to Android 2.3 through the RenderScript Support Library, when using Build-Tools 24.0.0 rc3 or above.
    3. High-performance GPU acceleration of SGEMM on the Nexus 5X and 6P with N Preview build NPC91K.
    4. If you only use RenderScript intrinsics, you can write everything in Java.

    Cons:

    1. Your algorithm may need to be refactored so that it is based on 2D matrix multiplication.
    2. BNNM is available in Android 6.0, but its performance there is not satisfactory, so it is better to use the Support Library for BNNM and set targetSdkVersion to 24.
    3. SGEMM GPU acceleration is currently only available on the Nexus 5X and Nexus 6P, and it currently requires the width and height of the matrices to be multiples of 8.
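
    For the multiple-of-8 requirement, one possible workaround is to zero-pad the matrices before copying them into the Allocations; zero rows and columns do not change the values in the original region of the product. This is a plain-Java sketch with hypothetical names, and whether the padded shapes actually take the GPU path on a given device is an assumption that would need to be measured.

    ```java
    final class PadToMultipleOf8 {
        /** Rounds a non-negative dimension up to the next multiple of 8. */
        static int roundUp8(int n) {
            return (n + 7) / 8 * 8;
        }

        /**
         * Copies a row-major rows x cols matrix into a zero-filled row-major
         * matrix whose dimensions are both rounded up to a multiple of 8.
         */
        static float[] padMatrix(float[] src, int rows, int cols) {
            int paddedRows = roundUp8(rows);
            int paddedCols = roundUp8(cols);
            float[] dst = new float[paddedRows * paddedCols]; // zero-initialized by Java
            for (int r = 0; r < rows; r++) {
                System.arraycopy(src, r * cols, dst, r * paddedCols, cols);
            }
            return dst;
        }
    }
    ```

    The Types for the Allocations would then be built with the padded dimensions, and only the top-left rows x cols region of the result is meaningful.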

    It's worth trying if BLAS fits your algorithm, and it is easy to use:

        import android.support.v8.renderscript.*;
        // if you are not using support lib:
        // import android.renderscript.*;
    
        private void runBNNM(int m, int n, int k, byte[] a_byte, byte[] b_byte, int c_offset, RenderScript mRS) {
            Allocation A, B, C;
            Type.Builder builder = new Type.Builder(mRS, Element.U8(mRS));
            Type a_type = builder.setX(k).setY(m).create();
            Type b_type = builder.setX(k).setY(n).create();
            Type c_type = builder.setX(n).setY(m).create();
    
            // If you are reusing the input Allocations, just create and cache them somewhere else.
            A = Allocation.createTyped(mRS, a_type);
            B = Allocation.createTyped(mRS, b_type);
            C = Allocation.createTyped(mRS, c_type);
            A.copyFrom(a_byte);
            B.copyFrom(b_byte);
    
            ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
            // Computes: C = A * B.Transpose
            // c_offset is the method parameter; redeclaring it here would not compile.
            int a_offset = 0;
            int b_offset = 0;
            int c_multiplier = 1;
            blas.BNNM(A, a_offset, B, b_offset, C, c_offset, c_multiplier);
        }
    

    SGEMM is similar:

            ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
            // Construct the Allocations: A, B, C somewhere and make sure the dimensions match.
            // Computes: C = 1.0f * A * B + 0.0f * C
            float alpha = 1.0f;
            float beta = 0.0f;
            blas.SGEMM(ScriptIntrinsicBLAS.NO_TRANSPOSE, ScriptIntrinsicBLAS.NO_TRANSPOSE,
                       alpha, A, B, beta, C);
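
    When profiling on different devices it can help to check that each device also produces the right numbers. A minimal plain-Java reference for the same formula, C = alpha * A * B + beta * C with no transposes, might look like the sketch below. The class and method names are hypothetical, and the row-major float[] layout is an assumption about how the data was copied into the Allocations.

    ```java
    final class SgemmReference {
        /**
         * Computes C = alpha * A * B + beta * C in place.
         * A is m x k, B is k x n, C is m x n, all row-major.
         */
        static void sgemm(float alpha, float[] a, float[] b, float beta, float[] c,
                          int m, int n, int k) {
            for (int i = 0; i < m; i++) {
                for (int j = 0; j < n; j++) {
                    float sum = 0f;
                    for (int p = 0; p < k; p++) {
                        sum += a[i * k + p] * b[p * n + j];
                    }
                    // With beta == 0 the old contents of C are ignored entirely.
                    c[i * n + j] = (beta == 0f) ? alpha * sum : alpha * sum + beta * c[i * n + j];
                }
            }
        }

        /** Largest absolute element-wise difference between two equally sized arrays. */
        static float maxAbsDiff(float[] x, float[] y) {
            float max = 0f;
            for (int i = 0; i < x.length; i++) {
                max = Math.max(max, Math.abs(x[i] - y[i]));
            }
            return max;
        }
    }
    ```

    The intrinsic's result can be read back with C.copyTo(float[]) and compared against the reference using maxAbsDiff and a small tolerance, since GPU and CPU paths may round differently.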
    
