How to do correct timing of Android RenderScript code on Nvidia Shield


I have implemented a small CNN in RenderScript and want to profile its performance on different hardware. On my Nexus 7 the times make sense, but on the NVIDIA Shield they don't.

2 Answers
  • 2021-01-12 11:15

    Maybe a little off topic, but for CNNs: if you can structure your algorithm to use matrix-matrix multiplication as its basic computing block, you can actually use RenderScript IntrinsicBLAS, especially BNNM and SGEMM.

    Pros:

    1. High-performance implementation of 8-bit matrix multiplication (BNNM), available in the N Preview.
    2. Support back to Android 2.3 through the RenderScript support library, when using Build-Tools 24.0.0 rc3 and above.
    3. High-performance GPU acceleration of SGEMM on the Nexus 5X and 6P with the N Preview build NPC91K.
    4. If you only use RenderScript intrinsics, you can code everything in Java.

    Cons:

    1. Your algorithm may need to be refactored so that it is based on 2D matrix multiplication.
    2. Though BNNM is available in Android 6.0, its performance there is not satisfactory, so it is better to use the support library for BNNM and set targetSdkVersion to 24.
    3. SGEMM GPU acceleration is currently only available on the Nexus 5X and Nexus 6P, and it currently requires the width and height of the matrices to be multiples of 8.

    It's worth trying if BLAS fits into your algorithm. And it is easy to use:

        import android.support.v8.renderscript.*;
        // if you are not using support lib:
        // import android.renderscript.*;
    
        private void runBNNM(int m, int n, int k, byte[] a_byte, byte[] b_byte, int c_offset, RenderScript mRS) {
            Allocation A, B, C;
            Type.Builder builder = new Type.Builder(mRS, Element.U8(mRS));
            Type a_type = builder.setX(k).setY(m).create();
            Type b_type = builder.setX(k).setY(n).create();
            Type c_type = builder.setX(n).setY(m).create();
    
            // If you are reusing the input Allocations, just create and cache them somewhere else.
            A = Allocation.createTyped(mRS, a_type);
            B = Allocation.createTyped(mRS, b_type);
            C = Allocation.createTyped(mRS, c_type);
            A.copyFrom(a_byte);
            B.copyFrom(b_byte);
    
            ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
            // Computes: C = A * B.Transpose
            int a_offset = 0;
            int b_offset = 0;
            // c_offset comes in as a method parameter, so it is not redeclared here.
            int c_multiplier = 1;
            blas.BNNM(A, a_offset, B, b_offset, C, c_offset, c_multiplier);
        }
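
    After the BNNM call, you can read the 8-bit result back into Java, which also forces the queued work to complete. At the end of runBNNM() (or wherever C is still in scope), something along these lines should do:

        // Read the U8 result of BNNM back into a byte[]; m * n matches the n x m Type of C above.
        byte[] c_byte = new byte[m * n];
        C.copyTo(c_byte);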
    

    SGEMM is similar:

            ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
            // Construct the Allocations: A, B, C somewhere and make sure the dimensions match.
            // Computes: C = 1.0f * A * B + 0.0f * C
            float alpha = 1.0f;
            float beta = 0.0f;
            blas.SGEMM(ScriptIntrinsicBLAS.NO_TRANSPOSE, ScriptIntrinsicBLAS.NO_TRANSPOSE,
                       alpha, A, B, beta, C);
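
    For completeness, here is a sketch of how the F32 Allocations for that SGEMM call could be set up; m, n, k and the float arrays a_data / b_data are placeholders. With NO_TRANSPOSE, A is m x k, B is k x n and C is m x n, where setX() is the column count and setY() the row count, as in the BNNM example above:

        // Sketch only: building float (F32) Allocations for SGEMM with NO_TRANSPOSE.
        // m, n, k, a_data and b_data are placeholders for your own sizes and data.
        Type.Builder fb = new Type.Builder(mRS, Element.F32(mRS));
        Allocation A = Allocation.createTyped(mRS, fb.setX(k).setY(m).create()); // m x k
        Allocation B = Allocation.createTyped(mRS, fb.setX(n).setY(k).create()); // k x n
        Allocation C = Allocation.createTyped(mRS, fb.setX(n).setY(m).create()); // m x n
        A.copyFrom(a_data);    // float[] of length m * k, row-major
        B.copyFrom(b_data);    // float[] of length k * n, row-major

        // ... blas.SGEMM(...) as above, then read the result back:
        float[] c_data = new float[m * n];
        C.copyTo(c_data);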
    
  • 2021-01-12 11:18

    I've implemented CNNs in RenderScript myself, and as you explain, it does require chaining multiple processes and calling forEach_*() multiple times for each layer if you implement each one as a separate kernel. As such, I can assure you that the forEach call returning does not really guarantee that the process has completed. In theory, it only schedules the kernel, and all queued-up requests will actually run whenever the system determines it's best to, especially if they get processed on the tablet's GPU.

    Usually, the only way to make absolutely sure you have some kind of control over a kernel truly running is to explicitly read the output of the RS kernel in between layers, for example by calling .copyTo() on that kernel's output Allocation. This "forces" any queued-up RS jobs that have not run yet (and on which that layer's output allocation depends) to execute at that time. Granted, this may introduce data-transfer overhead, so your timing will not be fully accurate; in fact, the execution time of the full network will quite surely be lower than the sum of the individually timed layers measured this way. But as far as I know, it's the only reliable way to time individual kernels in a chain, and it will give you some feedback on where the bottlenecks are to better guide your optimization, if that's what you're after. A rough sketch of the pattern is shown below.
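
    As an illustration only (the script, kernel, and Allocation names below are placeholders, and it assumes scalar F32 output Allocations), per-layer timing could look roughly like this:

        // Hedged sketch: timing two chained RenderScript kernels by forcing a sync
        // with copyTo(). layer1Script/layer2Script, forEach_convolve/forEach_pool,
        // input, layer1Out and layer2Out are placeholders for your own generated
        // ScriptC classes and Allocations (assumed to hold scalar F32 data).
        float[] out1 = new float[layer1Out.getType().getCount()];
        float[] out2 = new float[layer2Out.getType().getCount()];

        long t0 = System.nanoTime();
        layer1Script.forEach_convolve(input, layer1Out);
        layer1Out.copyTo(out1);    // blocks until the queued kernel has actually run
        long layer1Ns = System.nanoTime() - t0;

        t0 = System.nanoTime();
        layer2Script.forEach_pool(layer1Out, layer2Out);
        layer2Out.copyTo(out2);    // sync again before reading the clock
        long layer2Ns = System.nanoTime() - t0;

        android.util.Log.d("RSTiming",
                String.format("layer1: %.3f ms, layer2: %.3f ms",
                        layer1Ns / 1e6, layer2Ns / 1e6));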
