How to do correct timing of Android RenderScript code on Nvidia Shield

Question

I have implemented a small CNN in RenderScript and want to profile the performance on different hardware. On my Nexus 7 the times make sense, but on the NVIDIA Shield they do not.

The CNN (LeNet) is implemented as 9 layers held in a queue, and computation is performed in sequence. Each layer is timed individually.

Here is an example (times in ms):

        conv1   pool1  conv2   pool2  resh1  ip1     relu1   ip2     softmax
nexus7  11.177  7.813  13.357  8.367  8.097  2.1     0.326   1.557   2.667
shield  13.219  1.024  1.567   1.081  0.988  14.588  13.323  14.318  40.347

The distribution of the times is about right for the Nexus, with conv1 and conv2 (the convolution layers) taking most of the time. But on the Shield, the times for layers 2-4 drop far below what is reasonable and seem to pile up towards the end. The softmax layer is a relatively small job, so 40 ms is far too long. My timing method must be faulty, or something else is going on.

The code running the layers looks something like this:

double[] times = new double[layers.size()];
int layerindex = 0;
for (Layer a : layers) {

    double t = SystemClock.elapsedRealtime(); 
    //long t = System.currentTimeMillis(); // makes no difference

    blob = a.forward(blob); // here we call renderscript forEach_(), invoke_() etc

    //mRS.finish(); // makes no difference

    t = SystemClock.elapsedRealtime() - t; 
    //t = System.currentTimeMillis() - t; // makes no difference

    times[layerindex] += t; // later we take average etc

    layerindex++;
}

It is my understanding that once forEach_() returns, the job is supposed to be finished. In any case, mRS.finish() should provide a final barrier. But looking at the times, the only reasonable explanation is that jobs are still processed in the background.

The app is very simple, I just run the test from MainActivity and print to logcat. Android Studio builds the app as a release and runs it on the device which is connected by USB.

(1) What is the correct way to time RenderScript processes?

(2) Is it true that when forEach_() returns, the threads spawned by the script are guaranteed to be done?

(3) In my test app, I simply run directly from the MainActivity. Is this a problem (other than blocking the UI thread and making the app unresponsive)? If this influences the timing or causes the weirdness, what is a proper way to set up a test app like this?

Answer 1:

I've implemented CNNs in RenderScript myself, and as you explain, it does require chaining multiple processes and calling forEach_*() several times for each layer if you implement each one as a separate kernel. From that experience, I can assure you that the forEach call returning does not guarantee that the kernel has completed. In theory, the call only schedules the kernel, and all queued-up requests actually run whenever the system decides it is best, especially if they are processed on the tablet's GPU.
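To see why deferred execution would produce the pattern in the question, here is a toy model in plain Java. It is only a sketch: the class `DeferredQueueDemo`, its methods, and the simulated clock are made up for illustration and say nothing about how the actual driver schedules work.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model (hypothetical) of a runtime that queues kernels and runs them only when forced to. */
final class DeferredQueueDemo {
    private final Deque<Long> pending = new ArrayDeque<>(); // queued kernel costs, in simulated ms
    private long clockMs = 0;                               // simulated wall clock

    long now() { return clockMs; }

    /** Stands in for a forEach call: returns at once, the work is only queued. */
    void launch(long costMs) { pending.add(costMs); }

    /** Stands in for reading an output back: everything queued so far must run first. */
    void sync() {
        while (!pending.isEmpty()) {
            clockMs += pending.poll();
        }
    }

    /** Times each launch. Syncs after every launch if syncEach is true, otherwise only after the last one. */
    static long[] measure(long[] costs, boolean syncEach) {
        DeferredQueueDemo q = new DeferredQueueDemo();
        long[] measured = new long[costs.length];
        for (int i = 0; i < costs.length; i++) {
            long start = q.now();
            q.launch(costs[i]);
            if (syncEach || i == costs.length - 1) {
                q.sync();
            }
            measured[i] = q.now() - start;
        }
        return measured;
    }
}
```

With costs of 11, 8 and 13 ms, `measure(costs, false)` returns [0, 0, 32] while `measure(costs, true)` returns [11, 8, 13]. In this model the unsynchronized measurement charges all the work to whichever step finally waits, which resembles the way time gathers in the last layers on the Shield.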

Usually, the only way to be sure that a kernel has actually run is to explicitly read its output between layers, for example by calling .copyTo() on that kernel's output allocation. This "forces" any queued RS jobs that the output allocation depends on, and that have not run yet, to execute at that point. Granted, this may introduce data-transfer overhead, so your timing will not be fully accurate. In fact, the execution time of the full network will almost certainly be lower than the sum of the individual layers timed this way. But as far as I know, it is the only reliable way to time individual kernels in a chain, and it gives you enough feedback to find the bottlenecks and guide your optimization, if that's what you're after.
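A minimal sketch of that timing loop follows, written as plain Java so it runs without a device. The `Stage` interface and the `LayerTimer` class are hypothetical names. In the asker's app, `enqueue()` would wrap `a.forward(blob)` and `waitForOutput()` would wrap a `copyTo()` from that layer's output Allocation into a host array.

```java
import java.util.List;

/** Hypothetical harness: times each stage only after its output has been forced to materialize. */
final class LayerTimer {
    /** Stand-in for one layer of the network. */
    interface Stage {
        void enqueue();       // e.g. the layer's forEach_*() / invoke_*() calls
        void waitForOutput(); // e.g. outputAllocation.copyTo(hostArray), which blocks until the data exists
    }

    /** Returns the average wall time per stage in milliseconds over the given number of runs. */
    static double[] timeStages(List<Stage> stages, int runs) {
        double[] ms = new double[stages.size()];
        for (int r = 0; r < runs; r++) {
            for (int i = 0; i < stages.size(); i++) {
                Stage stage = stages.get(i);
                long start = System.nanoTime();
                stage.enqueue();
                stage.waitForOutput(); // barrier: the stage's work is charged to this stage
                ms[i] += (System.nanoTime() - start) / 1e6;
            }
        }
        for (int i = 0; i < ms.length; i++) {
            ms[i] /= runs;
        }
        return ms;
    }
}
```

The read-back cost is included in every measurement, so these numbers are upper bounds per layer. Timing one full forward pass with a single read-back at the end gives the end-to-end figure to compare against.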

Answer 2:

This may be a little off topic, but for a CNN, if you can structure your algorithm with matrix-matrix multiplication as the basic computing block, you can use the RenderScript intrinsic ScriptIntrinsicBLAS, especially BNNM and SGEMM.

Pros:

High-performance implementation of 8-bit matrix multiplication (BNNM), available in the N Preview.
Backward compatible to Android 2.3 through the RenderScript Support Library, when using Build-Tools 24.0.0 rc3 and above.
High-performance GPU acceleration of SGEMM on the Nexus 5X and 6P with N Preview build NPC91K.
If you only use RenderScript intrinsics, you can code everything in Java.

Cons:

Your algorithm may need to be refactored so that it is based on 2D matrix multiplication.
BNNM is available in Android 6.0, but its performance there is not satisfactory, so it is better to use the Support Library for BNNM and set targetSdkVersion to 24.
SGEMM GPU acceleration is currently only available on the Nexus 5X and Nexus 6P, and it currently requires the width and height of the matrices to be multiples of 8.
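As an illustration of what refactoring towards 2D matrix multiplication can look like, one commonly used lowering is im2col: every filter-sized patch of the input becomes a row of a matrix, and applying a filter becomes a matrix product. The sketch below is not taken from the question's code. It assumes a single channel, stride 1 and no padding, and the class and method names are made up.

```java
/** Hypothetical sketch of the im2col lowering for a single-channel image. */
final class Im2ColSketch {
    /** Unrolls each kh x kw patch of the image (stride 1, no padding) into one row. */
    static float[][] im2col(float[][] image, int kh, int kw) {
        int outH = image.length - kh + 1;
        int outW = image[0].length - kw + 1;
        float[][] patches = new float[outH * outW][kh * kw];
        for (int y = 0; y < outH; y++) {
            for (int x = 0; x < outW; x++) {
                for (int i = 0; i < kh; i++) {
                    for (int j = 0; j < kw; j++) {
                        patches[y * outW + x][i * kw + j] = image[y + i][x + j];
                    }
                }
            }
        }
        return patches;
    }

    /**
     * Naive product of the patch matrix with one flattened filter (row-major, not flipped,
     * i.e. cross-correlation as CNN layers usually compute it). A GEMM call would replace this loop.
     */
    static float[] applyFilter(float[][] patches, float[] filter) {
        float[] out = new float[patches.length];
        for (int r = 0; r < patches.length; r++) {
            float sum = 0f;
            for (int c = 0; c < filter.length; c++) {
                sum += patches[r][c] * filter[c];
            }
            out[r] = sum;
        }
        return out;
    }
}
```

With several filters stacked as columns of a second matrix, the per-filter loop turns into a single matrix-matrix product, which is the shape of problem SGEMM handles.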

It's worth trying if BLAS fits your algorithm, and it is easy to use:

    import android.support.v8.renderscript.*;
    // if you are not using support lib:
    // import android.renderscript.*;

    private void runBNNM(int m, int n, int k, byte[] a_byte, byte[] b_byte, int c_offset, RenderScript mRS) {
        Allocation A, B, C;
        Type.Builder builder = new Type.Builder(mRS, Element.U8(mRS));
        Type a_type = builder.setX(k).setY(m).create();
        Type b_type = builder.setX(k).setY(n).create();
        Type c_type = builder.setX(n).setY(m).create();

        // If you are reusing the input Allocations, just create and cache them somewhere else.
        A = Allocation.createTyped(mRS, a_type);
        B = Allocation.createTyped(mRS, b_type);
        C = Allocation.createTyped(mRS, c_type);
        A.copyFrom(a_byte);
        B.copyFrom(b_byte);

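        ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
        // Computes: C = A * B.Transpose
        // c_offset comes from the method parameter; redeclaring it here would not compile.
        int a_offset = 0;
        int b_offset = 0;
        int c_multiplier = 1;
        blas.BNNM(A, a_offset, B, b_offset, C, c_offset, c_multiplier);
        // The result stays in C; read it back with C.copyTo(byte[]) if it is needed on the host.
    }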

SGEMM is similar:

        ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
        // Construct the Allocations: A, B, C somewhere and make sure the dimensions match.
        // Computes: C = 1.0f * A * B + 0.0f * C
        float alpha = 1.0f;
        float beta = 0.0f;
        blas.SGEMM(ScriptIntrinsicBLAS.NO_TRANSPOSE, ScriptIntrinsicBLAS.NO_TRANSPOSE,
                   alpha, A, B, beta, C);
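The multiple-of-8 constraint on GPU-accelerated SGEMM listed in the cons can be met by zero-padding the matrices, since the extra zero rows and columns do not change the top-left block of the product. The helper below is a sketch under the assumption that the host data is a row-major float array; `GemmPadding` and its methods are hypothetical names, not part of RenderScript.

```java
/** Hypothetical helpers for padding matrix dimensions up to a multiple of 8. */
final class GemmPadding {
    /** Rounds n up to the next multiple of `multiple` (assumes n >= 0 and multiple > 0). */
    static int roundUp(int n, int multiple) {
        return ((n + multiple - 1) / multiple) * multiple;
    }

    /** Copies a row-major rows x cols matrix into a zero-filled paddedRows x paddedCols buffer. */
    static float[] padRowMajor(float[] src, int rows, int cols, int paddedRows, int paddedCols) {
        float[] dst = new float[paddedRows * paddedCols]; // Java zero-initializes new arrays
        for (int r = 0; r < rows; r++) {
            System.arraycopy(src, r * cols, dst, r * paddedCols, cols);
        }
        return dst;
    }
}
```

For example, a 10 x 20 matrix would be padded to `roundUp(10, 8)` x `roundUp(20, 8)`, that is 16 x 24, before being copied into its Allocation. Only the first 10 rows and 20 columns of the corresponding result are meaningful.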

Source: https://stackoverflow.com/questions/37080673/how-to-do-correct-timing-of-android-renderscript-code-on-nvidia-shield

Tags: android, multithreading, performance, timing, renderscript