OpenCV: C++ and C performance comparison

予麋鹿 2020-12-22 19:22

Right now I'm developing an application using the OpenCV API (C++). The application processes video.

On the PC everything works really fast.

2 answers
  • 2020-12-22 20:15

    I've worked quite a lot with Android and optimizations (I wrote a video processing app that processes a frame in 4ms) so I hope I will give you some pertinent answers.

    There is not much difference between the C and C++ interfaces in OpenCV. Some of the code is written in C and has a C++ wrapper, and some vice versa. Any significant differences between the two (as measured by Shervin Emami) are either regressions, bug fixes or quality improvements. You should stick with the latest OpenCV version.

    Why not rewrite?

    You will spend a good deal of time, which you could use much better. The C interface is cumbersome, and the chance to introduce bugs or memory leaks is high. You should avoid it, in my opinion.

    Advice for optimization

    A. Turn on optimizations.

    Both compiler optimizations and the lack of debug assertions can make a big difference in your running time.

    B. Profile your app.

    Do it first on your computer, since it is much easier. Use the Visual Studio profiler to identify the slow parts, then optimize them. Never optimize because you think something is slow, but because you measured it. Start with the slowest function, optimize it as much as possible, then move on to the second slowest. Measure your changes to make sure the code is indeed faster.
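
    When a full profiler is not at hand, a before/after measurement can be as simple as the sketch below. It uses only the standard library; `averageMillis` is a hypothetical helper name, not something from OpenCV or from the original answer.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` `iterations` times and returns the
// average wall-clock time per call, in milliseconds.
template <typename Work>
double averageMillis(Work work, int iterations) {
    assert(iterations > 0);
    using Clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / iterations;
}
```

    A call such as `averageMillis([&] { processFrame(frame); }, 100)` (with `processFrame` standing in for your own function) gives a number to compare before and after each change. Averaging over many runs smooths out scheduler noise.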

    C. Focus on algorithms.

    A faster algorithm can improve performance by orders of magnitude (100x). A C++ trick will give you maybe a 2x performance boost.

    Classical techniques:

    • Resize your video frames to be smaller. Often you can extract the information from a 200x300px image instead of a 1024x768 one. The area of the first is about 13 times smaller.

    • Use simpler operations instead of complicated ones. Use integers instead of floats. And never use double in a matrix or a for loop that executes thousands of times.

    • Do as little calculation as possible. Can you track an object only in a specific area of the image, instead of processing it all for all the frames? Can you make a rough/approximate detection on a very small image and then refine it on a ROI in the full frame?
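
    The first two points can be combined in one routine. The sketch below halves a grayscale frame using integer arithmetic only. It is a library-independent illustration with a hypothetical function name, and it assumes a tightly packed 8-bit buffer with even dimensions. In an OpenCV pipeline `cv::resize` would normally do this job.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Halves the width and height of an 8-bit grayscale image by averaging each
// 2x2 block with integer arithmetic only. Assumes a tightly packed row-major
// buffer and even dimensions.
std::vector<std::uint8_t> halveGray(const std::vector<std::uint8_t>& src,
                                    int width, int height) {
    assert(width > 0 && height > 0);
    assert(width % 2 == 0 && height % 2 == 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);

    const int outW = width / 2;
    const int outH = height / 2;
    std::vector<std::uint8_t> dst(static_cast<std::size_t>(outW) * outH);

    for (int y = 0; y < outH; ++y) {
        const std::uint8_t* top = &src[static_cast<std::size_t>(2 * y) * width];
        const std::uint8_t* bottom = top + width;
        std::uint8_t* out = &dst[static_cast<std::size_t>(y) * outW];
        for (int x = 0; x < outW; ++x) {
            const int sum = top[2 * x] + top[2 * x + 1] +
                            bottom[2 * x] + bottom[2 * x + 1];
            out[x] = static_cast<std::uint8_t>((sum + 2) / 4);  // rounded average
        }
    }
    return dst;
}
```

    Each call cuts the pixel count by four, so every later stage that runs on the smaller frame has a quarter of the work. A rough detection on the small image can then be refined on a ROI of the full frame.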

    D. Use C where it matters

    In loops, it may make sense to use C style instead of C++. A pointer to a data matrix or a float array is much faster than mat.at or std::vector<>. Often the bottleneck is a nested loop. Focus on it. It doesn't make sense to replace vector<> all over the place and spaghettify your code.
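
    A minimal sketch of that pattern follows. `GrayImage` is a hypothetical stand-in for an 8-bit single-channel matrix, so the example compiles without OpenCV. With a real `cv::Mat` the equivalent is to fetch `mat.ptr<uchar>(y)` once per row instead of calling `mat.at<uchar>(y, x)` for every pixel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal image type: 8-bit, single channel, row-major.
// `step` is the number of bytes between the starts of consecutive rows.
struct GrayImage {
    int rows;
    int cols;
    std::size_t step;
    std::vector<std::uint8_t> data;

    GrayImage(int r, int c)
        : rows(r), cols(c), step(static_cast<std::size_t>(c)), data(step * r) {}

    std::uint8_t* rowPtr(int y) {
        assert(y >= 0 && y < rows);
        return data.data() + step * static_cast<std::size_t>(y);
    }
};

// Binarizes the image in place. The row pointer is fetched once per row,
// so the inner loop is a plain pointer walk with no per-pixel address math.
void thresholdInPlace(GrayImage& img, std::uint8_t threshold) {
    for (int y = 0; y < img.rows; ++y) {
        std::uint8_t* row = img.rowPtr(y);
        for (int x = 0; x < img.cols; ++x) {
            row[x] = (row[x] > threshold) ? 255 : 0;
        }
    }
}
```

    Only the hot nested loop uses raw pointers. The container that owns the memory stays a `std::vector`, which keeps the rest of the code safe and readable.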

    E. Avoid hidden costs

    Some OpenCV functions convert data to double, process it, then convert back to the input format. Beware of them, they kill performance on mobile devices. Examples: warping, scaling, type conversions. Also, color space conversions are known to be lazy. Prefer grayscale obtained directly from native YUV.
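
    As an illustration of the last point: if the camera delivers NV21 frames (an assumption here, as the answer only says "native YUV"), the grayscale image is already sitting at the start of the buffer and needs no conversion at all. The names below are hypothetical, and the sketch assumes rows without padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Non-owning view of the luma (Y) plane inside a camera frame.
struct GrayView {
    const std::uint8_t* data;
    int width;
    int height;
};

// Assumes an NV21 (or NV12) buffer with no row padding: the first
// width * height bytes are the Y plane, followed by the interleaved chroma.
GrayView lumaFromNv21(const std::uint8_t* frame, std::size_t frameBytes,
                      int width, int height) {
    assert(frame != nullptr);
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    const std::size_t lumaBytes = static_cast<std::size_t>(width) * height;
    assert(frameBytes >= lumaBytes + lumaBytes / 2);  // Y + subsampled chroma
    (void)frameBytes;  // only read by the assertion above
    return GrayView{frame, width, height};
}
```

    No pixel is copied or converted. The view simply points into the frame, so it is valid only while the frame buffer is alive.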

    F. Use vectorization

    ARM processors implement vectorization with a technology called NEON. Learn to use it. It is powerful!

    A small example:

    float* a, *b, *c;
    // init a and b to 1000001 elements
    for(int i=0;i<1000001;i++)
        c[i] = a[i]*b[i];
    

    can be rewritten as follows. It's more verbose, but much faster.

    #include <arm_neon.h>

    float* a, *b, *c;
    // init a, b and c to 1000001 elements
    float32x4_t a_, b_, c_;
    int i;
    // loop only while a full group of 4 floats is still available
    for(i=0;i+4<=1000001;i+=4)
    {
        a_ = vld1q_f32( &a[i] ); // load 4 floats from a in a NEON register
        b_ = vld1q_f32( &b[i] );
        c_ = vmulq_f32(a_, b_); // perform 4 float multiplies in parallel
        vst1q_f32( &c[i], c_); // store the four results in c
    }
    // the vector size is not always multiple of 4 or 8 or 16.
    // Process the remaining elements
    for(;i<1000001;i++)
        c[i] = a[i]*b[i];
    

    Purists say you must write in assembler, but for a regular programmer that's a bit daunting. I had good results using gcc intrinsics, like in the above example.
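
    Since profiling is easier on a PC, it can help to wrap the intrinsics so the same function also builds where NEON is unavailable. The sketch below is one possible arrangement, not the only one: `multiplyArrays` is a hypothetical name, and the guard relies on the `__ARM_NEON` macro (or the older `__ARM_NEON__`) that ARM compilers define when NEON is enabled.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

// Element-wise product c[i] = a[i] * b[i]. Uses NEON intrinsics when the
// compiler targets an ARM core with NEON, and plain scalar code otherwise.
void multiplyArrays(const float* a, const float* b, float* c, std::size_t n) {
    assert(a != nullptr && b != nullptr && c != nullptr);
    std::size_t i = 0;
#if HAVE_NEON
    // Process full groups of 4 floats; never read past the end of the arrays.
    for (; i + 4 <= n; i += 4) {
        const float32x4_t va = vld1q_f32(a + i);
        const float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(c + i, vmulq_f32(va, vb));
    }
#endif
    // Scalar tail on NEON builds, or the whole array on other targets.
    for (; i < n; ++i) {
        c[i] = a[i] * b[i];
    }
}
```

    On a desktop build the scalar loop handles everything, which makes it straightforward to check results on the PC before measuring the NEON path on the device.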

    Another way to jump-start is to convert the hand-coded SSE-optimized code in OpenCV into NEON. SSE is the NEON equivalent on Intel processors, and many OpenCV functions use it, like here. This is the image filtering code for uchar matrices (the regular image format). You shouldn't blindly convert instructions one by one, but take it as an example to start with.

    You can read more about NEON in this blog and the following posts.

    G. Pay attention to image capture

    It can be surprisingly slow on a mobile device. Optimizing it is device and OS specific.

  • Before making any decision like this, you should profile your code to locate the hotspots in your code. Without this information, any changes you make to speed things up will be guesswork. Have you tried this Android NDK profiler?
