Multi Threading Performance in Multiplication of 2 Arrays / Images - Intel IPP

I'm using Intel IPP for multiplication of 2 Images (Arrays).
I'm using Intel IPP 8.2 which comes with Intel Composer 2015 Update 6.

I created a simple function to multiply two large images (the whole project is attached, see below).
I wanted to see the gains from using the Intel IPP multi-threaded library.

Here is the simple project (I also attached the complete project from Visual Studio):

#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"

#include <ctime>
#include <iostream>

using namespace std;

const int height = 6000;
const int width  = 6000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};

int main()
{
    IppiSize size = {width, height};

    double start = clock();

    for (int i = 0; i < 200; i++)
        ippiMul_32f_C1R(mInput_image, 6000 * 4, mInput_image, 6000 * 4, mOutput_image, 6000 * 4, size); 

    double end = clock();
    double duration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);

    cout << duration << endl;
    cin.get();

    return 0;
}

I compiled this project once using the single-threaded Intel IPP and once using the multi-threaded Intel IPP.

I tried different array sizes, and in all of them the multi-threaded version yields no gain (sometimes it is even slower).

I wonder, how come there is no gain from multi-threading in this task?
I know Intel IPP uses AVX, and I thought maybe the task becomes memory bound?

I tried another approach: using OpenMP manually to run the single-threaded Intel IPP implementation from multiple threads.
This is the code:

#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"

#include <ctime>
#include <iostream>

using namespace std;

#include <omp.h>

const int height = 5000;
const int width  = 5000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};

int main()
{
    IppiSize size = {width, height};

    double start = clock();

    IppiSize blockSize = {width, height / 4};

    const int NUM_BLOCK = 4;
    omp_set_num_threads(NUM_BLOCK);

    Ipp32f*  in;
    Ipp32f*  out;

    //  ippiMul_32f_C1R(mInput_image, width * 4, mInput_image, width * 4, mOutput_image, width * 4, size);

    #pragma omp parallel            \
    shared(mInput_image, mOutput_image, blockSize) \
    private(in, out)
    {
        int id   = omp_get_thread_num();
        int step = blockSize.width * blockSize.height * id;
        in       = mInput_image  + step;
        out      = mOutput_image + step;
        ippiMul_32f_C1R(in, width * 4, in, width * 4, out, width * 4, blockSize);
    }

    double end = clock();
    double duration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);

    cout << duration << endl;
    cin.get();

    return 0;
}

The results were the same: again, no performance gain.

Is there a way to benefit from multi-threading in this kind of task?
How can I validate whether a task becomes memory bound, so that there is no benefit in parallelizing it? Is there any benefit in parallelizing the multiplication of 2 arrays on a CPU with AVX?

The computer I tried it on is based on a Core i7-4770K (Haswell).

Here is a link to the Project in Visual Studio 2013.

Thank You.

Your images occupy 200 MB in total (2 x 5000 x 5000 x 4 bytes). Each block therefore consists of 50 MB of data. This is more than 6 times the size of your CPU's L3 cache (see here). Each AVX vector multiplication operates on 256 bits of data, which is half a cache line, i.e. it consumes one cache line per vector instruction (half a cache line for each argument). A vectorised multiplication on Haswell has a latency of 5 cycles and the FPU can retire two such instructions per cycle (see here). The memory bus of the i7-4770K is rated at 25.6 GB/s (theoretical maximum!), or no more than 430 million cache lines per second. The nominal speed of the CPU is 3.5 GHz. The AVX part is clocked a bit lower, let's say at 3.1 GHz. At that speed, it takes an order of magnitude more cache lines per second to fully feed the AVX engine.
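
The arithmetic behind that estimate can be written out as a small sketch. The function names are made up for illustration, and the 64-byte cache line and the 3.1 GHz AVX clock are the assumptions stated above, not measured values. Taking 25.6 GB/s as 25.6 x 10^9 bytes per second gives 400 million lines per second, which is within the "no more than 430 million" bound.

```cpp
#include <cassert>

// Cache lines per second that a memory bus of the given bandwidth can deliver.
// Assumes 64-byte cache lines unless told otherwise.
double busCacheLinesPerSecond(double busBytesPerSecond, double cacheLineBytes = 64.0)
{
    assert(busBytesPerSecond > 0.0 && cacheLineBytes > 0.0);
    return busBytesPerSecond / cacheLineBytes;
}

// Cache lines per second needed to keep the vector multipliers busy, assuming
// each 256-bit multiply consumes one cache line (half a line per argument).
double avxCacheLinesPerSecond(double clockHz, double instructionsPerCycle)
{
    assert(clockHz > 0.0 && instructionsPerCycle > 0.0);
    return clockHz * instructionsPerCycle;
}

// How many times more cache lines the AVX units could consume than the bus
// can supply. A value well above 1 suggests the loop is memory bound.
double feedRatio(double busBytesPerSecond, double clockHz, double instructionsPerCycle)
{
    return avxCacheLinesPerSecond(clockHz, instructionsPerCycle)
         / busCacheLinesPerSecond(busBytesPerSecond);
}
```

With the figures quoted above, feedRatio(25.6e9, 3.1e9, 2.0) evaluates to 15.5, i.e. the AVX units could consume roughly fifteen times more data than the bus delivers.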

Under those conditions, a single thread of vectorised code almost fully saturates the memory bus of your CPU. Adding a second thread might yield a very slight improvement. Adding further threads only results in contention and added overhead. The only way to speed up such a calculation is to increase the memory bandwidth:

run on a NUMA system with more memory controllers and therefore higher aggregate memory bandwidth, e.g. a multisocket server board;
switch to a different architecture with much higher memory bandwidth, e.g. Intel Xeon Phi or a GPGPU.
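
One way to check the saturation claim on a particular machine is to time a plain element-wise multiply at several thread counts and compare the resulting GB/s with the rated bus bandwidth. The following is only a minimal sketch: it uses a hand-written loop instead of IPP, two distinct inputs instead of squaring one image, and a hypothetical function name. It counts two reads and one write per element and ignores any write-allocate traffic, so the real bus load may be higher. It times with std::chrono::steady_clock because clock() reports processor time rather than wall time on some platforms, which can hide a multi-threading gain.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Multiplies two float arrays element-wise using the requested number of
// OpenMP threads and returns the effective memory traffic in GB/s.
// Build with OpenMP enabled (e.g. -fopenmp or /openmp); without it the
// pragma is ignored and the loop simply runs on one thread.
double multiplyBandwidthGBps(std::size_t count, int threads, int repeats = 10)
{
    assert(count > 0 && threads > 0 && repeats > 0);
    std::vector<float> a(count, 1.5f), b(count, 2.0f), out(count, 0.0f);
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(count);

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        #pragma omp parallel for num_threads(threads) schedule(static)
        for (std::ptrdiff_t i = 0; i < n; ++i)
            out[i] = a[i] * b[i];
    }
    const auto stop = std::chrono::steady_clock::now();

    // Read one result so the compiler cannot discard the whole computation.
    volatile float sink = out[count / 2];
    (void)sink;

    const double seconds = std::chrono::duration<double>(stop - start).count();
    // Two reads and one write of sizeof(float) bytes per element per pass.
    const double bytes = 3.0 * sizeof(float) * static_cast<double>(count) * repeats;
    return bytes / seconds / 1e9;
}
```

Calling it with an array much larger than the L3 cache (for example 25 million elements) for 1, 2 and 4 threads should show whether the throughput stops growing once it approaches the bus limit; if it does, the loop is memory bound on that machine.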

From some research of my own, it looks like your total CPU cache is around 8 MB. Each of the 4 blocks of a 6000 x 6000 float image is 6000 * 6000 * 4 / 4 bytes, i.e. 36 MB. Multiply this by 2 (in and out), and you're far outside of the cache.

I haven't tested this, but increasing the number of blocks should increase the performance. Try 8 to start with (your CPU supports hyper-threading with 8 virtual cores).

Currently, the threads spawned by OpenMP are having cache conflicts and have to (re)load from main memory. Reducing the size of the blocks can help with this. Having distinct caches would effectively increase the size of your cache, but it seems that's not an option.
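
To experiment with other block counts, the image rows can be split with a small helper like the sketch below. The RowBlock type and the splitRows name are hypothetical; unlike the fixed height / 4 in the question, it also copes with heights that are not a multiple of the block count. Whether smaller blocks actually help is untested here.

```cpp
#include <cassert>
#include <vector>

// A horizontal band of the image: rows [firstRow, firstRow + rowCount).
struct RowBlock {
    int firstRow;
    int rowCount;
};

// Splits `height` rows into at most `blockCount` contiguous bands whose
// sizes differ by at most one row. Leftover rows go to the first bands.
std::vector<RowBlock> splitRows(int height, int blockCount)
{
    assert(height > 0 && blockCount > 0);
    std::vector<RowBlock> blocks;
    const int base  = height / blockCount;
    const int extra = height % blockCount;
    int row = 0;
    for (int b = 0; b < blockCount; ++b) {
        const int rows = base + (b < extra ? 1 : 0);
        if (rows == 0)
            break;  // more blocks requested than rows available
        blocks.push_back({row, rows});
        row += rows;
    }
    return blocks;
}

// Intended use with the call from the question (not compiled here):
//   const RowBlock& blk = blocks[k];
//   Ipp32f* in  = mInput_image  + blk.firstRow * width;
//   Ipp32f* out = mOutput_image + blk.firstRow * width;
//   IppiSize roi = {width, blk.rowCount};
//   ippiMul_32f_C1R(in, width * 4, in, width * 4, out, width * 4, roi);
```

Each band can then be handed to one iteration of an OpenMP parallel for loop, which decouples the number of blocks from the number of threads.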

If you're just doing this as a proof of principle, you may want to test this by running it on your graphics card. Although, that can be even harder to implement properly.

If you run with hyper-threading enabled, you should try the OpenMP version of IPP with 1 thread per core, and set OMP_PLACES=cores if IPP doesn't do it automatically. If you use IPP with Cilk, try varying the number of Cilk workers (CILK_NWORKERS). You might try a test case large enough to span multiple 4 KB pages; then additional factors come into play. Ideally, IPP will put the threads to work on different pages. On Linux (or Mac?), transparent huge pages (THP) should kick in. On Windows, Haswell CPUs introduced hardware page prefetch, which should reduce, but not eliminate, the importance of THP.

Source: https://stackoverflow.com/questions/36966474/multi-threading-performance-in-multiplication-of-2-arrays-images-intel-ipp

Tags

c++

multithreading

openmp

intel-ipp