Multi Threading Performance in Multiplication of 2 Arrays / Images - Intel IPP

Submitted by 烂漫一生 on 2019-12-05 12:47:25

Your images occupy 200 MB in total (2 x 5000 x 5000 x 4 bytes), so each of the four blocks consists of 50 MB of data. That is more than 6 times the size of your CPU's L3 cache (see here). Each AVX vector multiplication operates on 256 bits of data per argument, which is half a cache line, i.e. it consumes one cache line per vector instruction (half a cache line for each argument). A vectorised multiplication on Haswell has a latency of 5 cycles, and the FPU can retire two such instructions per cycle (see here). The memory bus of the i7-4770K is rated at 25.6 GB/s (theoretical maximum!), i.e. no more than about 400 million cache lines per second. The nominal speed of the CPU is 3.5 GHz. The AVX part is clocked a bit lower, let's say at 3.1 GHz. At that speed, fully feeding the AVX engine takes roughly 6 billion cache lines per second, an order of magnitude more than the memory bus can deliver.
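The arithmetic in this paragraph can be checked with a few lines of Python. This is only a back-of-envelope sketch: the 8 MB L3 size, the four blocks, and the 3.1 GHz AVX clock are taken as assumptions, and 25.6 GB/s is read as 25.6 x 10^9 bytes per second (reading it as binary gigabytes gives about 430 million cache lines per second instead of 400 million).

```python
# Back-of-envelope figures; every constant here is an estimate or assumption.
IMAGE_SIDE = 5000            # pixels per side
BYTES_PER_FLOAT = 4
NUM_IMAGES = 2
NUM_BLOCKS = 4               # assumed: one block per thread
L3_BYTES = 8 * 10**6         # assumed 8 MB L3 cache
CACHE_LINE = 64              # bytes
BUS_BYTES_PER_S = 25.6e9     # theoretical peak of the memory bus
AVX_CLOCK_HZ = 3.1e9         # assumed clock while running AVX code
VEC_MULS_PER_CYCLE = 2       # vector multiplies retired per cycle
LINES_PER_VEC_MUL = 1        # two 256-bit arguments = one 64-byte line

total_bytes = NUM_IMAGES * IMAGE_SIDE * IMAGE_SIDE * BYTES_PER_FLOAT
block_bytes = total_bytes // NUM_BLOCKS

# Cache lines per second the bus can supply vs. what the FPU could consume.
supply = BUS_BYTES_PER_S / CACHE_LINE
demand = AVX_CLOCK_HZ * VEC_MULS_PER_CYCLE * LINES_PER_VEC_MUL

print(f"total data:       {total_bytes / 1e6:.0f} MB")
print(f"per block:        {block_bytes / 1e6:.0f} MB "
      f"({block_bytes / L3_BYTES:.2f}x the L3 cache)")
print(f"bus supply:       {supply / 1e6:.0f} million cache lines/s")
print(f"AVX demand:       {demand / 1e6:.0f} million cache lines/s")
print(f"demand / supply:  {demand / supply:.1f}")
```

Under these assumptions the demand exceeds the supply by a factor of about 15, which is the "order of magnitude" gap described above.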

Under those conditions, a single thread of vectorised code almost fully saturates the memory bus of your CPU. Adding a second thread might result in a very slight improvement. Adding further threads only results in contention and added overhead. The only way to speed up such a calculation is to increase the memory bandwidth:

  • run on a NUMA system with more memory controllers and therefore higher aggregate memory bandwidth, e.g. a multisocket server board;
  • switch to a different architecture with much higher memory bandwidth, e.g. Intel Xeon Phi or a GPGPU.
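The saturation argument can be illustrated with a toy model in which the kernel runs at whichever is lower: the combined throughput of the threads or the memory bus limit. This is a sketch, not a measurement. The function names are made up for illustration, and the assumption that one thread reaches 90% of the bus peak is a guess standing in for "almost fully saturates".

```python
def estimated_time(threads, total_bytes, bus_bytes_per_s, per_thread_bytes_per_s):
    """Toy model: data is processed at the lesser of the combined thread
    throughput and the memory bus limit. Contention overhead is ignored."""
    rate = min(threads * per_thread_bytes_per_s, bus_bytes_per_s)
    return total_bytes / rate


def speedup(threads, total_bytes, bus_bytes_per_s, per_thread_bytes_per_s):
    """Speedup of `threads` threads relative to a single thread."""
    base = estimated_time(1, total_bytes, bus_bytes_per_s, per_thread_bytes_per_s)
    multi = estimated_time(threads, total_bytes, bus_bytes_per_s, per_thread_bytes_per_s)
    return base / multi


TOTAL_BYTES = 200e6          # both input images
BUS = 25.6e9                 # theoretical peak, bytes per second
PER_THREAD = 0.9 * BUS       # assumed: one thread reaches ~90% of the peak

for t in (1, 2, 4, 8):
    print(t, "thread(s): speedup", round(speedup(t, TOTAL_BYTES, BUS, PER_THREAD), 3))
```

In this model the speedup stops growing as soon as the bus is the limit, so two threads and eight threads give the same result. Raising `BUS` (more memory controllers, or a different architecture) is the only parameter that moves the ceiling.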

From some research of my own, it looks like your total CPU cache is around 8 MB. 6000*4/4 (6000 floats split into blocks of 4) is 6 MB. Multiply this by 2 (in and out), and you're outside of the cache.

I haven't tested this, but increasing the number of blocks should increase the performance. Try 8 to start with (your CPU supports hyperthreading, giving 8 logical cores).

Currently, each of the threads spawned by OpenMP is running into cache conflicts and has to (re)load data from main memory. Reducing the size of the blocks can help with this. Having distinct caches would effectively increase the size of your cache, but it seems that's not an option.
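One way to pick a block count is to ask how many blocks are needed so that the blocks being processed at the same time fit in the shared cache together. The sketch below is an illustration under stated assumptions: the helper name `blocks_to_fit_cache` is hypothetical, the cache is taken as 8 MB, and each block is assumed to touch two arrays (in and out).

```python
def blocks_to_fit_cache(array_bytes, arrays_per_block, cache_bytes, threads):
    """Smallest block count, rounded up to a multiple of `threads`, such that
    the blocks processed concurrently (one per thread, `arrays_per_block`
    arrays each) fit in a shared cache of `cache_bytes`."""
    needed = threads * arrays_per_block * array_bytes
    blocks = -(-needed // cache_bytes)            # integer ceiling division
    remainder = blocks % threads
    if remainder:
        blocks += threads - remainder             # keep the work evenly divisible
    return blocks


ARRAY_BYTES = 5000 * 5000 * 4    # one 5000 x 5000 float image
CACHE_BYTES = 8 * 10**6          # assumed 8 MB shared L3
ARRAYS_PER_BLOCK = 2             # assumed: input and output

for threads in (1, 4, 8):
    n = blocks_to_fit_cache(ARRAY_BYTES, ARRAYS_PER_BLOCK, CACHE_BYTES, threads)
    print(threads, "thread(s):", n, "blocks of", ARRAY_BYTES // n, "bytes per array")
```

Smaller blocks reduce the chance that threads evict each other's data, but every element is still read from main memory once, so the total memory traffic stays the same.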

If you're just doing this as a proof of principle, you may want to test it by running it on your graphics card, although that can be even harder to implement properly.

If you run with hyperthreading enabled, you should try the OpenMP version of IPP with 1 thread per core, and set OMP_PLACES=cores if IPP doesn't do it automatically. If you use the Cilk version of IPP, try varying the number of Cilk workers (CILK_NWORKERS). You might also try a test case large enough to span multiple 4 KB pages; then additional factors come into play. Ideally, IPP will put the threads to work on different pages. On Linux (or Mac?), transparent huge pages (THP) should kick in. On Windows, Haswell CPUs introduced hardware page prefetch, which should reduce but not eliminate the importance of THP.
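The OpenMP settings mentioned here are plain environment variables, so they can be set for a single benchmark run without touching the rest of the system. The sketch below only builds the environment. The helper name `openmp_env` and the benchmark binary in the comment are hypothetical, and 4 physical cores is an assumption about the machine.

```python
import os


def openmp_env(physical_cores, base_env=None):
    """Return a copy of the environment with one OpenMP thread per physical
    core and threads bound to cores rather than to hyperthreads."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(physical_cores)   # standard OpenMP variable
    env["OMP_PLACES"] = "cores"                    # standard since OpenMP 4.0
    return env


env = openmp_env(4)   # assumed: 4 physical cores, 8 logical with hyperthreading
print(env["OMP_NUM_THREADS"], env["OMP_PLACES"])

# To launch a benchmark with these settings (binary name is hypothetical):
# import subprocess
# subprocess.run(["./ipp_mul_bench"], env=env, check=True)
```

Whether the threaded IPP library honours these variables depends on how it was built, so it is worth confirming the thread count from inside the program as well.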
