Is it much faster to re-initialize a vector using OpenMP threads?

孤城傲影 2021-01-23 15:03

I'm using the OpenMP libraries for parallel computing. I use C++ vectors whose size is usually on the order of 10^5. While going through the iteration process, I need to re-initiali

2 answers
  •  无人共我
    2021-01-23 15:14

    Assuming simple initialization of primitive data types, the initialization itself will be bound by memory or cache bandwidth. On modern systems, however, you need multiple threads to fully utilize both memory and cache bandwidth. For example, take a look at these benchmark results, where the first two rows compare parallel versus single-threaded cache bandwidth, and the last two rows compare parallel versus single-threaded main memory bandwidth. On high-performance-oriented systems, especially those with multiple sockets, using more threads is very important for exploiting the available bandwidth.
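
    As a minimal sketch of what a multi-threaded re-initialization could look like, the following compares an OpenMP loop with a plain std::fill baseline. The function names reinit_parallel and reinit_serial are hypothetical, not taken from the question, and whether the parallel version is actually faster depends on the machine and on the vector size.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: reset every element of v to `value`.
// When compiled with OpenMP enabled (e.g. -fopenmp on GCC/Clang) the loop
// iterations are divided among the threads; without OpenMP the pragma is
// ignored and the loop simply runs serially.
void reinit_parallel(std::vector<double>& v, double value) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());
    double* data = v.data();
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        data[i] = value;
    }
}

// Single-threaded baseline to compare against.
void reinit_serial(std::vector<double>& v, double value) {
    std::fill(v.begin(), v.end(), value);
}
```

    Both functions leave the vector in the same state, so they can be swapped freely when measuring.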

    However, the performance of the re-initialization is not the only thing you should care about. Assuming double-precision floating-point numbers, for instance, 10^5 elements occupy 800 kB of memory, which fits into the caches. To improve overall performance, you should try to ensure that after initialization the data sits in a cache close to the core that later accesses it. In a NUMA system (multiple sockets, each with faster access to its local memory), this is even more important.
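
    One way to keep the data near the core that uses it is to give the re-initialization loop and the compute loop the same static schedule inside one parallel region, so that each thread resets the same index range it later works on. The sketch below assumes this pattern; the function iterate, its trivial "compute" step, and the constants are made up for illustration only.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kElements = 100000;  // 10^5 elements, as in the question
// 800,000 bytes on platforms where sizeof(double) == 8.
constexpr std::size_t kFootprintBytes = kElements * sizeof(double);

// Hypothetical iteration loop: each step resets the vector and then uses it.
// Both worksharing loops have the same iteration count, the same static
// schedule, and bind to the same parallel region, so a thread touches the
// same indices in the reset loop and in the compute loop.
double iterate(std::vector<double>& v, int steps) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());
    double* data = v.data();
    double total = 0.0;
    for (int s = 0; s < steps; ++s) {
        #pragma omp parallel
        {
            #pragma omp for schedule(static)
            for (std::ptrdiff_t i = 0; i < n; ++i) {
                data[i] = 0.0;  // re-initialization
            }

            #pragma omp for schedule(static) reduction(+:total)
            for (std::ptrdiff_t i = 0; i < n; ++i) {
                data[i] += 1.0;  // stand-in for the real computation
                total += data[i];
            }
        }
    }
    return total;
}
```

    On a NUMA machine, thread pinning (for example via the OMP_PROC_BIND environment variable) is typically also needed for this to pay off, since a thread that migrates between sockets loses the locality.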

    If you do initialize shared memory concurrently, make sure not to write to the same cache line from different cores (false sharing), and try to keep the access pattern regular so as not to confuse the prefetchers and other clever magic of the CPU.
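
    A rough sketch of that advice: hand each thread large contiguous chunks whose size is a multiple of the cache line, rather than interleaving neighboring elements across threads. The 64-byte line size and the chunk length below are assumptions that vary between CPUs, and reinit_blocked is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Assumed cache line size; 64 bytes is common but not universal.
constexpr std::size_t kCacheLineBytes = 64;
constexpr std::ptrdiff_t kDoublesPerLine =
    static_cast<std::ptrdiff_t>(kCacheLineBytes / sizeof(double));

// Hypothetical helper: reset v using large contiguous chunks per thread.
void reinit_blocked(std::vector<double>& v, double value) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());
    double* data = v.data();
    // Each chunk spans many whole cache lines and is written front to back,
    // which keeps the access pattern regular. std::vector does not guarantee
    // line-aligned storage, so a line may still be shared at a chunk
    // boundary, but that is at most one line per boundary.
    const std::ptrdiff_t chunk = 512 * kDoublesPerLine;
    #pragma omp parallel for schedule(static, chunk)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        data[i] = value;
    }
    // By contrast, schedule(static, 1) would assign adjacent elements to
    // different threads, so almost every cache line would be written by
    // several cores.
}
```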

    The general recommendation is: start with a simple implementation, and later analyze your application to understand where the bottleneck actually is. Do not invest in complex, hard-to-maintain, system-specific optimizations that may only affect a tiny fraction of your code's overall runtime. If it turns out that this is a bottleneck for your application and your hardware resources are not well utilized, then you need to understand the performance characteristics of the underlying hardware (local/shared caches, NUMA, prefetchers) and tune your code accordingly.
