Is multi-thread memory access faster than single threaded memory access?

Question

Assume we are working in C. A simple example: I have a gigantic array A and want to copy it to an array B of the same size. Is a multithreaded memory copy faster than a single-threaded one? How many threads are suitable for this kind of memory operation?

EDIT: Let me narrow the question. First of all, leave the GPU case aside. Memory access optimization is very important and effective in GPU programming; in my experience, we always need to be careful with memory operations there. On the CPU, by contrast, that is not always the case. Let's also not consider SIMD instructions such as AVX and SSE; they too run into memory performance issues when a program performs many memory accesses relative to its computation. Assume an x86 architecture with 1-2 CPUs, each with multiple cores and a quad-channel memory interface. The main memory is DDR4, as is common today.

My array holds double-precision floating-point numbers and is roughly the size of a CPU's L3 cache, about 50 MB. I have two cases: 1) copy this array to another array of the same size, either element by element or with memcpy; 2) combine many small arrays into this gigantic array. Both are real-time operations, meaning they need to be done as fast as possible. Does multithreading give a speedup or a slowdown? Which factors affect the performance of memory operations in this case?

Someone said it will mostly depend on DMA performance. I think that applies when we use memcpy. What if we do an element-wise copy: does the data pass through the CPU cache first?

Answer 1:

It depends on many factors, one of which is the hardware you use. On modern PC hardware, multithreading will most likely not improve performance, because CPU time is not the limiting factor of copy operations; the memory interface is. Note that a plain memcpy in user code is executed by the CPU's own load and store instructions rather than by a DMA controller, so the core is occupied during the copy, but it spends most of that time waiting on memory.
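One way to check how close a single thread gets to the limit of the memory interface is to time a plain memcpy. The following is a minimal sketch, assuming a C11 toolchain; `memcpy_rate_gbs` and `now_seconds` are hypothetical names chosen for illustration, not something from the answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Wall-clock time in seconds (C11 timespec_get; not a monotonic clock). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Copy `bytes` bytes `repeats` times with memcpy on the calling thread.
 * Returns the best observed copy rate in GB/s (bytes copied / second / 1e9),
 * or -1.0 if the buffers could not be allocated. */
double memcpy_rate_gbs(size_t bytes, int repeats)
{
    assert(bytes > 0 && repeats > 0);

    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    double best = 0.0;
    for (int r = 0; r < repeats; r++) {
        double t0 = now_seconds();
        memcpy(dst, src, bytes);
        double dt = now_seconds() - t0;
        if (dt > 0.0) {
            double rate = (double)bytes / dt / 1e9;
            if (rate > best)
                best = rate;
        }
    }

    /* Read from dst so the compiler is less likely to discard the copies. */
    volatile char sink = dst[bytes - 1];
    (void)sink;

    free(src);
    free(dst);
    return best;
}
```

Calling `memcpy_rate_gbs(50u * 1024 * 1024, 10)` approximates the question's 50 MB case. A copy both reads and writes, so the actual memory traffic is at least twice the reported rate; comparing that figure with the platform's theoretical DRAM bandwidth gives a rough idea of how much headroom extra threads could use.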

Answer 2:

Over the years, CPU performance has increased enormously, roughly exponentially, and RAM performance could not keep up. This made the cache more important, especially after the Celeron.

So multithreading can either increase or decrease performance, depending heavily on:

the memory fetch and memory store units per core
the memory controller modules
the pipeline depths of memory modules and the enumeration of memory banks
the memory access pattern of each thread (software)
the alignment of data chunks and instruction blobs
the sharing of common hardware resources and their datapaths
the operating system preempting the threads too often

Simply optimize the code for the cache; after that, the quality of the CPU decides the performance.
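To see how much the access pattern alone matters, one can copy the same data in two different orders and time each. This is a minimal sketch with hypothetical function names (`copy_sequential`, `copy_strided`); it only illustrates the two patterns and makes no claim about the size of the difference on any particular CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n doubles in address order: consecutive accesses share cache lines. */
void copy_sequential(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Copy the same n doubles, but visit them in jumps of `stride` elements:
 * indices 0, stride, 2*stride, ..., then 1, 1+stride, ..., and so on.
 * Every index is written exactly once. With a large stride, nearly every
 * access touches a different cache line. */
void copy_strided(double *dst, const double *src, size_t n, size_t stride)
{
    assert(stride > 0);
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < n; i += stride)
            dst[i] = src[i];
}
```

Both functions produce the same result, so any timing difference between them on a large array comes from how the memory system handles the access order rather than from the amount of work.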

Example:

The FX8150 has weaker cores than an i7-4700:

FX cores can scale with extra threads, but the i7 tops out with just a single thread (I mean for memory-heavy code)
The FX has more L3 cache, but it is slower
The FX can work with higher-frequency RAM, but the i7 has better inter-core data bandwidth (in case one thread sends data to another)
The FX pipeline is too long, so it takes too long to recover after a branch

It looks like AMD spreads performance across threads at a finer grain, while Intel gives the power to a single thread (council assembly vs. monarchy). Maybe that's why AMD is better at GPUs and HBM.

If I had to stop speculating, I would care only about the cache, since it cannot be changed in a CPU, while RAM can come in many combinations on a motherboard.

Answer 3:

Assuming AMD/Intel64 architecture.

One core is not capable of saturating the memory bandwidth, but that does not mean multithreading is automatically faster. For that, the threads must run on different cores. Launching as many threads as there are physical cores should give a speedup, since the OS will most likely assign the threads to different cores; however, your threading library should have a function for binding a thread to a specific core, and using it is best for speed. Another thing to think about is NUMA, if you have a multi-socket system. For maximum speed, you should also think about using AVX instructions.
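The idea of splitting one large copy across several threads can be sketched as follows. This is a minimal illustration assuming POSIX threads; `parallel_copy`, `copy_worker`, and `copy_slice` are hypothetical names, and the sketch leaves thread placement to the OS rather than binding threads to cores.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One contiguous slice of the copy, handled by one thread. */
typedef struct {
    double *dst;
    const double *src;
    size_t count; /* number of doubles in this slice */
} copy_slice;

static void *copy_worker(void *arg)
{
    copy_slice *s = arg;
    memcpy(s->dst, s->src, s->count * sizeof(double));
    return NULL;
}

/* Copy n doubles from src to dst using nthreads threads, each copying one
 * contiguous slice. Returns 0 on success, or -1 if bookkeeping memory could
 * not be allocated or a thread could not be created (in which case the copy
 * may be incomplete). */
int parallel_copy(double *dst, const double *src, size_t n, unsigned nthreads)
{
    assert(nthreads > 0);

    pthread_t *tids = malloc(nthreads * sizeof *tids);
    copy_slice *slices = malloc(nthreads * sizeof *slices);
    if (!tids || !slices) {
        free(tids);
        free(slices);
        return -1;
    }

    /* Spread the remainder over the first `extra` slices. */
    size_t base = n / nthreads, extra = n % nthreads, offset = 0;
    unsigned started = 0;
    int rc = 0;

    for (unsigned i = 0; i < nthreads; i++) {
        size_t count = base + (i < extra ? 1 : 0);
        slices[i] = (copy_slice){ dst + offset, src + offset, count };
        offset += count;
        if (pthread_create(&tids[i], NULL, copy_worker, &slices[i]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }

    /* Wait for every thread that was actually started. */
    for (unsigned i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    free(slices);
    return rc;
}
```

Timing `parallel_copy` for 1, 2, 4, ... threads on the target machine shows where the speedup levels off. Binding each thread to a core is platform-specific; on Linux with glibc, for example, the non-portable `pthread_setaffinity_np` extension can be used for that. On older toolchains the program may need to be built with `-pthread`.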

Source: https://stackoverflow.com/questions/42099924/is-multi-thread-memory-access-faster-than-single-threaded-memory-access

Tags

multithreading

memory