I wrote some naive multi-threaded GEMM code and I am wondering why it is much slower than the equivalent single-threaded GEMM code.
With a 200x200 matrix, Single Threaded: 7ms, Mu
Multi-threading always means synchronization, context switching, and function calls. This all adds up and costs CPU cycles that you could otherwise spend on the main task itself.
If you just have a third nested loop, you save all these steps and can do the computation inline instead of in a subroutine, where you must set up a stack, call into it, switch to a different thread, return the result, and switch back to the main thread.
Multi-threading is only useful if these costs are small compared to the main task. I guess you will see better results with multi-threading when the matrix is larger than just 200x200.
In general, multi-threading works well for tasks that take a lot of time, preferably because of computational complexity and not device access. The loop you showed us takes too short a time to execute for it to be parallelized effectively.
You have to remember that thread creation carries considerable overhead. There is also some (but significantly less) overhead from synchronization.
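To get a feeling for the thread creation overhead on your own machine, you could time a trivial job run once per thread against the same job run inline. The following is only a rough sketch: the helper name `time_trivial_jobs_us` and its parameters are made up for illustration, and the absolute numbers will vary a lot between systems.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not from the question): runs `tasks` trivial jobs and
// returns the elapsed wall-clock time in microseconds. Each job only writes
// one int, so the measured time is dominated by thread creation and joining
// when one_thread_per_job is true.
double time_trivial_jobs_us(std::size_t tasks, bool one_thread_per_job,
                            std::vector<int>& out) {
    out.assign(tasks, 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < tasks; ++i) {
        if (one_thread_per_job) {
            // Create a thread for a single tiny job and wait for it.
            std::thread worker([&out, i] { out[i] = 1; });
            worker.join();
        } else {
            // Same job, done inline on the calling thread.
            out[i] = 1;
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

Calling it once with `true` and once with `false` for the same number of jobs (say 40000, the number of threads mentioned for a 200x200 matrix) should give a rough idea of how much time goes into thread management alone.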
I do not have experience with GEMM, but your problem seems to be related to issues that appear in all kind of multi-threading scenarios.
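When using multi-threading, you introduce a couple of potential overheads, the most common of which usually are:
1. Creating and starting threads
2. Context switches, when there are more threads than CPU cores
3. Waiting for locks held by other threads
4. CPU cache synchronization between cores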
Items 2. and 3. probably don't play a role in your example: you are using 12 threads on 12 (hyperthreading) cores, and your algorithm does not involve locks.
However, 1. might be relevant in your case: you are creating a total of 40000 threads, each of which multiplies and adds 200 values. I'd suggest trying less fine-grained threading, maybe only splitting after the first loop. It's always a good idea not to split up the problem into pieces smaller than necessary.
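As a rough illustration of splitting only at the outermost loop, here is a minimal sketch. It is not your code: the names `Matrix`, `gemm_rows` and `gemm_coarse` are made up, and I'm assuming square, row-major `float` matrices stored in flat vectors. Each thread gets one contiguous block of output rows, so only `num_threads` threads are created in total.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: n x n matrices in row-major order, stored in a flat vector.
using Matrix = std::vector<float>;

// Computes rows [row_begin, row_end) of C = A * B on the calling thread.
void gemm_rows(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n,
               std::size_t row_begin, std::size_t row_end) {
    for (std::size_t i = row_begin; i < row_end; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;  // accumulate locally, write to c only once
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// Splits the outer (row) loop into at most num_threads contiguous blocks.
Matrix gemm_coarse(const Matrix& a, const Matrix& b, std::size_t n,
                   unsigned num_threads) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0f);
    num_threads = std::max(1u, num_threads);
    const std::size_t rows_per_thread = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * rows_per_thread);
        const std::size_t end = std::min(n, begin + rows_per_thread);
        if (begin == end) break;  // more threads than rows: nothing left to do
        workers.emplace_back([&a, &b, &c, n, begin, end] {
            gemm_rows(a, b, c, n, begin, end);
        });
    }
    for (auto& w : workers) w.join();
    return c;
}
```

You could call it as `gemm_coarse(a, b, 200, std::thread::hardware_concurrency())`. Whether this actually beats the single-threaded version for 200x200 is something you would have to measure.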
Also 4. will very likely be important in your case. While you're not running into a race condition when writing the results to the array (because every thread is writing to its own index position), you are very likely to provoke a large overhead of cache syncs.
"Why?" you might think, because you're writing to different places in memory. That's because a typical CPU cache is organized in cache lines, which on current Intel and AMD CPU models are 64 bytes long. This is the smallest unit that can be transferred to and from the cache when something is changed. Since all CPU cores are reading and writing adjacent memory words, this leads to synchronization of 64 bytes between all the cores whenever you write just 4 bytes (or 8, depending on the size of the data type you're using).
If memory is not an issue, you can simply "pad" every output array element with "dummy" data so that there is only one output element per cache line. If you're using 4-byte data types, this would mean skipping 15 array elements for each real data element. The cache issues will also improve when you make your threading less fine-grained, because every thread will then access its own contiguous region in memory, practically without interfering with other threads' memory.
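Instead of skipping array elements by hand, the padding can also be expressed with `alignas`. The sketch below is only an illustration under a few assumptions: a 64-byte cache line, a C++17 compiler (so that `std::vector` respects the over-alignment), and made-up names (`PaddedFloat`, `partial_sums`). It sums slices of an array rather than doing a GEMM, just to keep it short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines, as on current Intel and AMD CPUs.
constexpr std::size_t kCacheLine = 64;

// Hypothetical wrapper: alignas rounds the struct size up to a full cache
// line, so two neighbouring array elements never share a line.
struct alignas(kCacheLine) PaddedFloat {
    float value = 0.0f;
};
static_assert(sizeof(PaddedFloat) == kCacheLine, "one element per cache line");

// Each thread sums its own slice of data into its own padded output slot.
std::vector<PaddedFloat> partial_sums(const std::vector<float>& data,
                                      unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<PaddedFloat> sums(num_threads);
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &sums, chunk, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) {
                sums[t].value += data[i];  // touches only this thread's line
            }
        });
    }
    for (auto& w : workers) w.join();
    return sums;
}
```

The price is memory: every 4-byte result now occupies 64 bytes. Accumulating into a local variable and writing the result once, as a coarser-grained split naturally does, often gets most of the benefit without the padding.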
Edit: A more detailed description by Herb Sutter (one of the Gurus of C++) can be found here: http://www.drdobbs.com/parallel/maximize-locality-minimize-contention/208200273
Edit2: BTW, it's suggested to avoid std::move in the return statement, as this might get in the way of return-value optimization and copy elision, which compilers apply automatically (and which the standard now mandates in some cases). See Is returning with `std::move` sensible in the case of multiple return statements?
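To see the difference, you can count move constructions with a small tracking type. This is just a sketch with made-up names (`Tracked`, `make_plain`, `make_moved`), and your compiler may warn about the second function.

```cpp
#include <cassert>
#include <utility>

// Hypothetical type that counts how often it has been move-constructed.
struct Tracked {
    int moves = 0;
    Tracked() = default;
    Tracked(const Tracked&) = default;
    Tracked(Tracked&& other) noexcept : moves(other.moves + 1) {}
};

// Returning the local by name: the compiler may elide the move entirely
// (NRVO); if it does not, the local is still moved, not copied.
Tracked make_plain() {
    Tracked t;
    return t;
}

// Returning std::move(t): the operand is no longer a plain local name,
// so elision is not allowed and the move constructor always runs.
Tracked make_moved() {
    Tracked t;
    return std::move(t);
}
```

With the explicit `std::move`, `make_moved().moves` is always 1. For `make_plain()` it is 0 when the compiler applies the optimization and at most 1 otherwise, so the plain `return t;` is never worse.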