I wrote some Naiive GEMM code and I am wondering why it is much slower than the equivalent single threaded GEMM code.
With a 200x200 matrix, Single Threaded: 7ms, Mu
In general multi-threading is well applicable for tasks which take a lot of time, most favourably because of complexity and not device access. The loop you showed us takes to short to execute for it to be effectively parallelized.
You have to remember that there is much overhead with thread creation. There is also some (but significantly less) overhead with synchronization.