Simple division of labour over threads is not reducing the time taken

吃可爱长大的小学妹 submitted on 2021-01-29 05:29:48

Question


I have been trying to improve computation times on a project by splitting the work into tasks/threads and it has not been working out very well. So I decided to make a simple test project to see if I can get it working in a very simple case and this also is not working out as I expected it to.

What I have attempted to do is:

  • do a task X times in one thread - check the time taken.
  • do a task X / Y times in Y threads - check the time taken.

So if 1 thread takes T seconds to do 100'000'000 iterations of "work" then I would expect:

  • 2 threads doing 50'000'000 iterations each would take ~ T / 2 seconds
  • 3 threads doing 33'333'333 iterations each would take ~ T / 3 seconds

and so on until I reach some threading limit (number of cores or whatever).

So I wrote the code and tested it on my 8-core system (AMD Ryzen, >16GB RAM) with nothing else running at the time.

  • 1 thread took: ~6.5s
  • 2 threads took: ~6.7s
  • 3 threads took: ~13.3s
  • 8 threads took: ~16.2s

So clearly something is not right here!

I ported the code into Godbolt and I see similar results. Godbolt only allows 3 threads, and for 1, 2 or 3 threads it takes ~8s (this varies by about 1s) to run. Here is the godbolt live code: https://godbolt.org/z/6eWKWr

Finally here is the code for reference:

#include <iostream>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <chrono>
#include <vector>
#include <thread>

#define randf() (((double) rand()) / ((double) RAND_MAX))

void thread_func(uint32_t iterations, uint32_t thread_id)
{
    // Print the thread id / workload
    std::cout << "starting thread: " << thread_id << " workload: " << iterations << std::endl;
    // Get the start time
    auto start = std::chrono::high_resolution_clock::now();
    // do some work for the required number of iterations
    for (auto i = 0u; i < iterations; i++)
    {
        double value = randf();
        double calc = std::atan(value);
        (void) calc;
    }
    // Get the time taken
    auto total_time = std::chrono::high_resolution_clock::now() - start;
    // Print it out
    std::cout << "thread: " << thread_id << " finished after: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
              << "ms" << std::endl;
}

int main()
{
    // Note these numbers vary by about a second, probably due to godbolt server load (?)
    // 1 Threads takes: ~8s
    // 2 Threads takes: ~8s
    // 3 Threads takes: ~8s
    uint32_t num_threads = 3; // Max 3 in godbolt
    uint32_t total_work = 100'000'000;

    // Seed rand
    std::srand(static_cast<unsigned int>(std::chrono::steady_clock::now().time_since_epoch().count()));

    // Store the start time
    auto overall_start = std::chrono::high_resolution_clock::now();

    // Start all the threads doing work
    std::vector<std::thread> task_list;
    for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
    {
        task_list.emplace_back([=](){ thread_func(total_work / num_threads, thread_id); });
    }

    // Wait for the threads to finish
    for (auto &task : task_list)
    {
        task.join();
    }

    // Get the end time and print it
    auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
    std::cout << "\n==========================\n"
              << "thread overall_total_time time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
              << "ms" << std::endl;
    return 0;
}

Note: I have tried using std::async also with no difference (not that I was expecting any). I also tried compiling for release - no difference.

I have read such questions as: why-using-more-threads-makes-it-slower-than-using-less-threads and I can't see an obvious (to me) bottleneck:

  • CPU bound (needs lots of CPU resources): I have 8 cores
  • Memory bound (needs lots of RAM resources): I have assigned my VM 10GB ram, running nothing else
  • I/O bound (Network and/or hard drive resources): No network traffic involved
  • There is no sleeping/mutexing going on here (like there is in my real project)

Questions are:

  • Why might this be happening?
  • What am I doing wrong?
  • How can I improve this?

Answer 1:


The rand function is not guaranteed to be thread safe. It appears that, in your implementation, it is made thread safe with a lock or mutex, so multiple threads trying to generate a random number at the same time have to take turns. As your loop is mostly just the call to rand, the performance suffers with multiple threads.

You can use the facilities of the <random> header and have each thread use its own engine to generate the random numbers.




Answer 2:


Never mind that rand() is or isn't thread safe. That might be the explanation if a statistician told you that the "random" numbers you were getting were defective in some way, but it doesn't explain the timing.

What explains the timing is that there is only one random state object, it's out in memory somewhere, and all of your threads are competing with each other to access it.

No matter how many CPUs your system has, only one thread at a time can access the same location in main memory.

It would be different if each of the threads had its own independent random state object. Then, most of the accesses from any given CPU to its own private random state would only have to go as far as the CPU's local cache, and they would not conflict with what the other threads, running on other CPUs, each with their own local cache, were doing.



Source: https://stackoverflow.com/questions/65242486/simple-division-of-labour-over-threads-is-not-reducing-the-time-taken
