问题
I have been trying to improve computation times on a project by splitting the work into tasks/threads and it has not been working out very well. So I decided to make a simple test project to see if I can get it working in a very simple case and this also is not working out as I expected it to.
What I have attempted to do is:
- do a task X times in one thread - check the time taken.
- do a task X / Y times in Y threads - check the time taken.
So if 1 thread takes T seconds to do 100'000'000 iterations of "work" then I would expect:
- 2 threads doing 50'000'000 iterations each would take ~ T / 2 seconds
- 3 threads doing 33'333'333 iterations each would take ~ T / 3 seconds
and so on until I reach some threading limit (number of cores or whatever).
So I wrote the code and tested it on my 8 core system (AMD Ryzen) plenty of RAM >16GB doing nothing else at the time.
- 1 Threads took: ~6.5s
- 2 Threads took: ~6.7s
- 3 Threads took: ~13.3s
- 8 Threads took: ~16.2s
So clearly something is not right here!
I ported the code into Godbolt and I see similar results. Godbolt only allows 3 threads, and for 1, 2 or 3 threads it takes ~8s (this varies by about 1s) to run. Here is the godbolt live code: https://godbolt.org/z/6eWKWr
Finally here is the code for reference:
#include <iostream>
#include <math.h>
#include <vector>
#include <thread>
#define randf() ((double) rand()) / ((double) (RAND_MAX))
void thread_func(uint32_t interations, uint32_t thread_id)
{
// Print the thread id / workload
std::cout << "starting thread: " << thread_id << " workload: " << interations << std::endl;
// Get the start time
auto start = std::chrono::high_resolution_clock::now();
// do some work for the required number of interations
for (auto i = 0u; i < interations; i++)
{
double value = randf();
double calc = std::atan(value);
(void) calc;
}
// Get the time taken
auto total_time = std::chrono::high_resolution_clock::now() - start;
// Print it out
std::cout << "thread: " << thread_id << " finished after: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
<< "ms" << std::endl;
}
int main()
{
// Note these numbers vary by about probably due to godbolt servers load (?)
// 1 Threads takes: ~8s
// 2 Threads takes: ~8s
// 3 Threads takes: ~8s
uint32_t num_threads = 3; // Max 3 in godbolt
uint32_t total_work = 100'000'000;
// Seed rand
std::srand(static_cast<unsigned long>(std::chrono::steady_clock::now().time_since_epoch().count()));
// Store the start time
auto overall_start = std::chrono::high_resolution_clock::now();
// Start all the threads doing work
std::vector<std::thread> task_list;
for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
{
task_list.emplace_back(std::thread([=](){ thread_func(total_work / num_threads, thread_id); }));
}
// Wait for the threads to finish
for (auto &task : task_list)
{
task.join();
}
// Get the end time and print it
auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
std::cout << "\n==========================\n"
<< "thread overall_total_time time: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
<< "ms" << std::endl;
return 0;
}
Note: I have tried using std::async also with no difference (not that I was expecting any). I also tried compiling for release - no difference.
I have read such questions as: why-using-more-threads-makes-it-slower-than-using-less-threads and I can't see an obvious (to me) bottle neck:
CPU bound (needs lots of CPU resources)
: I have 8 coresMemory bound (needs lots of RAM resources)
: I have assigned my VM 10GB ram, running nothing elseI/O bound (Network and/or hard drive resources)
: No network trafic involved- There is no sleeping/mutexing going on here (like there is in my real project)
Questions are:
- Why might this be happening?
- What am I doing wrong?
- How can I improve this?
回答1:
The rand function is not guaranteed to be thread safe. It appears that, in your implementation, it is by using a lock or mutex, so if multiple threads are trying to generate a random number that take turns. As your loop is mostly just the call to rand
, the performance suffers with multiple threads.
You can use the facilities of the <random>
header and have each thread use it's own engine to generate the random numbers.
回答2:
Never mind that rand()
is or isn't thread safe. That might be the explanation if a statistician told you that the "random" numbers you were getting were defective in some way, but it doesn't explain the timing.
What explains the timing is that there is only one random state object, it's out in memory somewhere, and all of your threads are competing with each other to access it.
No matter how many CPUs your system has, only one thread at a time can access the same location in main memory.
It would be different if each of the threads had its own independent random state object. Then, most of the accesses from any given CPU to its own private random state would only have to go as far as the CPU's local cache, and they would not conflict with what the other threads, running on other CPUs, each with their own local cache were doing.
来源:https://stackoverflow.com/questions/65242486/simple-division-of-labour-over-threads-is-not-reducing-the-time-taken