I have a program that currently takes way too long to sum up large std::vectors of ~100 million elements using std::accumulate.
You can use Boost Asio as a thread pool. But there's not a lot of sense in it unless you have... asynchronous IO operations to coordinate.
In this answer to "c++ work queues with blocking" I show two thread_pool implementations:

- one based on boost::asio::io_service
- one based on boost::thread primitives

Both accept any void()-signature-compatible task. This means you could wrap your function-that-returns-the-important-results in a packaged_task<...> and get the future<RetVal> from it.
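Roughly like this (a sketch only; pool.post is a placeholder for whichever posting interface the pool exposes, e.g. io_service::post in the first variant):

#include <future>
#include <memory>
#include <utility>

// Wrap a callable that returns a value into a void() task, so it can be
// handed to a pool that only accepts void() signatures. `pool.post` is a
// stand-in for the pool's actual posting function.
template <typename Pool, typename F>
auto post_with_result(Pool& pool, F f) -> std::future<decltype(f())> {
    // packaged_task is move-only, so keep it alive in a shared_ptr
    // until the pool thread gets around to running it.
    auto task = std::make_shared<std::packaged_task<decltype(f())()>>(std::move(f));
    auto result = task->get_future();
    pool.post([task] { (*task)(); });
    return result;
}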
The main purpose of Boost.Asio is to provide an asynchronous model for network and I/O programming, and the problem you describe does not seem to have much to do with networking and I/O.
I think that the simplest solution is to use the threading primitives provided by either Boost or the C++ standard library.
Here's an example of a parallel version of accumulate created using only the standard library.
#include <algorithm>
#include <future>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>

/* Minimum number of elements for the multithreaded algorithm.
   Less than this and the work is not split up. */
static const int MT_MIN_SIZE = 10000;

template <typename InputIt, typename T>
auto parallel_accumulate(InputIt first, InputIt last, T init) {
    // Determine total size.
    const auto size = std::distance(first, last);
    // Determine how many parts the work shall be split into.
    // hardware_concurrency() may return 0, so fall back to a single part.
    const auto parts = (size < MT_MIN_SIZE)
        ? 1u
        : std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::future<T>> futures;
    futures.reserve(parts);

    // For each part, calculate its size and run accumulate on a separate thread.
    for (std::size_t i = 0; i != parts; ++i) {
        // This formula distributes the remainder evenly over the parts.
        const auto part_size = (size * i + size) / parts - (size * i) / parts;
        futures.emplace_back(std::async(std::launch::async,
            [=] { return std::accumulate(first, std::next(first, part_size), T{}); }));
        std::advance(first, part_size);
    }

    // Wait for all threads to finish execution and accumulate the results.
    return std::accumulate(std::begin(futures), std::end(futures), init,
        [] (T prev, auto& future) { return prev + future.get(); });
}
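For reference, a minimal driver along the lines of what produced the numbers below (the timing scaffolding here is my own):

#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // 100 million elements, as in the question; a long long init
    // avoids overflowing the sum.
    std::vector<int> v(100000000);
    std::iota(std::begin(v), std::end(v), 1);

    const auto start = std::chrono::steady_clock::now();
    const auto sum = parallel_accumulate(std::begin(v), std::end(v), 0LL);
    const auto stop = std::chrono::steady_clock::now();

    std::cout << "Time taken: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms\n" << sum << '\n';
}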
Live example (the parallel version performs about the same as the sequential one on Coliru, probably because only one core is available)
On my machine (using 8 threads) the parallel version gave, on average, a ~120 % boost in performance.
Sequential sum:
Time taken: 46 ms
5000000050000000
--------------------------------
Parallel sum:
Time taken: 21 ms
5000000050000000
However, the absolute gain for 100,000,000 elements is only marginal (25 ms), although the gain might be greater when accumulating an element type other than int.
As @sehe mentions in the comments, OpenMP might also provide a simple solution to this problem, e.g.
// Requires OpenMP support, e.g. compiling with -fopenmp on GCC/Clang.
template <typename T, typename U>
auto omp_accumulate(const std::vector<T>& v, U init) {
    U sum = init;

    // Each thread accumulates into its own private copy of `sum`;
    // the copies are combined with `+` once the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (std::size_t i = 0; i < v.size(); i++) {
        sum += v[i];
    }

    return sum;
}
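It is called the same way as the previous version, and the only build change is enabling OpenMP:

std::vector<int> v(100000000);
std::iota(std::begin(v), std::end(v), 1);

const auto sum = omp_accumulate(v, 0LL); // 5000000050000000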
On my machine this method performed the same as the parallel method using standard thread primitives.
Sequential sum:
Time taken: 46 ms
5000000050000000
--------------------------------
Parallel sum:
Time taken: 21 ms
Sum: 5000000050000000
--------------------------------
OpenMP sum:
Time taken: 21 ms
Sum: 5000000050000000