Calculating the sum of a large vector in parallel

Asked 2021-01-06 05:03

Problem background

I have a program that currently takes way too long to sum up large std::vectors of ~100 million elements using std::accumulate.

2 Answers
  • 2021-01-06 05:23

    You can use Boost Asio as a thread pool. But there's not a lot of sense in it unless you have... asynchronous IO operations to coordinate.

    In this answer to "c++ work queues with blocking" I show two thread_pool implementations:

    • Solution #1: one based on boost::asio::io_service
    • Solution #2: the other based on boost::thread primitives

    Both accept any task compatible with the void() signature. This means you could wrap your function-that-returns-the-important-results in a packaged_task<...> and get the future<RetVal> from it.

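    For illustration, here's a minimal, self-contained sketch of that idea (my own example, not code from the linked answer), using an io_service-backed pool and a std::packaged_task:

    #include <boost/asio.hpp>
    #include <boost/thread.hpp>
    #include <future>
    #include <iostream>
    #include <memory>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> v(100000000, 1);

        // Bare-bones pool: `work` keeps io.run() from returning early.
        boost::asio::io_service io;
        auto work = std::make_unique<boost::asio::io_service::work>(io);
        boost::thread_group workers;
        for (unsigned i = 0; i < 4; ++i)
            workers.create_thread([&io] { io.run(); });

        // Wrap the returning computation in a packaged_task and keep its future.
        // The shared_ptr is needed because packaged_task is move-only while Asio
        // handlers must be copyable; the posted job is a plain void() callable.
        auto task = std::make_shared<std::packaged_task<long long()>>(
            [&v] { return std::accumulate(v.begin(), v.end(), 0LL); });
        std::future<long long> result = task->get_future();
        io.post([task] { (*task)(); });

        std::cout << result.get() << '\n';  // blocks until a worker ran the task

        work.reset();       // let io.run() return once the queue drains
        workers.join_all();
    }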
  • 2021-01-06 05:37

    Is Boost.Asio suitable for this problem?

    The main purpose of Boost.Asio is to provide an asynchronous model for network and I/O programming, and the problem you describe does not seem to have much to do with networking and I/O.

    I think that the simplest solution is to use the threading primitives provided by either Boost or the C++ standard library.

    A parallel algorithm

    Here's an example of a parallel version of accumulate that uses only the standard library.

    #include <algorithm>
    #include <future>
    #include <iterator>
    #include <numeric>
    #include <thread>
    #include <vector>

    /* Minimum number of elements for the multithreaded algorithm.
       Below this threshold the algorithm is executed on a single thread. */
    static const int MT_MIN_SIZE = 10000;

    template <typename InputIt, typename T>
    auto parallel_accumulate(InputIt first, InputIt last, T init) {
        // Determine total size.
        const auto size = std::distance(first, last);
        // Determine how many parts the work shall be split into.
        // hardware_concurrency() may return 0, so fall back to one thread.
        const auto hw = std::max(1u, std::thread::hardware_concurrency());
        const auto parts = (size < MT_MIN_SIZE) ? 1u : hw;

        std::vector<std::future<T>> futures;

        // For each part, calculate its size and run accumulate on a separate thread.
        for (std::size_t i = 0; i != parts; ++i) {
            // Fair split: consecutive prefix boundaries, so the part sizes
            // differ by at most one element.
            const auto part_size = (size * (i + 1)) / parts - (size * i) / parts;
            futures.emplace_back(std::async(std::launch::async,
                [=] { return std::accumulate(first, std::next(first, part_size), T{}); }));
            std::advance(first, part_size);
        }

        // Wait for all threads to finish execution and accumulate their results.
        return std::accumulate(std::begin(futures), std::end(futures), init,
            [] (const T prev, auto& future) { return prev + future.get(); });
    }
    

    Live example (the parallel version performs about the same as the sequential one on Coliru, probably because only one core is available)
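
    For reference, a minimal driver for the function above (assuming it is in scope); the vector is filled so the expected sum matches the output below:

    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<long long> v(100000000);
        std::iota(v.begin(), v.end(), 1LL);  // 1, 2, ..., 100000000
        // 0LL deduces T = long long; the sum would overflow a 32-bit int.
        std::cout << parallel_accumulate(v.begin(), v.end(), 0LL) << '\n';
        // prints 5000000050000000
    }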

    Timings

    On my machine (using 8 threads) the parallel version gave, on average, a ~120% performance improvement.

    Sequential sum:
    Time taken: 46 ms
    5000000050000000
    --------------------------------
    Parallel sum:
    Time taken: 21 ms
    5000000050000000

    However, the absolute gain for 100,000,000 elements is marginal (25 ms), though the gain may be larger when accumulating a more expensive element type than int.

    OpenMP

    As noted by @sehe in the comments, OpenMP can provide a simple solution to this problem, e.g.

    #include <cstddef>
    #include <vector>

    template <typename T, typename U>
    auto omp_accumulate(const std::vector<T>& v, U init) {
        U sum = init;

        // Each thread accumulates into its own private copy of `sum`;
        // the copies are combined with `+` when the loop finishes.
        #pragma omp parallel for reduction(+:sum)
        for (std::size_t i = 0; i < v.size(); i++) {
            sum += v[i];
        }

        return sum;
    }
    
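    A matching driver sketch (again, my example, assuming the function above is in scope). Note that OpenMP must be enabled at compile time, e.g. g++ -fopenmp; otherwise the pragma is ignored and the loop runs sequentially:

    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<long long> v(100000000);
        std::iota(v.begin(), v.end(), 1LL);
        std::cout << omp_accumulate(v, 0LL) << '\n';  // prints 5000000050000000
    }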

    On my machine this method performed the same as the parallel method using standard thread primitives.

    Sequential sum:
    Time taken: 46 ms
    5000000050000000
    --------------------------------
    Parallel sum:
    Time taken: 21 ms
    Sum: 5000000050000000
    --------------------------------
    OpenMP sum:
    Time taken: 21 ms
    Sum: 5000000050000000
