Parallel for loop over range of array indices in C++17

Question

I need to update a 100M-element array and would like to do it in parallel. std::for_each(std::execution::par, ...) seems great for this, except that the update needs to access elements of other arrays depending on the index that I am updating. A minimal serial working example of the kind of thing I'm trying to parallelize might look like this:

for (size_t i = 0; i < 100'000'000; i++)
    d[i] = combine(d[i], s[2*i], s[2*i+1]);

I could of course manually spawn threads, but that is a lot more code than std::for_each, so it would be great to find an elegant way to do this with the standard library. So far I have found some not very elegant ways of using for_each, for instance:

Compute the index by using pointer arithmetic on the address of the array element.
Implement my own bogus iterator in the spirit of boost's counting_range.

Is there a better way to do this?
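
For reference, the first workaround listed above (recovering the index from the element's address) might look like the sketch below. The body of combine() is a placeholder, since the question does not show it, and update_by_address is a hypothetical name. The call is written serially; the comment marks where the execution policy would go.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Placeholder for the question's combine(); the real function is not shown.
inline double combine(double a, double b, double c) { return a + b * c; }

// Updates d in place. Assumes s holds at least 2 * d.size() elements.
inline void update_by_address(std::vector<double>& d, const std::vector<double>& s) {
    const double* base = d.data();
    // For a parallel run, pass std::execution::par (from <execution>)
    // as the first argument of std::for_each.
    std::for_each(d.begin(), d.end(), [&s, base](double& x) {
        // Recover the index from the element's address. This is only valid
        // for contiguous storage such as std::vector or a plain array.
        const std::size_t i = static_cast<std::size_t>(&x - base);
        x = combine(x, s[2 * i], s[2 * i + 1]);
    });
}
```

Each invocation writes only its own element of d and reads from s, so the lambda has no shared mutable state.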

Answer 1:

std::ranges should help if you have access to C++20: you can iterate over the indexes rather than over the data itself:

#include <ranges>
#include <vector>
#include <algorithm>
#include <iostream>

int main() {
    std::vector<int> d(100);
    std::ranges::iota_view indexes((size_t)0, d.size());
    std::for_each(indexes.begin(), indexes.end(), [&d](size_t i)
    {
        std::cout << i << "," << d[i] << "\n";
    });
    return 0;
}
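
If C++20 is not available, one fallback in the same spirit is to materialize the indexes in a vector with std::iota and iterate over that. This is a sketch under assumptions: combine() is a placeholder body and update_by_index_vector is a hypothetical name. It trades memory for simplicity, since 100M std::size_t indexes take roughly 800 MB on a typical 64-bit platform.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Placeholder for the question's combine(); the real function is not shown.
inline double combine(double a, double b, double c) { return a + b * c; }

// Updates d in place. Assumes s holds at least 2 * d.size() elements.
inline void update_by_index_vector(std::vector<double>& d, const std::vector<double>& s) {
    std::vector<std::size_t> indexes(d.size());
    std::iota(indexes.begin(), indexes.end(), std::size_t{0});  // 0, 1, ..., n-1

    // For a parallel run, pass std::execution::par (from <execution>)
    // as the first argument of std::for_each.
    std::for_each(indexes.begin(), indexes.end(), [&d, &s](std::size_t i) {
        d[i] = combine(d[i], s[2 * i], s[2 * i + 1]);
    });
}
```

Vector iterators are ordinary random-access iterators, so this form should be accepted by the parallel overloads on any conforming C++17 library. Whether iota_view iterators are accepted by those overloads may depend on the standard library version, so it is worth checking on your toolchain.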

Answer 2:

You should be able to iterate over the indexes rather than the items. I think C++20 std::ranges gives you an easy way to do this, or you can use one of the Boost range methods. I'm not sure why you would consider rolling your own in the spirit of Boost counting_range when you could just, well, use Boost :-)

Having said that, I've actually opted for the roll-your-own approach here, simply to keep the code self-contained with neither C++20 nor Boost. Feel free to replace paxrange with one of the other methods, depending on your needs:

#include <iostream>
#include <algorithm>
#include <vector>

// Seriously, just use Boost :-)

class paxrange {
    public:
        class iterator {
            friend class paxrange;
            public:
                long int operator *() const { return value; }
                iterator &operator ++() { ++value; return *this; }
                iterator operator ++(int) { iterator copy(*this); ++value; return copy; }

                bool operator ==(const iterator &other) const { return value == other.value; }
                bool operator !=(const iterator &other) const { return value != other.value; }

            protected:
                iterator(long int start) : value (start) { }

            private:
                long int value; // Same type as the constructor argument and operator*.
        };

        iterator begin() const { return beginVal; }
        iterator end() const { return endVal; }
        paxrange(long int  begin, long int end) : beginVal(begin), endVal(end) {}
    private:
        iterator beginVal;
        iterator endVal;
};
int main() {
    // Create a source and destination collection.

    std::vector<int> s;
    s.push_back(42); s.push_back(77); s.push_back(144);
    s.push_back(12); s.push_back(6);
    std::vector<int> d(5);

    // Shows how to use indexes with multiple collections sharing index.

    // Capture s by reference: capturing by value would copy the whole vector.
    auto process = [&s, &d](const int idx) { d[idx] = s[idx] + idx; };
    paxrange x(0, d.size());
    std::for_each(x.begin(), x.end(), process); // add parallelism later.

    // Debug output.

    for (const auto &item: s) std::cout << "< " << item << '\n';
    std::cout << "=====\n";
    for (const auto &item: d) std::cout << "> " << item << '\n';
}

The "meat" of the solution is the three lines in the middle of main(), where you set up a function for call-backs, one that takes the index rather than the item itself.

Inside that function, you use that index plus as many collections as needed, to set up the destination collection, very similar to what you desire.

In my case, I simply wanted the output vector to be the input vector but with the index added to each element, as per the output:

< 42
< 77
< 144
< 12
< 6
=====
> 42
> 78
> 146
> 15
> 10
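
The parallel overloads of the standard algorithms require at least forward iterators and look up the iterator's properties through std::iterator_traits, which the minimal iterator above does not declare. The sketch below shows what a counting iterator with those member types might look like. index_iterator and for_each_index are hypothetical names, and the helper is written serially, with a comment marking where the execution policy would go.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Hypothetical counting iterator: dereferencing yields the current index.
class index_iterator {
public:
    // Member types picked up by std::iterator_traits.
    using iterator_category = std::random_access_iterator_tag;
    using value_type        = std::size_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const std::size_t*;
    using reference         = const std::size_t&;

    index_iterator() = default;
    explicit index_iterator(std::size_t v) : value_(v) {}

    reference operator*() const { return value_; }
    value_type operator[](difference_type n) const { return value_ + static_cast<std::size_t>(n); }

    index_iterator& operator++() { ++value_; return *this; }
    index_iterator operator++(int) { auto old = *this; ++value_; return old; }
    index_iterator& operator--() { --value_; return *this; }
    index_iterator operator--(int) { auto old = *this; --value_; return old; }
    index_iterator& operator+=(difference_type n) { value_ += static_cast<std::size_t>(n); return *this; }
    index_iterator& operator-=(difference_type n) { value_ -= static_cast<std::size_t>(n); return *this; }

    friend index_iterator operator+(index_iterator it, difference_type n) { return it += n; }
    friend index_iterator operator+(difference_type n, index_iterator it) { return it += n; }
    friend index_iterator operator-(index_iterator it, difference_type n) { return it -= n; }
    friend difference_type operator-(index_iterator a, index_iterator b) {
        return static_cast<difference_type>(a.value_) - static_cast<difference_type>(b.value_);
    }

    friend bool operator==(index_iterator a, index_iterator b) { return a.value_ == b.value_; }
    friend bool operator!=(index_iterator a, index_iterator b) { return a.value_ != b.value_; }
    friend bool operator<(index_iterator a, index_iterator b) { return a.value_ < b.value_; }
    friend bool operator>(index_iterator a, index_iterator b) { return a.value_ > b.value_; }
    friend bool operator<=(index_iterator a, index_iterator b) { return a.value_ <= b.value_; }
    friend bool operator>=(index_iterator a, index_iterator b) { return a.value_ >= b.value_; }

private:
    std::size_t value_ = 0;
};

// Calls f(i) for every i in [0, n).
template <class F>
void for_each_index(std::size_t n, F f) {
    // For a parallel run, pass std::execution::par (from <execution>)
    // as the first argument of std::for_each.
    std::for_each(index_iterator(0), index_iterator(n), f);
}
```

One caveat: operator* returns a reference to a value stored inside the iterator itself, which arguably falls short of the strict forward-iterator requirements. Treat this as a sketch of the idea and prefer a well-tested implementation such as Boost's where possible.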

Answer 3:

There is a simple header-only library on GitHub which might help you.

Your minimal example can be parallelized as follows. However, presumably because of cache effects, the runtime will not scale down linearly with the number of cores.

#include "Lazy.h"
#include <algorithm>
#include <cmath>
#include <thread>
#include <utility>
#include <vector>

double combine(double a, double b, double c)
{
    if (b > 0.5 && c < 0.4)
        return a + std::exp(b * c + 1);
    else if (b*c < 0.2)
        return a * 0.8 + (1-c) * (1-b);
    else
        return std::exp(1.0 / a) + b + c;
}

// Generate index split for parallel tasks
auto getIndexPairs(std::size_t N, std::size_t numSplits)
{
    std::vector<std::pair<std::size_t, std::size_t>> vecPairs(numSplits);
    double dFrom = 0, dTo = 0;
    for (std::size_t i = 0; i < numSplits; ++i) {
        dFrom = dTo;
        dTo += N / double(numSplits);
        vecPairs[i] = {std::size_t(dFrom), std::min(std::size_t(dTo), N)};
    }
    vecPairs[numSplits-1].second = N;
    return vecPairs;
}

int main(int argc, char** argv) {
    const std::size_t N = 100000000;
    // Number of parallel threads; hardware_concurrency() may return 0, so use at least 1.
    const std::size_t C = std::max<std::size_t>(1, std::thread::hardware_concurrency());
    std::vector<double> d(N);
    std::vector<double> s(2*N);

    // Fill d and s with some values
    for (std::size_t i = 0; i < N; ++i) {
        s[i] = double(i) / N;
        s[i + N] = double(i + N) / N;
        d[i] = N - i;
    }
     
    // Run combine(...) in parallel in C threads
    Lazy::runForAll(getIndexPairs(N, C), [&](auto pr) {
        for (std::size_t i = pr.first; i < pr.second; ++i)
            d[i] = combine(d[i], s[2*i], s[2*i+1]);
        return nullptr; // Dummy return value
    });
}
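
The same chunking idea can be sketched with std::thread alone, without the library. This is not the library's implementation, only an illustration under assumptions: combine() is a placeholder body, parallel_update is a hypothetical name, and the split uses integer arithmetic instead of the floating-point accumulation in getIndexPairs.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for combine(); any pure function of three doubles works here.
inline double combine(double a, double b, double c) { return a + b * c; }

// Updates d in place using up to numThreads worker threads.
// Assumes s holds at least 2 * d.size() elements.
inline void parallel_update(std::vector<double>& d, const std::vector<double>& s,
                            std::size_t numThreads) {
    const std::size_t n = d.size();
    // Use at least one thread, and never more threads than elements.
    numThreads = std::max<std::size_t>(1, std::min(numThreads, n));

    // The first `extra` chunks get one additional element each.
    const std::size_t base = n / numThreads;
    const std::size_t extra = n % numThreads;

    std::vector<std::thread> workers;
    workers.reserve(numThreads);
    std::size_t from = 0;
    for (std::size_t t = 0; t < numThreads; ++t) {
        const std::size_t to = from + base + (t < extra ? 1 : 0);
        // Each worker writes a disjoint range [from, to) of d and only reads s.
        workers.emplace_back([&d, &s, from, to] {
            for (std::size_t i = from; i < to; ++i)
                d[i] = combine(d[i], s[2 * i], s[2 * i + 1]);
        });
        from = to;
    }
    for (auto& w : workers) w.join();
}
```

A call such as parallel_update(d, s, std::thread::hardware_concurrency()) would use one thread per reported hardware thread. The clamp at the top also covers the case where hardware_concurrency() returns 0.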

Source: https://stackoverflow.com/questions/62828567/parallel-for-loop-over-range-of-array-indices-in-c17

Tags

c++

parallel-processing

c++17