performance problems in parallel mergesort C++

问题

I have tried to write a parallel implementation of mergesort using threads and templates. The relevant code is listed below.

I have compared the performance with sort from the C++ STL. My code is 6 times slower than std::sort when no threads are spawned. Playing with the variable maxthreads (and/or FACTOR) I was able to only double the performance, so that in the best case I am 3 times slower than std::sort. I have tried the code on a 16 core multiprocessor machine.

htop shows that the cores are used as expected, but why the lack in performance and I do not feel the parallelism in the overall runtime ?

Is there an error ?

Thank you for a reply.

#define FACTOR 1
static unsigned int maxthreads = FACTOR * std::thread::hardware_concurrency();

unsigned int workers=0;
std::mutex g_mutex;

template <typename T>
std::vector<T>* mergesort_inplace_multithreading(
    typename std::vector<T>::iterator* listbegin,
    typename std::vector<T>::iterator *listend,
    std::vector<T>* listarg)
{
    if (*listbegin == *listend)
    {
        return listarg;
    }
    else if (*listend == *listbegin + 1)
    {
        return listarg;
    }
    else
    {
        size_t offset = std::distance(*listbegin, *listend)/2;
        typename std::vector<T>::iterator listhalf = *listbegin + offset;
        g_mutex.lock();
        if (::workers <= maxthreads-2 and maxthreads >=2)
        {
            workers += 2;

            g_mutex.unlock();

            std::thread first_thread(mergesort_inplace_multithreading<T>, listbegin, &listhalf, listarg);
            std::thread second_thread(mergesort_inplace_multithreading<T>, &listhalf, listend, listarg);
            first_thread.join();
            second_thread.join();
            g_mutex.lock();
            workers -= 2;
            g_mutex.unlock();
        }
        else
        {
            g_mutex.unlock();
            mergesort_inplace_multithreading<T>(listbegin, &listhalf, listarg);
            mergesort_inplace_multithreading<T>(&listhalf, listend, listarg);
        }

        typename std::vector<T> result;
        typename std::vector<T>::iterator lo_sorted_it = *listbegin;
        typename std::vector<T>::iterator hi_sorted_it = listhalf;
        typename std::vector<T>::iterator lo_sortedend = listhalf;
        typename std::vector<T>::iterator hi_sortedend = *listend;
        while (lo_sorted_it != lo_sortedend and hi_sorted_it != hi_sortedend)
        {
            if (*lo_sorted_it <= *hi_sorted_it)
            {
                result.push_back(*lo_sorted_it);
                ++lo_sorted_it;
            }
            else
            {
                result.push_back(*hi_sorted_it);
                ++hi_sorted_it;
            }

        }//end while

        if (lo_sorted_it != lo_sortedend)
        {
            //assert(hi_sorted_it == hi_sortedend);
            result.insert(result.end(), lo_sorted_it, lo_sortedend);
        }
        else
        {
            //assert(lo_sorted_it == lo_sortedend);
            result.insert(result.end(), hi_sorted_it, hi_sortedend);
        }
        std::copy(result.begin(), result.end(), *listbegin);
        return listarg;
    }
}

int main()
{
    //some tests
}

回答1:

You don't need a mutex for parallel mergesort. And you certainly don't need to launch two threads for each split of partitions. You launch one thread; the second partition is handled on the current thread; a much better use of thread resources than one thread sitting doing nothing but waiting for two others to finish.

First, the simple test program, sorting 20-million unsigned integers. Note: All programs compiled with Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn), 64bit, posix threads, and optimizations set at O2

Test Program

int main()
{
    using namespace std::chrono;

    std::random_device rd;
    std::mt19937 rng(rd());
    std::uniform_int_distribution<unsigned int> dist(0, std::numeric_limits<unsigned int>::max());

    std::vector<unsigned int> v, back(20*1000000);

    for (int i=0; i<5; ++i)
    {
        std::cout << "Generating...\n";
        std::generate_n(back.begin(), back.size(), [&](){return dist(rng);});

        time_point<system_clock> t0, t1;

        v = back;
        std::cout << "std::sort: ";
        t0 = system_clock::now();
        std::sort(v.begin(), v.end());
        t1 = system_clock::now();
        std::cout << duration_cast<milliseconds>(t1-t0).count() << "ms\n";

        v = back;
        std::cout << "mergesort_mt1: ";
        t0 = system_clock::now();
        mergesort_mt1(v.begin(), v.end());
        t1 = system_clock::now();
        std::cout << duration_cast<milliseconds>(t1-t0).count() << "ms\n";
    }

    return 0;
}

Parallel Mergesort

We start with something super-basic. We cap the number of concurrent threads to be the reported hardware concurrency from the standard library. Once we reach that limit, we stop issuing new threads and simply recurse on our existing ones. This trivial algorithm has surprisingly decent behavior once distributed across hardware-supported threads.

template<typename Iter>
void mergesort_mt1(Iter begin, Iter end,
                  unsigned int N = std::thread::hardware_concurrency()/2)
{
    auto len = std::distance(begin, end);
    if (len < 2)
        return;

    Iter mid = std::next(begin, len/2);
    if (N > 1)
    {
        auto fn = std::async(mergesort_mt1<Iter>, begin, mid, N-2);
        mergesort_mt1(mid, end, N-2);
        fn.wait();
    }
    else
    {
        mergesort_mt1(begin, mid, 0);
        mergesort_mt1(mid, end, 0);
    }

    std::inplace_merge(begin, mid, end);
}

Output

Generating...
std::sort: 1902ms
mergesort_mt1: 1609ms
Generating...
std::sort: 1894ms
mergesort_mt1: 1584ms
Generating...
std::sort: 1881ms
mergesort_mt1: 1589ms
Generating...
std::sort: 1840ms
mergesort_mt1: 1580ms
Generating...
std::sort: 1841ms
mergesort_mt1: 1631ms

This looks hopeful, but certainly it it can be improved.

Parallel Merge + Standard Library Sort

The std::sort algorithm varies widely from vendor to vendor in its implementation. The main restriction placed by the standard is it must have average complexity O(NlogN). To achieve this on the side of performance, many std::sort algorithms are some of the most complex, nutcase-optimized code you'll find in the standard library. I've perused some implementations that have several internal sort characteristics. One such implementation I've seen uses introsort (quicksort until recursion-depth is limited, then heapsort) for larger partitions, and once small partitions are reached, succumbs to a mammoth hand-unrolled 16-slot insertion-sort.

The point is, the standard library authors understand that one universal sort-algorithm simply does not fit all. Several are often employed for the task, often working together in harmony. Don't naively think you can beat them; rather, join them by exploiting their hard work.

Modifying our code is straightforward. We use std::sort for all partitions smaller than 1025. The rest is identical:

template<typename Iter>
void mergesort_mt2(Iter begin, Iter end,
                   unsigned int N = std::thread::hardware_concurrency())
{
    auto len = std::distance(begin, end);
    if (len <= 1024)
    {
        std::sort(begin,end);
        return;
    }

    Iter mid = std::next(begin, len/2);
    if (N > 1)
    {
        auto fn = std::async(mergesort_mt2<Iter>, begin, mid, N-2);
        mergesort_mt2(mid, end, N-2);
        fn.wait();
    }
    else
    {
        mergesort_mt2(begin, mid, 0);
        mergesort_mt2(mid, end, 0);
    }

    std::inplace_merge(begin, mid, end);
}

After adding our new test case to the test program, we get:

Output

Generating...
std::sort: 1930ms
mergesort_mt1: 1695ms
mergesort_mt2: 998ms
Generating...
std::sort: 1854ms
mergesort_mt1: 1573ms
mergesort_mt2: 1030ms
Generating...
std::sort: 1867ms
mergesort_mt1: 1584ms
mergesort_mt2: 1005ms
Generating...
std::sort: 1862ms
mergesort_mt1: 1589ms
mergesort_mt2: 1001ms
Generating...
std::sort: 1847ms
mergesort_mt1: 1578ms
mergesort_mt2: 1009ms

OK. Now we're seeing some impressive stuff. But can we squeeze even more out?

Parallel Merge+Standard Sort w/Limited Recursion

If you think about it, to fully exploit all that hard work std::sort was given, we can simply stop recursing once we reach full thread population. If that happens, just sort whatever we have with std::sort and merge things together when done. Hard as it is to believe, this will actually, reduce the code complexity. Our algorithm becomes one of simply spreading partitions across cores, each handled by std::sort when the time comes:

template<typename Iter>
void mergesort_mt3(Iter begin, Iter end,
                   unsigned int N = std::thread::hardware_concurrency()/2)
{
    auto len = std::distance(begin, end);
    if (len <= 1024 || N < 2)
    {
        std::sort(begin,end);
        return;
    }

    Iter mid = std::next(begin, len/2);
    auto fn = std::async(mergesort_mt3<Iter>, begin, mid, N-2);
    mergesort_mt3(mid, end, N-2);
    fn.wait();
    std::inplace_merge(begin, mid, end);
}

Once again, after adding this to our test loop...

Output

Generating...
std::sort: 1911ms
mergesort_mt1: 1656ms
mergesort_mt2: 1006ms
mergesort_mt3: 802ms
Generating...
std::sort: 1854ms
mergesort_mt1: 1588ms
mergesort_mt2: 1008ms
mergesort_mt3: 806ms
Generating...
std::sort: 1836ms
mergesort_mt1: 1580ms
mergesort_mt2: 1017ms
mergesort_mt3: 806ms
Generating...
std::sort: 1843ms
mergesort_mt1: 1583ms
mergesort_mt2: 1006ms
mergesort_mt3: 853ms
Generating...
std::sort: 1855ms
mergesort_mt1: 1589ms
mergesort_mt2: 1012ms
mergesort_mt3: 798ms

As written, for any partition that is 1024 items or smaller, we just delegate to std::sort. If the partitions are larger, we introduce a new thread to handle one side of the split partition, using the current thread to handle the other. Once we saturate the thread limit N, we stop splitting and simply delegate everything to std::sort no matter what. In short, we're a multi-thread distribution front-end to std::sort.

Summary

There are still-more bullets in the chamber that we could fire (using some meta-programming and assuming a fixed concurrency pool number), but that I leave to you.

You can dramatically increase your sorting performance if you simply focus on partitioning, distributing to threads until tapped, utilize a highly optimized sorting algorithm for the floor-partitions, then stitch things together to finish the job. Is there still room for improvement? Certainly. But in its simplest form presented above there is no locking, no mutexes, etc. The difference between the final sample and bare std::sort is a whopping 58% improvement on identical data sets on a puny little MacBook Air mid-2011 with 4gB RAM and a duo-core i7 processor. This is impressive, and considering how little code it took to do it, just-plain f'ing awesome.

回答2:

Thank you for the reply.

The mutex only protects the unsigned int workers (a global variable) that keeps track of how many threads was spawned. If a maximum is reached (given by maxthreads) then no more threads are generated. You achive this using the parameter N in mergesort_mt2.

How many cores did your machine have ?

Still the performance only seems to double...

来源：https://stackoverflow.com/questions/24130307/performance-problems-in-parallel-mergesort-c

标签

c++

multithreading

performance

recursion

mergesort