OpenMP performance

伪装坚强ぢ 2020-12-13 14:33

Firstly, I know this [type of] question is frequently asked, so let me preface this by saying I've read as much as I can, and I still don't know what the deal is.

3 Answers
  • 2020-12-13 15:07

    It's hard to know for sure what is happening without significant profiling, but the performance curve seems indicative of False Sharing...

    Threads use different objects, but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time.

    There's a great article on the topic at Dr. Dobb's:

    http://www.drdobbs.com/go-parallel/article/217500206?pgno=1

    In particular the fact that the routines are doing a lot of malloc/free could lead to this.

    One solution is to use a pool based memory allocator rather than the default allocator so that each thread tends to allocate memory from a different physical address range.
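    The padding trick behind such allocators can be sketched in plain C. This is a minimal illustration, not the original code, and the 64-byte cache-line size is an assumption (common on x86, but check your platform):

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Assumed cache-line size; 64 bytes is typical on x86, but verify
     * with e.g. sysconf(_SC_LEVEL1_DCACHE_LINESIZE) on your machine. */
    #define CACHE_LINE 64

    /* One counter per thread. The pad forces each slot onto its own
     * cache line, so one thread's writes never invalidate the line
     * holding another thread's counter. */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    int main(void) {
        struct padded_counter counters[4] = {{0}};
        /* Adjacent slots are now a full cache line apart. */
        ptrdiff_t stride = (char *)&counters[1] - (char *)&counters[0];
        assert(stride >= CACHE_LINE);
        assert(sizeof(struct padded_counter) == CACHE_LINE);
        return 0;
    }
    ```

    A pool allocator achieves the same separation implicitly, by handing each thread blocks from a different physical address range.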

  • 2020-12-13 15:08

    Since the threads actually don’t interact, you could just change the code to multiprocessing. You would only have message passing in the end and it would be guaranteed that the threads don’t need to synchronize anything.

    Here's Python 3.2 code which basically does that (you'll likely not want to do it in pure Python for performance reasons; instead, put the for-loop into a C function and bind it via Cython. You'll see from the code why I show it in Python anyway):

    from concurrent import futures
    from my_cython_module import huge_function  # the expensive per-item routine

    parameters = range(ntest)  # ntest: number of independent calls, as in the question
    with futures.ProcessPoolExecutor(4) as e:   # 4 worker processes, no shared state
        results = e.map(huge_function, parameters)
        shared_array = list(results)            # gather the results in input order
    

    That's it. To scale to a cluster, increase the number of processes to the number of jobs you can run, and let each process just submit and monitor one job.

    Huge functions without interaction and small input values almost call out for multiprocessing. And as soon as you have that, switching up to MPI (with almost unlimited scaling) is not too hard.

    From the technical side, AFAIK context switches in Linux are quite expensive (monolithic kernel with much kernel-space memory), while they are much cheaper on OSX or the Hurd (Mach microkernel). That might explain the huge amount of system time you see on Linux but not on OSX.

  • 2020-12-13 15:31

    So after some fairly extensive profiling (thanks to this great post for info on gprof and time sampling with gdb), which involved writing a big wrapper function to generate production-level code for profiling, it became obvious that whenever I aborted the running code with gdb and ran backtrace, the stack was almost always in an STL <vector> call, manipulating a vector in some way.

    The code passes a few vectors into the parallel section as private variables, which seemed to work fine. However, after pulling out all the vectors and replacing them with arrays (plus some other jiggery-pokery to make that work), I saw a significant speed-up. With small, artificial data sets the speed-up is near perfect (i.e. as you double the number of threads, you halve the time), while with real data sets it isn't quite as good, which makes complete sense given how the code works.
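    The essence of that change can be sketched in C (thread count, buffer size, and names are illustrative, not from the original code): do every heap allocation once, before the parallel region, and hand each thread a fixed scratch array, so the hot loop never touches the allocator the way a growing vector can.

    ```c
    #include <assert.h>
    #include <stdlib.h>

    #define NTHREADS 4     /* illustrative thread count */
    #define BUF_LEN  1024  /* illustrative per-thread scratch size */

    int main(void) {
        /* Allocate every per-thread buffer up front, outside any
         * parallel region, instead of resizing vectors inside it. */
        double *scratch[NTHREADS];
        for (int t = 0; t < NTHREADS; t++) {
            scratch[t] = calloc(BUF_LEN, sizeof(double));
            assert(scratch[t] != NULL);
        }

        /* Inside the OpenMP loop, thread t would write only to
         * scratch[t], e.g. scratch[omp_get_thread_num()][i] = ...;
         * calloc also guarantees the buffers start zeroed. */
        assert(scratch[0][0] == 0.0);

        for (int t = 0; t < NTHREADS; t++)
            free(scratch[t]);
        return 0;
    }
    ```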

    It seems that, for whatever reason (maybe some static or global variables deep in the STL <vector> implementation?), when loops run through hundreds of thousands of iterations in parallel there is some deep-level locking, which occurs on Linux (Ubuntu 12.01 and CentOS 6.2) but not on OSX.

    I'm really intrigued as to why I see this difference. Could it be a difference in how the STL is implemented (the OSX version was compiled under GNU GCC 4.7, as were the Linux ones), or is it to do with context switching (as suggested by Arne Babenhauserheide)?

    In summary, my debugging process was as follows:

    • Initial profiling from within R to identify the issue

    • Ensured there were no static variables acting as shared variables

    • Profiled with strace -f and ltrace -f, which was really helpful in identifying locking as the culprit

    • Profiled with valgrind to look for any errors

    • Tried a variety of combinations for the schedule type (auto, guided, static, dynamic) and chunk size.

    • Tried binding threads to specific processors

    • Avoided false sharing by creating thread-local buffers for values, and then implementing a single synchronization event at the end of the for-loop

    • Removed all the mallocing and freeing from within the parallel region - didn't help with the issue but did provide a small general speedup

    • Tried on various architectures and OSes - didn't really help in the end, but did show that this was a Linux vs. OSX issue and not a supercomputer vs. desktop one

    • Built a version which implements concurrency using a fork() call, splitting the workload between two processes. This halved the time on both OSX and Linux, which was good

    • Built a data simulator to replicate production data loads

    • gprof profiling

    • gdb time sampling profiling (abort and backtrace)

    • Commented out vector operations

    • Had this not worked, Arne Babenhauserheide's link looks like it may well have some crucial stuff on memory fragmentation issues with OpenMP
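    The thread-local-buffer step from the list above can be sketched as a hand-rolled OpenMP reduction (the summing loop is a stand-in for the real per-iteration work). Compiled without -fopenmp the pragmas are simply ignored and the loop runs serially, with the same result:

    ```c
    #include <assert.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    #define N 100000

    int main(void) {
        long total = 0;
        #pragma omp parallel
        {
            long local = 0;          /* thread-private buffer: no shared
                                      * writes inside the hot loop */
            #pragma omp for nowait
            for (int i = 0; i < N; i++)
                local += i;          /* stand-in for the real work */

            #pragma omp critical     /* single synchronization event per
                                      * thread, at the end of the loop */
            total += local;
        }
        /* Sum of 0..N-1 */
        assert(total == (long)N * (N - 1) / 2);
        return 0;
    }
    ```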
