Malloc performance in a multithreaded environment

遥遥无期 2021-01-11 20:05

I've been running some experiments with the OpenMP framework and found some odd results that I'm not sure how to explain.

My goal is to create this huge matrix a

3 answers
  • 2021-01-11 20:27

    You are right about vector::resize() internally calling malloc. Implementation-wise malloc is fairly complicated. I can see multiple places where malloc can lead to contention in a multi-threaded environment.

    1. malloc probably keeps a global data structure in userspace to manage the user's heap address space. This global data structure needs to be protected against concurrent access and modification. Some allocators have optimizations that reduce how often this global data structure is accessed... I don't know how far Ubuntu's allocator has come in this respect.

    2. malloc allocates address space, not RAM. When you first touch the allocated memory, you go through a "soft page fault": a page fault that lets the OS kernel allocate the backing RAM for that address space. This can be expensive because of the trip to the kernel, and the kernel may need to take some global locks to access its own global RAM resource data structures.

    3. the user space allocator probably keeps some allocated space to give out new allocations from. However, once those allocations run out the allocator would need to go back to the kernel and allocate some more address space from the kernel. This is also expensive and would require a trip to the kernel and the kernel taking some global locks to access its global address space management related data structures.

    Bottom line: these interactions can be fairly complicated. If you are running into these bottlenecks, I would suggest that you simply "pre-allocate" your memory. This means allocating it and then touching all of it (all from a single thread), so that you can use that memory later from all your threads without running into lock contention at the user or kernel level.
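A minimal sketch of the pre-allocation idea, assuming the matrix is stored as a vector of rows. The names `Matrix`, `preallocate`, and `fill_parallel` and the fill formula are hypothetical and only for illustration. Constructing every row at its final size on one thread both allocates the memory and writes to it, so the parallel loop afterwards never calls `resize()`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical matrix type: one std::vector<double> per row.
using Matrix = std::vector<std::vector<double>>;

// Runs on the calling thread only. Each row is allocated at its final size
// and zero-filled, which also touches every page of the backing memory.
Matrix preallocate(std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0);
    return Matrix(rows, std::vector<double>(cols, 0.0));
}

// Fills the already allocated rows in parallel. No allocation happens inside
// the loop, so the threads do not compete for the allocator.
// The fill formula is an arbitrary placeholder.
void fill_parallel(Matrix& m) {
    const long rows = static_cast<long>(m.size());
    #pragma omp parallel for
    for (long i = 0; i < rows; ++i) {
        std::vector<double>& row = m[static_cast<std::size_t>(i)];
        for (std::size_t j = 0; j < row.size(); ++j) {
            row[j] = static_cast<double>(i) + 0.5 * static_cast<double>(j);
        }
    }
}
```

A caller would build the matrix once with `preallocate(rows, cols)` and then pass it to `fill_parallel`. If the compiler is invoked without OpenMP support, the pragma is ignored and the loop simply runs serially.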

  • 2021-01-11 20:36

    Memory allocators are definitely a possible contention point for multiple threads.

    Fundamentally, the heap is a shared data structure, since it is possible to allocate memory on one thread, and de-allocate it on another. In fact, your example does exactly that - the "resize" will free memory on each of the worker threads, which was initially allocated elsewhere.

    Typical implementations of malloc included with gcc and other compilers use a shared global lock and work reasonably well across threads if memory allocation pressure is relatively low. Above a certain allocation level, however, threads will begin to serialize on the lock, you'll get excessive context switching and cache thrashing, and performance will degrade. Your program is an example of something that is allocation-heavy, with an alloc + dealloc in the inner loop.
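One way to take the alloc + dealloc out of the inner loop is to give each thread a scratch buffer that it allocates once and reuses across iterations. The following is a sketch under assumptions, not the original poster's code: `run_with_reused_scratch` and the per-iteration arithmetic are hypothetical placeholders.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical workload: iteration i fills a scratch buffer with i * k and
// stores the sum of the buffer in results[i].
std::vector<double> run_with_reused_scratch(std::size_t tasks, std::size_t scratch_len) {
    assert(scratch_len > 0);
    std::vector<double> results(tasks, 0.0);
    const long n = static_cast<long>(tasks);

    #pragma omp parallel
    {
        // Declared inside the parallel region but outside the loop:
        // one allocation per thread instead of one per iteration.
        std::vector<double> scratch(scratch_len);

        #pragma omp for
        for (long i = 0; i < n; ++i) {
            for (std::size_t k = 0; k < scratch_len; ++k) {
                scratch[k] = static_cast<double>(i) * static_cast<double>(k);
            }
            double sum = 0.0;
            for (double v : scratch) {
                sum += v;
            }
            results[static_cast<std::size_t>(i)] = sum;
        }
    }
    return results;
}
```

The number of allocator calls now scales with the number of threads rather than with the number of loop iterations, which is usually what matters when the allocator serializes on a lock.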

    I'm surprised that an OpenMP-compatible compiler doesn't ship with a better threaded malloc implementation. They certainly exist - take a look at this question for a list.

  • 2021-01-11 20:46

    Technically, the STL vector uses std::allocator, which eventually calls new. new in turn calls libc's malloc (on your Linux system).
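The chain from `resize()` to `malloc` can be observed by replacing the global `operator new` with a counting version, which the C++ standard permits. This is only an illustrative sketch: the counter `g_new_calls` and the helper `new_calls_during_resize` are hypothetical names, the counter is not thread-safe, and an optimizing compiler is allowed to elide some allocations, so the count is best treated as indicative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical counter; a plain integer, so only meaningful single-threaded.
static std::size_t g_new_calls = 0;

// Replacement global operator new: counts the call, then forwards to malloc.
void* operator new(std::size_t size) {
    ++g_new_calls;
    if (void* p = std::malloc(size == 0 ? 1 : size)) {
        return p;
    }
    throw std::bad_alloc{};
}

// Matching replacements so memory obtained from malloc is released with free.
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Returns how many times operator new ran while v was resized.
std::size_t new_calls_during_resize(std::vector<double>& v, std::size_t count) {
    assert(count > v.capacity());  // ensure the resize has to allocate
    const std::size_t before = g_new_calls;
    v.resize(count);               // std::allocator -> operator new -> malloc
    return g_new_calls - before;
}
```

Calling `new_calls_during_resize` on an empty vector with a nonzero `count` should report at least one call, which is the allocation that ends up in `malloc`.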

    This malloc implementation is quite efficient as a general-purpose allocator and is thread-safe, but it is not scalable (the GNU libc malloc derives from Doug Lea's dlmalloc). There are numerous allocators and papers that improve upon dlmalloc to provide scalable allocation.

    I would suggest that you take a look at Hoard from Dr. Emery Berger, tcmalloc from Google, and the Intel Threading Building Blocks scalable allocator.
