Multiple threads and CPU cache

Asked 2020-12-02 18:05 by 清歌不尽

I am implementing an image filtering operation in C using multiple threads, making it as optimized as possible. I have one question though: if a memory location is accessed by thread-0, and the same memory is concurrently accessed by thread-1, will it be served from the cache? This question stems from the possibility that the two threads could be running on two different cores of the CPU, so another way to put it is: do all the cores share the same common cache memory? Suppose the threads fill a shared int output[] array: is it better to split the array into two contiguous halves, one per thread, or to interleave it so that thread-0 handles the even indices and thread-1 the odd ones?

4 Answers
  • 2020-12-02 18:12

    The answer to this question strongly depends upon the architecture and the cache level, along with where the threads are actually running.

    For example, recent Intel multi-core CPUs have L1 caches that are per-core and an L2 cache that is shared among cores in the same CPU package; different CPU packages, however, have their own L2 caches.

    Even when your threads are running on two cores within the same package, though, if both threads access data within the same cache line, that cache line will bounce between the two L1 caches. This is very inefficient, and you should design your algorithm to avoid this situation.


    A few comments have asked how to avoid this problem.

    At heart, it's not particularly complicated: you want to prevent two threads from simultaneously accessing data that lies on the same cache line when at least one of them is writing. (As long as all the threads are only reading the data, there's no problem; on most architectures, read-only data can be present in multiple caches.)

    To do this, you need to know the cache line size. This varies by architecture, but currently most x86 and x86-64 family chips use a 64-byte cache line (consult your architecture manual for other architectures). You will also need to know the size of your data structures.
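
    Rather than hard-coding 64, you can query the line size at run time on Linux with glibc; a minimal sketch (note that _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, not plain POSIX):

        #include <stdio.h>
        #include <unistd.h>

        int main(void) {
            /* Ask glibc for the L1 data cache line size; some systems
             * report 0 or -1 when the value is unknown. */
            long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
            if (line <= 0)
                line = 64;   /* fall back to the common x86/x86-64 value */
            printf("cache line size: %ld bytes\n", line);
            return 0;
        }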

    If you ask your compiler to align the shared data structure of interest to a 64-byte boundary (for example, your array output), then you know that it will start at the beginning of a cache line, and you can also calculate where the subsequent cache line boundaries are. If your int is 4 bytes, then each cache line will contain exactly 8 int values. As long as the array starts on a cache line boundary, output[0] through output[7] will be on one cache line, and output[8] through output[15] on the next. In this case, you would design your algorithm such that each thread works on a block of adjacent int values whose count is a multiple of 8.
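
    To make this concrete, here is a minimal sketch of that layout, assuming 64-byte lines, 4-byte int, pthreads, and a hypothetical filter_pixel() standing in for the real filter kernel; each thread's block is rounded up to a whole number of cache lines so no two threads ever write within the same line:

        #include <pthread.h>
        #include <stddef.h>

        #define CACHE_LINE    64
        #define INTS_PER_LINE (CACHE_LINE / sizeof(int))  /* 8 when int is 4 bytes */
        #define N             4096
        #define NTHREADS      4

        /* Align the shared output so output[0] starts a cache line
         * (GCC/Clang syntax; C11 _Alignas(64) works too). */
        static int output[N] __attribute__((aligned(CACHE_LINE)));

        /* Hypothetical stand-in for the real per-pixel filter kernel. */
        static int filter_pixel(int i) { return i * 2; }

        struct range { size_t begin, end; };

        static void *worker(void *arg) {
            struct range *r = arg;
            for (size_t i = r->begin; i < r->end; i++)
                output[i] = filter_pixel((int)i);
            return NULL;
        }

        int main(void) {
            pthread_t tid[NTHREADS];
            struct range r[NTHREADS];
            /* Round each thread's block up to whole cache lines so that no
             * two threads ever write within the same 64-byte line. */
            size_t lines = (N / INTS_PER_LINE + NTHREADS - 1) / NTHREADS;
            size_t block = lines * INTS_PER_LINE;
            for (int t = 0; t < NTHREADS; t++) {
                r[t].begin = t * block < N ? t * block : N;
                r[t].end   = (t + 1) * block < N ? (t + 1) * block : N;
                pthread_create(&tid[t], NULL, worker, &r[t]);
            }
            for (int t = 0; t < NTHREADS; t++)
                pthread_join(tid[t], NULL);
            return 0;
        }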

    If you are storing complicated struct types rather than plain int, the pahole utility will be of use. It analyses the struct types in your compiled binary and shows you the layout (including padding) and total size. You can then adjust your structs using this output; for example, you may want to manually add some padding so that your struct's size is a multiple of the cache line size.
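
    As an illustration, here is a hypothetical per-thread struct padded out to a full cache line, with a C11 compile-time check that keeps the padding arithmetic honest if fields are later added (pahole would show the same layout when run over the binary):

        #include <assert.h>   /* C11 static_assert */

        #define CACHE_LINE 64

        /* Hypothetical per-thread bookkeeping struct, padded so adjacent
         * array elements never share a cache line. */
        struct per_thread_state {
            long hits;
            long misses;
            char pad[CACHE_LINE - 2 * sizeof(long)];
        };

        static_assert(sizeof(struct per_thread_state) % CACHE_LINE == 0,
                      "per_thread_state must be a multiple of the cache line size");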

    On POSIX systems, the posix_memalign() function is useful for allocating a block of memory with a specified alignment.
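
    A short sketch of that call; the alignment argument must be a power of two and a multiple of sizeof(void *):

        #define _POSIX_C_SOURCE 200112L
        #include <stdio.h>
        #include <stdlib.h>

        int main(void) {
            void  *buf = NULL;
            size_t n   = 4096 * sizeof(int);
            /* Request a 64-byte-aligned block; returns an errno-style code. */
            int err = posix_memalign(&buf, 64, n);
            if (err != 0) {
                fprintf(stderr, "posix_memalign failed: %d\n", err);
                return 1;
            }
            int *output = buf;   /* output[0] now starts on a cache line boundary */
            output[0] = 42;
            free(buf);
            return 0;
        }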

  • 2020-12-02 18:16

    In general it is a bad idea to give threads overlapping memory regions, as in one thread processing 0, 2, 4, ... while the other processes 1, 3, 5, .... Although some architectures may tolerate this, most will not, and you usually cannot control which machines your code will run on. The OS is also free to assign your threads to any cores it likes (a single one, two on the same physical processor, or two cores on separate processors), and each core usually has a separate first-level cache even when it sits on the same physical processor.

    In most situations the 0, 2, 4... / 1, 3, 5... split slows performance down severely, possibly to the point of being slower than a single CPU. Herb Sutter's "Eliminate False Sharing" demonstrates this very well.

    Using the scheme [0 ... n/2-1] and [n/2 ... n] will scale much better on most systems. It may even lead to super-linear performance, since the cache sizes of all CPUs combined can potentially be used. The number of threads should always be configurable and should default to the number of processor cores found. A rough benchmark of the two schemes follows below.
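
    This sketch times both partitionings (sizes and repeat counts are arbitrary); on typical hardware the interleaved variant loses badly to the contiguous one because of false sharing. Build with cc -O2 -pthread:

        #define _POSIX_C_SOURCE 200809L
        #include <pthread.h>
        #include <stdio.h>
        #include <time.h>

        #define N        (1 << 20)
        #define REPEAT   200
        #define NTHREADS 2

        static int data[N];

        struct job { int tid; int interleaved; };

        static void *worker(void *arg) {
            struct job *j = arg;
            if (j->interleaved) {
                /* Thread 0 touches even indices, thread 1 odd ones: the two
                 * threads keep writing into the same cache lines. */
                for (int rep = 0; rep < REPEAT; rep++)
                    for (int i = j->tid; i < N; i += NTHREADS)
                        data[i]++;
            } else {
                /* Contiguous halves: each thread stays in its own lines. */
                int begin = j->tid * (N / NTHREADS), end = begin + N / NTHREADS;
                for (int rep = 0; rep < REPEAT; rep++)
                    for (int i = begin; i < end; i++)
                        data[i]++;
            }
            return NULL;
        }

        static double run(int interleaved) {
            pthread_t tid[NTHREADS];
            struct job jobs[NTHREADS];
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int t = 0; t < NTHREADS; t++) {
                jobs[t] = (struct job){ t, interleaved };
                pthread_create(&tid[t], NULL, worker, &jobs[t]);
            }
            for (int t = 0; t < NTHREADS; t++)
                pthread_join(tid[t], NULL);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        }

        int main(void) {
            printf("interleaved: %.3f s\n", run(1));
            printf("contiguous:  %.3f s\n", run(0));
            return 0;
        }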

  • 2020-12-02 18:22

    I might be mistaken, but whether a core's cache is shared or not depends on the implementation of the CPU. You'd have to look up the technical sheets on the manufacturer's page to check whether each core in your CPU has its own cache or whether the cache is shared.

    I was working on image manipulation as well, for a security company, and sometimes we got corrupted images after running batch operations on threads. After long investigations we came to the conclusion that the cache was shared between CPU cores and that, in rare cases, the data was being overwritten or replaced with incorrect data.

    Whether this is something to take into account or rather a rare event, I cannot answer.

  • 2020-12-02 18:30

    Intel documentation

    Intel publishes per-generation datasheets which may contain this kind of information.

    For example, for the i5-3210M processor I had in my older computer, I looked up the 3rd-generation Datasheet Volume 1. Section 3.3, "Intel Hyper-Threading Technology (Intel HT Technology)", says:

    The processor supports Intel Hyper-Threading Technology (Intel HT Technology) that allows an execution core to function as two logical processors. While some execution resources such as caches, execution units, and buses are shared, each logical processor has its own architectural state with its own set of general-purpose registers and control registers.

    which confirms that caches are shared between the two hyperthreads of a core for that generation of CPUs.
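
    Besides the datasheets, on Linux you can ask the kernel directly which logical CPUs share each cache; a small sketch reading the sysfs cache topology of cpu0 (the paths are Linux-specific):

        #include <stdio.h>
        #include <string.h>

        /* Read the first line of a sysfs file into buf, stripping the
         * newline. Returns 0 on success, -1 if the file does not exist. */
        static int read_line(const char *path, char *buf, size_t len) {
            FILE *f = fopen(path, "r");
            if (!f) return -1;
            if (!fgets(buf, (int)len, f)) buf[0] = '\0';
            fclose(f);
            buf[strcspn(buf, "\n")] = '\0';
            return 0;
        }

        int main(void) {
            for (int idx = 0; ; idx++) {
                char base[80], path[128];
                char level[16] = "?", type[32] = "?", shared[128] = "?";
                snprintf(base, sizeof base,
                         "/sys/devices/system/cpu/cpu0/cache/index%d", idx);
                snprintf(path, sizeof path, "%s/level", base);
                if (read_line(path, level, sizeof level) != 0)
                    break;                          /* no more cache levels */
                snprintf(path, sizeof path, "%s/type", base);
                read_line(path, type, sizeof type);
                snprintf(path, sizeof path, "%s/shared_cpu_list", base);
                read_line(path, shared, sizeof shared);
                printf("L%s %s cache of cpu0 shared with CPUs %s\n",
                       level, type, shared);
            }
            return 0;
        }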

    See also:

    • similar question for cache sharing across cores: How are cache memories shared in multicore Intel CPUs?
    • further analysis of threads vs cores: https://superuser.com/questions/133082/what-is-the-difference-between-hyper-threading-and-multiple-cores/995858#995858
    • the architecture spec itself also has a section about the sharing of certain resources that must be valid across all implementations, although it does not mention caches: What does multicore assembly language look like?