In the Linux kernel, why do many structures use the ____cacheline_aligned_in_smp
macro? Does it improve performance when accessing the structure? If so, how?
On x86, each cache line in any cache (dcache or icache) is 64 bytes. Cache alignment is used to avoid false sharing of cache lines. False sharing occurs when a cache line is shared between unrelated global variables (which happens frequently in the kernel): when one processor modifies one of those variables in its cache, the whole line is marked dirty, and the copies of that line in the other CPUs' caches become stale and must be invalidated and re-fetched from memory. This causes extra cache misses, which cost additional CPU cycles and reduce the performance of the system. Note that this concerns global (statically allocated) variables; many kernel data structures use this macro to avoid such cache misses.
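As a minimal sketch of the effect (user-space C with GCC's aligned attribute; the hard-coded 64 is the x86 line size assumed above, not a kernel constant):

/* Without alignment these two counters typically land in the same
 * 64-byte cache line; a writer on CPU0 and a writer on CPU1 then
 * invalidate each other's copy of the line on every store
 * (false sharing). */
long counter_a;
long counter_b;

/* Giving each counter its own line removes the contention. The
 * kernel would use ____cacheline_aligned_in_smp rather than a
 * hard-coded 64. */
long counter_c __attribute__((aligned(64)));
long counter_d __attribute__((aligned(64)));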
____cacheline_aligned instructs the compiler to instantiate a struct or variable at an address corresponding to the beginning of an L1 cache line for the specific architecture, i.e., so that it is L1 cache-line aligned. ____cacheline_aligned_in_smp is similar, but actually produces L1 cache-line alignment only when the kernel is compiled in SMP configuration (i.e., with option CONFIG_SMP). Both are defined in include/linux/cache.h.
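For reference, the definitions in include/linux/cache.h look roughly like this (abridged; the exact guards vary between kernel versions):

/* include/linux/cache.h (abridged) */
#ifndef ____cacheline_aligned
#define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))
#endif

#ifndef ____cacheline_aligned_in_smp
#ifdef CONFIG_SMP
#define ____cacheline_aligned_in_smp ____cacheline_aligned
#else
#define ____cacheline_aligned_in_smp
#endif
#endif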
These definitions are useful for variables (and data structures) that are not allocated dynamically via some allocator, but are global, compiler-allocated variables (a similar effect can be achieved with dynamic memory allocators that support allocation at a specific alignment).
The reason for cache-line-aligned variables is to manage the cache-to-cache transfers of these variables, performed by the hardware cache-coherence mechanisms in SMP systems, so that their movement does not occur implicitly when other variables are moved. This matters for performance-critical code, where one expects contention in the access of variables by multiple cpus (cores). The usual problem one tries to avoid here is false sharing.
A variable's memory starting at the beginning of a cache line is half the work for this purpose; one also needs to "pack with it" only variables that should move together. An example is an array of variables, where each element of the array is to be accessed by only one cpu (core):
struct my_data {
        long int a;
        int b;
} ____cacheline_aligned_in_smp cpu_data[NR_CPUS];
With this kind of definition, the compiler (in an SMP configuration of the kernel) will place each cpu's struct at a cache-line boundary, implicitly allocating extra padding after each struct so that the next cpu's struct also begins at a cache-line boundary.
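Concretely, in an SMP build, and assuming an 8-byte long, a 4-byte int and 64-byte L1 lines (as on x86-64), the alignment rounds the struct's size up to a whole line:

/* 12 bytes of fields are padded out to 64, so each cpu_data[i]
 * starts on its own cache line (type sizes assumed as above). */
_Static_assert(sizeof(struct my_data) == 64,
               "one full cache line per array element");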
An alternative is to pad the data structure with a cache line's size of dummy, unused bytes:
struct my_data {
        long int a;
        int b;
        char dummy[L1_CACHE_BYTES];
} cpu_data[NR_CPUS];
In this case, only the dummy, unused data will be moved unintentionally; the data actually accessed by each cpu will only move from cache to memory, and vice versa, due to cache capacity misses.
Linux manages the CPU cache in a very similar fashion to the TLB. CPU caches, like TLB caches, take advantage of the fact that programs tend to exhibit a locality of reference. To avoid having to fetch data from main memory for each reference, the CPU will instead cache very small amounts of data in the CPU cache. Frequently there are two levels, called the Level 1 and Level 2 CPU caches. The Level 2 CPU caches are larger but slower than the L1 cache, but Linux concerns itself only with the Level 1 or L1 cache.
CPU caches are organised into lines. Each line is typically quite small, usually 32 bytes, and each line is aligned to its boundary size. In other words, a cache line of 32 bytes will be aligned on a 32-byte address. With Linux, the size of the line is L1_CACHE_BYTES, which is defined by each architecture.
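On x86, for instance, the per-architecture definition looks roughly like this (abridged from arch/x86/include/asm/cache.h; the shift value comes from the kernel configuration):

/* arch/x86/include/asm/cache.h (abridged) */
#define L1_CACHE_SHIFT  (CONFIG_X86_L1_CACHE_SHIFT)
#define L1_CACHE_BYTES  (1 << L1_CACHE_SHIFT)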
How addresses are mapped to cache lines varies between architectures, but the mappings come under three headings: direct mapping, associative mapping and set associative mapping. Direct mapping is the simplest approach, where each block of memory maps to only one possible cache line. With associative mapping, any block of memory can map to any cache line. Set associative mapping is a hybrid approach where any block of memory can map to any line, but only within a subset of the available lines.
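As a sketch of how an address decomposes under set-associative mapping (the cache geometry below, 32 KiB, 8-way, 64-byte lines, is an assumption chosen purely for illustration):

#define LINE_BYTES  64                     /* assumed line size */
#define NUM_WAYS    8                      /* assumed associativity */
#define CACHE_SIZE  (32 * 1024)            /* assumed total size */
#define NUM_SETS    (CACHE_SIZE / (NUM_WAYS * LINE_BYTES))  /* = 64 */

/* A memory block may occupy any of NUM_WAYS lines, but only within
 * the one set selected by these bits of its address. */
static inline unsigned long cache_set_index(unsigned long addr)
{
        return (addr / LINE_BYTES) % NUM_SETS;
}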
Regardless of the mapping scheme, they each have one thing in common: addresses that are close together and aligned to the cache size are likely to use different lines. Hence Linux employs simple tricks to try to maximise cache usage:
- Frequently accessed structure fields are placed at the start of the structure to increase the chance that only one line is needed to address the common fields;
- Unrelated items in a structure should be at least cache-size bytes apart to avoid false sharing between CPUs (see the sketch after this list);
- Objects in the general caches, such as the mm_struct cache, are aligned to the L1 CPU cache to avoid false sharing.
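A hypothetical struct illustrating the first two tricks (the struct and its fields are invented for illustration; applying ____cacheline_aligned_in_smp to a member starts it on a fresh cache line in SMP builds):

#include <linux/cache.h>
#include <linux/spinlock.h>

struct request_stats {
        /* Hot fields, accessed together on every request: kept at
         * the front so a single cache line covers both. */
        unsigned long hits;
        unsigned long misses;

        /* Unrelated and contended by other CPUs: placed on its own
         * cache line so updates to it do not bounce the line that
         * holds the hot counters above. */
        spinlock_t lock ____cacheline_aligned_in_smp;
};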
If the CPU references an address that is not in the cache, a cache miss occurs and the data is fetched from main memory. The cost of cache misses is quite high, as a reference to cache can typically be performed in less than 10ns, whereas a reference to main memory will typically cost between 100ns and 200ns. The basic objective is then to have as many cache hits and as few cache misses as possible.
Just as some architectures do not automatically manage their TLBs, some do not automatically manage their CPU caches. The hooks are placed in locations where the virtual-to-physical mapping changes, such as during a page table update. The CPU cache flushes should always take place first, as some CPUs require the virtual-to-physical mapping to exist when a virtual address is being flushed from the cache.
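The hooks in question are the per-architecture cache-flush functions (as listed in Documentation/cachetlb.txt; the exact signatures vary somewhat across kernel versions, and on cache-coherent architectures they compile to no-ops):

/* Flush the entire CPU cache. */
void flush_cache_all(void);

/* Flush all cached user-space entries for one address space. */
void flush_cache_mm(struct mm_struct *mm);

/* Flush the cached entries for a range of a user address space. */
void flush_cache_range(struct vm_area_struct *vma,
                       unsigned long start, unsigned long end);

/* Flush the cached entry for a single user page. */
void flush_cache_page(struct vm_area_struct *vma,
                      unsigned long vmaddr, unsigned long pfn);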
Source: https://stackoverflow.com/questions/25947962/cacheline-aligned-in-smp-for-structure-in-the-linux-kernel