Does accessing a single struct member pull the entire struct into the Cache?

Submitted by 笑着哭i on 2020-12-01 07:21:02

Question


I've been reading Ulrich Drepper's "What Every Programmer Should Know About Memory", and section 3.3.2, Measurements of Cache Effects (about halfway down the page), gives me the impression that accessing any member of a struct causes the whole struct to get pulled into the CPU cache.

Is this correct? If so, how does the hardware know about the layout of these structs? Or does the code generated by the compiler somehow force the entire struct to be loaded?

Or are the slowdowns from using larger structs primarily due to TLB misses caused by the structs being spread out across more memory pages?

The example struct used by Drepper is:

  struct l {
    struct l *n;
    long int pad[NPAD];
  };

Where sizeof(struct l) is determined by NPAD. With NPAD equal to 0, 7, 15, or 31, the padding adds 0, 56, 120, or 248 bytes (assuming 8-byte longs), so consecutive nodes are 8, 64, 128, or 256 bytes apart. The question assumes cache lines of 64 bytes and 4k pages.

Just iterating through the linked list gets significantly slower as the struct grows, even though nothing other than the pointer is actually being accessed.


Answer 1:


The hardware knows nothing at all about the struct. But it is true that the hardware loads into the cache some bytes around the bytes you are actually accessing. This is because the cache line has a size: it does not work byte by byte, but on, for example, 64 bytes at a time.

You have to be careful to order the members of the struct so that frequently used members are close to each other. For instance, if you have the following struct:

struct S {
  int foo;
  char name[64];
  int bar;
};

If the members foo and bar are used very often, the hardware will load into the cache the bytes around foo, and when you access bar, it will have to load the bytes around bar as well, even if the bytes around foo and around bar are never used. Now rewrite the struct as follows:

struct S {
  int foo;
  int bar;
  char name[64];
};

When you use foo, the hardware will load into the cache the bytes around foo. When you then use bar, it will already be in the cache, because bar is contained in the bytes around foo. The CPU won't have to wait for bar to arrive in the cache.

So the answer is: accessing a single struct member does not pull the entire struct into the cache, but it does pull some other members of the struct into the cache.




Answer 2:


The hardware doesn't know the layout of the struct, but just loads a number of bytes around the accessed member into cache. And yes, the slowdown from larger structs is because they will then be spread across more cache lines.




Answer 3:


Accessing a struct member causes no more of a performance penalty than accessing any other area in memory. In fact, there may be a performance improvement if you access several struct members in the same area, since other members might be cached by the first access.




Answer 4:


Usually, the L1 cache is indexed with virtual addresses. If you access a member of a struct, a fixed number of bytes is brought into the cache (one cache line, usually between 8 and 512 bytes depending on the CPU). Since all struct members sit side by side in memory, the chance that the whole structure ends up in the cache is fairly high (it depends on sizeof(struct your_struct))...




Answer 5:


While the CPU can happily deal with loads and stores as small as one byte, caches only ever deal with "cacheline" sized data. In computer architecture textbooks this is also known as the "block size."

On most systems this is 32 or 64 bytes. It can be different from one CPU to the next, and even sometimes from one cache level to the next.

Additionally, some CPUs perform speculative prefetching; this means if you access cachelines 5 and 6 in sequence, it will attempt to load cacheline 7 without you asking for it.




Answer 6:


"Just iterating through the linked list gets significantly slower as the struct grows, even though nothing other than the pointer is actually being accessed."

With NPAD = 0, each cache line contains 8 list nodes, so you can see why that's fastest.

With NPAD = 7, 15, or 31, only one cache line needs to be loaded for each list node, and you might expect them all to be the same speed: one cache miss per node. But a modern memory subsystem will be doing speculative prefetching. If it has spare capacity (which it probably does, because modern memory can serve multiple reads from main memory in parallel), it will start loading memory close to the memory you're using. Although it's a linked list, if you constructed it in any of the obvious ways there is a good chance that you're accessing memory in sequence. So the closer together in memory your list nodes are, the more likely the cache is to already have what you need.

In the worst possible scenario, where your memory is being pulled in from swap as you use it, your program would be limited by disk I/O. It's possible that your rate of progress through the list would be entirely determined by how many nodes there are per page, and you might see time taken being directly proportional to the size of the node, up to 4k. I haven't tried it, though, and the OS will be clever with swap just as the MMU is clever with main memory, so it's not necessarily that simple.



Source: https://stackoverflow.com/questions/1940782/does-accessing-a-single-struct-member-pull-the-entire-struct-into-the-cache
