How do non temporal instructions work?

Question

I'm reading the PDF of "What Every Programmer Should Know About Memory" by Ulrich Drepper. At the beginning of part 6 there's a code fragment:

#include <emmintrin.h>
void setbytes(char *p, int c)
{
    __m128i i = _mm_set_epi8(c, c, c, c,
    c, c, c, c,
    c, c, c, c,
    c, c, c, c);
    _mm_stream_si128((__m128i *)&p[0], i);
    _mm_stream_si128((__m128i *)&p[16], i);
    _mm_stream_si128((__m128i *)&p[32], i);
    _mm_stream_si128((__m128i *)&p[48], i);
}

With such a comment right below it:

Assuming the pointer p is appropriately aligned, a call to this function will set all bytes of the addressed cache line to c. The write-combining logic will see the four generated movntdq instructions and only issue the write command for the memory once the last instruction has been executed. To summarize, this code sequence not only avoids reading the cache line before it is written, it also avoids polluting the cache with data which might not be needed soon.

What bugs me is that the comment on the function says it "will set all bytes of the addressed cache line to c", but from what I understand of the stream intrinsics, they bypass the caches: there is neither cache reading nor cache writing. How would this code access any cache line? A second fragment of the comment says something similar, that the function "avoids reading the cache line before it is written". As stated above, I don't see how and when the caches are written to. Also, does any write to cache need to be preceded by a cache read? Could someone clarify this for me?

Answer 1:

When you write to memory, the cache line where you write must first be loaded into the caches in case you only write the cache line partially.

When you write to memory, stores are grouped in store buffers. Typically, once a buffer is full, it is flushed to the caches/memory. Note that the number of store buffers is typically small (~4). Writes to consecutive addresses will use the same store buffer.

The streaming read/write with non-temporal hints are typically used to reduce cache pollution (often with WC memory). The idea is that a small set of cache lines are reserved on the CPU for these instructions to use. Instead of loading a cache line into the main caches, it is loaded into this smaller cache.

The comment assumes the following behavior (I cannot find any reference confirming that the hardware actually does this; one would need to measure it or find a solid source, and it could vary from hardware to hardware): once the CPU sees that the store buffer is full and aligned to a cache line, it flushes it directly to memory, since the non-temporal write bypasses the main cache.

The only way this would work is if the merging of the store buffer with the actual cache line written happens once it is flushed. This is a fair assumption.

Note that if the cache line written is already in the main caches, the above method will also update them.

If regular memory writes were used instead of non-temporal writes, the store buffer flushing would also update the main caches. It is entirely possible that this scenario would also avoid reading the original cache line in memory.
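For comparison, here is a minimal sketch of the same cache-line fill using regular aligned stores (`_mm_store_si128`) instead of non-temporal ones. The name `setbytes_regular` is hypothetical and a 64-byte cache line is assumed. The bytes that end up in memory are identical; the only difference is how the line travels through the cache hierarchy, and nothing in this code can guarantee that the hardware skips the initial read.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_store_si128 */
#include <stdint.h>

/* Hypothetical counterpart to setbytes(): ordinary (temporal) stores.
   The written line is expected to end up in the regular caches. */
void setbytes_regular(char *p, int c)
{
    /* Assumption: 64-byte cache lines; p must point at the start of one. */
    assert(((uintptr_t)p % 64) == 0);

    __m128i v = _mm_set1_epi8((char)c);  /* broadcast c to all 16 bytes */
    _mm_store_si128((__m128i *)&p[0],  v);
    _mm_store_si128((__m128i *)&p[16], v);
    _mm_store_si128((__m128i *)&p[32], v);
    _mm_store_si128((__m128i *)&p[48], v);
}
```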

If a partial cache line is written with a non-temporal write, presumably the cache line will need to be fetched from main memory (or the main cache if present) and could be terribly slow if we have not read the cache line ahead of time with a regular read or non-temporal read (which would place it into our separate cache).

Typically the non-temporal cache size is on the order of 4-8 cache lines.

To summarize, the last instruction kicks in the write because it also happens to fill up the store buffer. The store buffer flush can avoid reading the cache line written to because the hardware knows the store buffer is contiguous and aligned to a cache line. The non-temporal write hint only serves to avoid populating the main cache with our written cache line IF and only IF it wasn't already in the main caches.
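The pattern summarized above can be sketched as a loop that always writes whole, aligned cache lines with non-temporal stores, so that a write-combining buffer never has to be flushed partially filled. This is only an illustration under stated assumptions: `fill_lines_nt` and `CACHE_LINE` are hypothetical names, and 64 bytes is assumed as the line size. The trailing `_mm_sfence()` is not in the original fragment; it is added because non-temporal stores are weakly ordered, and the fence makes them globally visible before any later stores.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_stream_si128; also provides _mm_sfence */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size, typical for current x86 */

/* Hypothetical helper: set `lines` whole cache lines starting at p to c. */
void fill_lines_nt(char *p, size_t lines, int c)
{
    /* p must be line-aligned so every group of four 16-byte stores
       covers exactly one cache line. */
    assert(((uintptr_t)p % CACHE_LINE) == 0);

    __m128i v = _mm_set1_epi8((char)c);  /* broadcast c to all 16 bytes */
    for (size_t i = 0; i < lines; i++, p += CACHE_LINE) {
        _mm_stream_si128((__m128i *)(p +  0), v);
        _mm_stream_si128((__m128i *)(p + 16), v);
        _mm_stream_si128((__m128i *)(p + 32), v);
        _mm_stream_si128((__m128i *)(p + 48), v);
    }
    /* Order the weakly-ordered non-temporal stores before later stores. */
    _mm_sfence();
}
```

A caller would pass a 64-byte-aligned buffer (for example one obtained from `aligned_alloc(64, n)`) and a line count. Whether this beats regular stores depends on the hardware and on whether the data is reused soon, so it is worth measuring.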

Answer 2:

I think this is partly a terminology question: The passage you quote from Ulrich Drepper's article isn't talking about cached data. It's just using the term "cache line" for an aligned 64B block.

This is normal, and especially useful when talking about a range of hardware with different cache-line sizes. (Earlier x86 CPUs, as recently as PIII, had 32B cache lines, so using this terminology avoids hard-coding that microarch design decision into the discussion.)

A cache-line of data is still a cache-line even if it's not currently hot in any caches.
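To make the "aligned 64B block" reading concrete, here is a small sketch that maps an address to the cache line containing it, independent of whether that line is currently cached. The helper names (`line_base`, `line_offset`, `same_line`) and the `LINE_SIZE` constant are hypothetical, and 64 bytes is an assumption that would have to change for hardware with a different line size.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SIZE 64u   /* assumed cache-line size in bytes */

static_assert((LINE_SIZE & (LINE_SIZE - 1)) == 0,
              "line size must be a power of two for the masking below");

/* Address of the first byte of the cache line containing p. */
static uintptr_t line_base(const void *p)
{
    return (uintptr_t)p & ~(uintptr_t)(LINE_SIZE - 1);
}

/* Byte offset of p within its cache line (0 .. LINE_SIZE-1). */
static unsigned line_offset(const void *p)
{
    return (unsigned)((uintptr_t)p & (LINE_SIZE - 1));
}

/* Nonzero if a and b fall into the same cache line. */
static int same_line(const void *a, const void *b)
{
    return line_base(a) == line_base(b);
}
```

In the question's `setbytes`, the four stores at offsets 0, 16, 32 and 48 all map to the same `line_base` when `p` is 64-byte aligned, which is what "the addressed cache line" refers to.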

Answer 3:

I don't have references at my fingertips to prove what I am saying, but my understanding is this: the only unit of transfer over the memory bus is the cache line, whether it goes into the cache or into some special registers. So indeed, the code you pasted fills a cache line, but it is a special cache line that does not reside in the cache. Once all bytes of this cache line have been modified, the cache line is sent directly to memory, without passing through the cache.

Source: https://stackoverflow.com/questions/14106477/how-do-non-temporal-instructions-work

Tags

caching

memory

x86

intrinsics