When writing data to a PCIe device, it is possible to use a write-combining mapping to hint to the CPU that it should generate 64-byte TLPs towards the device.
Is it possible to do something similar for reads? Somehow hint the CPU to read an entire cache line or a larger buffer instead of reading one word at a time?
Intel has a white-paper on copying from video RAM to main memory; this should be similar but a lot simpler (because the data fits in 2 or 4 vector registers).
It says that NT loads will pull a whole cache-line of data from WC memory into an LFB:

> Ordinary load instructions pull data from USWC memory in units of the same size the instruction requests. By contrast, a streaming load instruction such as MOVNTDQA will commonly pull a full cache line of data to a special "fill buffer" in the CPU. Subsequent streaming loads would read from that fill buffer, incurring much less delay.
Use AVX2 `_mm256_stream_load_si256()` or the SSE4.1/AVX1 128-bit version, `_mm_stream_load_si128()`.

Fill buffers are a limited resource, so you definitely want the compiler to generate asm that does the two aligned loads of a 64-byte cache-line back to back, then stores to regular memory.
If you're doing more than one 64-byte block at a time, see Intel's white-paper for a suggestion on using a small bounce buffer that stays hot in L1d to avoid mixing stores to DRAM with NT loads. (L1d evictions to DRAM, like NT stores, also require line-fill buffers, LFBs).
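A minimal sketch of that read pattern (the function and parameter names are mine, not the white paper's): two back-to-back aligned NT loads pull the whole 64-byte line through one fill buffer, then plain stores put it into ordinary write-back memory. It assumes `src` is 64-byte aligned and actually mapped WC; compile with AVX2 enabled (e.g. `-mavx2`).

```c
#include <immintrin.h>

/* Copy one 64-byte block from a WC-mapped device region into normal memory.
 * src must be 64-byte aligned and mapped WC, otherwise movntdqa behaves
 * like an ordinary load. Hypothetical helper, not Intel's code. */
static inline void copy64_from_wc(void *dst, void *src)
{
    __m256i lo = _mm256_stream_load_si256((__m256i *)src);      /* vmovntdqa */
    __m256i hi = _mm256_stream_load_si256((__m256i *)src + 1);
    _mm256_storeu_si256((__m256i *)dst,     lo);   /* plain stores to WB memory */
    _mm256_storeu_si256((__m256i *)dst + 1, hi);
}
```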
Note that `_mm256_stream_load_si256()` is not useful at all on memory types other than WC. The NT hint is ignored on current hardware, but it costs an extra ALU uop anyway vs. a regular load. There is `prefetchnta`, but that's a totally different beast.
Intel posted a white paper on how to do 64B PCIe transfers.
The principles are:
- Map the region as WC.
- Use the following code to write 64B:

```c
/* pcie_memory_address: byte pointer into the WC-mapped region, 64B aligned */
_mm256_store_si256((__m256i *)(pcie_memory_address),      ymm0);
_mm256_store_si256((__m256i *)(pcie_memory_address + 32), ymm1);
_mm_mfence();
```
Here `_mm256_store_si256` is the intrinsic for `(v)movdqa`, and the `mfence` is used to order the stores with respect to later ones and to flush the WC buffer.
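As a minimal self-contained sketch (the function and parameter names, the byte-pointer assumption, and the unaligned source load are mine, not the white paper's), the full 64B write could look like this:

```c
#include <immintrin.h>

/* Push one 64-byte block to a WC-mapped BAR.
 * wc_dst: 64-byte-aligned pointer into the WC mapping.
 * src:    ordinary 64-byte buffer in normal (WB) memory. */
static inline void pcie_write64(void *wc_dst, const void *src)
{
    __m256i lo = _mm256_loadu_si256((const __m256i *)src);
    __m256i hi = _mm256_loadu_si256((const __m256i *)src + 1);

    _mm256_store_si256((__m256i *)wc_dst,     lo);  /* vmovdqa into the WC buffer */
    _mm256_store_si256((__m256i *)wc_dst + 1, hi);
    _mm_mfence();  /* order against later stores and evict the (full) WC buffer */
}
```

Compile with AVX enabled (e.g. `-mavx`); whether the two 32B stores actually leave the core as a single 64B transaction depends on the WC-buffer behaviour discussed below.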
As far as my limited understanding of the WC part of the cache subsystem goes, this relies on a number of assumptions:
The CPU writes a WC buffer as a burst-transaction only if the WC buffer is full:
> The only elements of WC propagation to the system bus that are guaranteed are those provided by transaction atomicity. For example, with a P6 family processor, a completely full WC buffer will always be propagated as a single 32-byte burst transaction using any chunk order. In a WC buffer eviction where data will be evicted as partials, all data contained in the same chunk (0 mod 8 aligned) will be propagated simultaneously.
So one must be sure to start with an empty WC buffer and fill it completely, otherwise a partial transaction (32B in the example above) may be made and, even worse, the upper chunk may be written before the lower one.
There is a practical experiment on Intel's forums, using an FPGA, where the WC buffer was sometimes flushed prematurely.
The WC cache type ensures the core writes a burst transaction, but the uncore must also be able to handle this transaction as a whole.
In particular, after subtractive decoding, the Root Complex must be able to process it as a single 64B transaction.
From the same forum post as above, it seems that the uncore is able to coalesce sequential WC writes into a single TLP, but playing with the write ordering (e.g. swapping the two `_mm256_store_si256` calls or leaving a hole for sizes smaller than 64B) may fall outside the Root Complex's capabilities.