Question
I just noticed that it is possible at all to have (CUDA kernel) memory accesses be uncached (see e.g. this answer here on SO).
Can this be done...
- For a single kernel individually?
- At run time rather than at compile time?
- For writes only rather than for reads and writes?
Answer 1:
- Only if you compile that kernel individually, because this is an instruction-level feature enabled by code generation. You could also use inline PTX assembly to issue `ld.global.cg` instructions for particular load operations within a kernel [see here for details].
- No, it is an instruction-level feature of PTX. You can JIT a version of the code containing non-caching memory loads at run time, but that is still technically compilation. You could probably use some template tricks and separate compilation to have the runtime hold two versions of the same code, built with and without caching, and choose between them at run time. You could also use the same tricks to get two versions of a given kernel, with or without inline PTX for uncached loads [see here for one possibility of achieving this].
- These non-caching instructions bypass the L1 cache, going to the L2 cache with byte-level granularity. So they are load-only (all writes invalidate the L1 cache and store to L2).
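The two ideas above (an inline-PTX `ld.global.cg` load, and a template trick that keeps a cached and an uncached variant of the same kernel in the binary so the host can pick one at run time) can be sketched as follows. This is a minimal illustration; the names `load_cg`, `copy_kernel`, and `launch` are hypothetical, not from the original post.

```cuda
#include <cuda_runtime.h>

// Load a float with ld.global.cg: cache at L2 only, bypassing L1.
__device__ __forceinline__ float load_cg(const float* p)
{
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// One source, two instantiations: UNCACHED is a compile-time flag,
// so each variant contains only one kind of load instruction.
template <bool UNCACHED>
__global__ void copy_kernel(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = UNCACHED ? load_cg(in + i) : in[i];
}

// Host side: both versions exist in the binary; choose at run time.
void launch(float* out, const float* in, int n, bool uncached)
{
    int block = 256, grid = (n + block - 1) / block;
    if (uncached)
        copy_kernel<true><<<grid, block>>>(out, in, n);
    else
        copy_kernel<false><<<grid, block>>>(out, in, n);
}
```

Because `UNCACHED` is a template parameter, the ternary is resolved at compile time and each instantiation carries no run-time branch.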
Answer 2:
I don't know if it was possible before, but CUDA 8.0 makes it possible to fine-tune caching for specific reads and writes. See the PTX manual for details.
For example, to make this code always go to the main memory on read:
const float4 val = input[i];
you could write the following:
float4 val;
const float4* myinput = input+i;
asm("ld.global.cv.v4.f32 {%0, %1, %2, %3}, [%4];" : "=f"(val.x), "=f"(val.y), "=f"(val.z), "=f"(val.w) : "l"(myinput));
I managed to speed up one of my cache-intensive kernels by about 20% using non-cached reads and writes for data that was, by design, accessed only once.
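For the write side mentioned above, the PTX `st.global.wt` cache operator (write-through) is one option for stores, mirroring the `ld.global.cv` read shown earlier. A hedged sketch, with the helper name `store_wt` being my own; whether this actually helps is architecture-dependent and should be measured:

```cuda
// Store a float4 with st.global.wt: write-through, rather than the
// default write-back (.wb) caching behaviour.
__device__ __forceinline__ void store_wt(float4* p, float4 v)
{
    asm volatile("st.global.wt.v4.f32 [%0], {%1, %2, %3, %4};"
                 :
                 : "l"(p), "f"(v.x), "f"(v.y), "f"(v.z), "f"(v.w)
                 : "memory");
}
```

Inside a kernel you would write `store_wt(&output[i], val);` in place of `output[i] = val;` for the stores you want uncached.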
Source: https://stackoverflow.com/questions/30420774/making-some-but-not-all-cuda-memory-accesses-uncached