How many memory latency cycles per memory access type in OpenCL/CUDA?
Question: I looked through the programming guide and the best practices guide, and they mention that global memory access takes 400-600 cycles. I did not see much on the other memory types, such as texture cache, constant cache, and shared memory. Registers have zero memory latency. I think constant cache is the same as registers if all threads use the same address in constant cache; for the worst case I am not so sure. Is shared memory the same as registers so long as there are no bank conflicts? If there are, then how does