Could you please explain the difference between using both the L1 and L2 caches versus only the L2 cache in CUDA programming? What should I expect in execution time? When should I expect a smaller GPU time: when I enable both L1 and L2, or only L2? Thanks.
Typically you would leave both L1 and L2 caches enabled. You should try to coalesce your memory accesses as much as possible, i.e. threads within a warp should access data within the same 128B segment (see the CUDA Programming Guide for more on this topic); a rough sketch of the pattern follows below.
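As a rough illustration (the kernels and names below are made up for this answer, not taken from the original post), a warp whose threads read consecutive 4-byte elements can be serviced from a single 128B segment, whereas a strided pattern touches a different segment per thread:

```cuda
// Minimal sketch: coalesced vs. strided global loads.

__global__ void coalesced_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];            // threads 0..31 of a warp hit one 128B segment
}

__global__ void strided_copy(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];   // each thread may touch a different segment
}
```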
Some programs cannot be optimised in this manner because, for example, their memory accesses are essentially random. For those cases it may be beneficial to bypass the L1 cache, thereby avoiding loading an entire 128B line when you only want, say, 4 bytes (you'll still load 32B, since that is the minimum transaction size). Clearly there is an efficiency gain: 4 useful bytes out of 128 improves to 4 out of 32.
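One way to experiment with this (the exact behaviour is architecture-dependent; the kernel below is an illustrative sketch, not your code) is to compile with `-Xptxas -dlcm=cg`, which caches global loads in L2 only, versus the default `-Xptxas -dlcm=ca`, which caches in both L1 and L2. On newer toolkits you can also request the L2-only policy per load with the `__ldcg()` intrinsic:

```cuda
// Hedged sketch of a scattered-gather kernel where bypassing L1 may help.
// Compile with:  nvcc -Xptxas -dlcm=ca ...   (cache in L1 and L2, 128B transactions)
//           or:  nvcc -Xptxas -dlcm=cg ...   (cache in L2 only, 32B transactions)

__global__ void random_gather(const float *table, const int *idx,
                              float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Scattered 4-byte reads: with L1 bypassed, each miss fetches a 32B
        // segment instead of a 128B line, wasting less bandwidth.
        out[i] = __ldcg(&table[idx[i]]);   // per-load "cache global" hint (CUDA 8.0+)
    }
}
```

Whether this actually reduces GPU time depends on how irregular your access pattern really is, so it is worth timing both compile options on your own kernels.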
Source: https://stackoverflow.com/questions/10180949/cuda-programming-l1-and-l2-caches