问题
I am confused by 2 documents from NVidia. "CUDA Best Practices" describes that shared memory is organized in banks, and in general in 32-bit mode each 4 bytes is a bank (that is how I understood it). However Parallel Prefix Sum (Scan) with CUDA goes into details how padding should be added to scan algorithm because of bank conflicts.
The problem for me is, the basic type for this algorithm as presented is float and its size is 4 bytes. Thus each float is a bank and there is no bank conflict.
So is my understanding correct -- i.e. if you work on 4*N-byte types you don't have to worry about bank conflicts because by definition there will be none? If no, how should I understand it (when to use padding)?
回答1:
You might be interested in this webinar from the NVIDIA CUDA webinar page Shared memory including banks are described also on slides 35-45 from this webinar.
In general shared memory bank conflicts can occur any time two different threads are attempting to access (from the same kernel instruction) locations within shared memory for which the lower 4 (pre-cc2.0 devices) or 5 bits (cc2.0 and newer devices) of the address are the same. When a bank conflict does occur, the shared memory system serializes accesses to locations that are in the same bank, thus reducing performance. Padding attempts to avoid this for some access patterns. Note that for cc2.0 and newer, if all the bits are the same (i.e. same location) this does not cause a bank conflict.
Pictorially, we can look at it like this:
__shared__ int A[2048];
int my;
my = A[0]; // A[0] is in bank 0
my = A[1]; // A[1] is in bank 1
my = A[2]; // A[2] is in bank 2
...
my = A[31]; // A[31] is in bank 31 (cc2.0 or newer device)
my = A[32]; // A[32] is in bank 0
my = A[33]; // A[33] is in bank 1
now, if we access shared memory across threads in a warp, we may hit bank conflicts:
my = A[threadIdx.x]; // no bank conflicts or serialization - handled in one trans.
my = A[threadIdx.x*2]; // 2-way bank conflicts - will cause 2 level serialization
my = A[threadIdx.x*32]; // 32-way bank conflicts - will cause 32 level serialization
Let's take a closer look at the 2-way bank conflict above. Since we are multiplying threadIdx.x
by 2, thread 0 accesses location 0 in bank 0 but thread 16 accesses location 32 which is also in bank 0, thus creating a bank conflict. For the 32-way example above, all the addresses correspond to bank 0. Thus 32 transactions to shared memory must occur to satisfy this request, as they are all serialized.
So to answer the question, if I knew that my access patterns would be like this for example:
my = A[threadIdx.x*32];
Then I might want pad my data storage so that A[32]
is a dummy/pad location, as is A[64]
, A[96]
etc.
Then I could fetch the same data like this:
my = A[threadIdx.x*33];
And get my data with no bank conflicts.
Hope this helps.
回答2:
Your understanding is false. Bank conflicts happen when threads from the same warp are accessing different values that reside in the same bank.
From CUDA C Programming guide:
To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.
However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.
Padding is used to avoid bank conflicts. When you know your shared memory access pattern you can determine how to pad your share memory array to avoid bank conflicts.
For example if let's say you have __shared__ float x[32][32];
and each thread with thread index tid is accessing x like this somevariable = x[tid][0];
. This will cause 32-way bank conflict because all the threads are accessing different values from the same bank.
To avoid conflicts you pad the array in the first dimension with one more element: __shared__ float x[32][33];
. That will completely eliminate bank conflicts because now each row will have an bank location that is offset by one against the previous row.
来源:https://stackoverflow.com/questions/15056842/when-is-padding-for-shared-memory-really-required