When is padding for shared memory really required?

深忆病人 2020-12-30 17:47

I am confused by 2 documents from NVIDIA. "CUDA Best Practices" describes that shared memory is organized in banks, and in general in 32-bit mode each 4 bytes is a bank (t

2 answers
  • 2020-12-30 18:09

    Your understanding is incorrect. Bank conflicts happen when threads from the same warp access different values that reside in the same bank.

    From the CUDA C Programming Guide:

    To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.

    However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.

    Padding is used to avoid bank conflicts. When you know your shared memory access pattern, you can determine how to pad your shared memory array to avoid bank conflicts.

    For example, say you have __shared__ float x[32][32]; and each thread with thread index tid accesses x like this: somevariable = x[tid][0];. This causes a 32-way bank conflict, because consecutive rows start 32 floats apart, so all 32 threads access different values from the same bank.
    To avoid the conflicts, you pad each row with one extra element (that is, extend the second dimension from 32 to 33): __shared__ float x[32][33];. That completely eliminates the bank conflicts for this access pattern, because each row now starts at a bank location that is offset by one from the previous row.
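To check the x[tid][0] example by hand, here is a minimal host-side sketch in plain C++ (it models the bank arithmetic, it does not measure a GPU). It assumes 32 banks that are each one float wide; the names kBanks and conflictDegreeForColumnRead are made up for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <set>

// Assumption for this model: 32 banks, each 4 bytes (one float) wide.
constexpr std::size_t kBanks = 32;

// Worst-case number of distinct words that one warp requests from a single
// bank when thread tid reads x[tid][0] and each row holds rowLength floats.
std::size_t conflictDegreeForColumnRead(std::size_t rowLength) {
    std::array<std::set<std::size_t>, kBanks> wordsPerBank;
    for (std::size_t tid = 0; tid < 32; ++tid) {
        std::size_t word = tid * rowLength;        // linear index of x[tid][0]
        wordsPerBank[word % kBanks].insert(word);  // consecutive words map to consecutive banks
    }
    std::size_t worst = 0;
    for (const auto& words : wordsPerBank) {
        if (words.size() > worst) worst = words.size();
    }
    return worst;
}
```

Under these assumptions, conflictDegreeForColumnRead(32) models the unpadded x[32][32] array and conflictDegreeForColumnRead(33) models the padded x[32][33] array, so you can try other row lengths before committing to a layout.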

  • 2020-12-30 18:24

    You might be interested in this webinar from the NVIDIA CUDA webinar page. Shared memory, including banks, is also described on slides 35-45 from this webinar.

    In general, shared memory bank conflicts can occur any time two different threads of a warp attempt to access (from the same kernel instruction) locations within shared memory for which the lower 4 bits (pre-cc2.0 devices) or lower 5 bits (cc2.0 and newer devices) of the 32-bit word index (the byte address divided by 4) are the same. When a bank conflict does occur, the shared memory system serializes the accesses to locations that are in the same bank, thus reducing performance. Padding attempts to avoid this for some access patterns. Note that for cc2.0 and newer, if all the bits are the same (i.e. same location), this does not cause a bank conflict.
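As a rough sketch of that rule, the helpers below compute the bank of a byte address and test whether two accesses from the same instruction would collide. This is a host-side model in plain C++, not a CUDA API: the names bankOf and conflicts are hypothetical, the banks are assumed to be 4 bytes wide, and the "same 32-bit word does not conflict" rule models cc2.0 and newer devices only.

```cpp
#include <cstdint>

// Bank that holds a given shared memory byte address, assuming 4-byte-wide banks.
// numBanks is 16 for pre-cc2.0 devices and 32 for cc2.0 and newer devices.
std::uint32_t bankOf(std::uint32_t byteAddress, std::uint32_t numBanks) {
    return (byteAddress / 4) % numBanks;  // low bits of the 32-bit word index
}

// True when two accesses hit the same bank but different 32-bit words.
// Accesses to the same word are treated as conflict-free (cc2.0+ behavior).
bool conflicts(std::uint32_t addressA, std::uint32_t addressB, std::uint32_t numBanks) {
    bool sameWord = (addressA / 4) == (addressB / 4);
    return !sameWord && bankOf(addressA, numBanks) == bankOf(addressB, numBanks);
}
```

For an int array, element A[i] sits at byte offset 4 * i, so bankOf(4 * i, 32) reproduces the A[i]-to-bank mapping used in this answer.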

    Pictorially, we can look at it like this:

    __shared__ int A[2048];
    int my;
    my = A[0]; // A[0] is in bank 0
    my = A[1]; // A[1] is in bank 1
    my = A[2]; // A[2] is in bank 2
    ...
    my = A[31]; // A[31] is in bank 31 (cc2.0 or newer device)
    my = A[32]; // A[32] is in bank 0
    my = A[33]; // A[33] is in bank 1
    

    Now, if we access shared memory across threads in a warp, we may hit bank conflicts:

    my = A[threadIdx.x];    // no bank conflicts or serialization - handled in one trans.
    my = A[threadIdx.x*2];  // 2-way bank conflicts - will cause 2 level serialization
    my = A[threadIdx.x*32]; // 32-way bank conflicts - will cause 32 level serialization
    

    Let's take a closer look at the 2-way bank conflict above. Since we are multiplying threadIdx.x by 2, thread 0 accesses location 0 in bank 0 but thread 16 accesses location 32 which is also in bank 0, thus creating a bank conflict. For the 32-way example above, all the addresses correspond to bank 0. Thus 32 transactions to shared memory must occur to satisfy this request, as they are all serialized.
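The three cases above follow one pattern: for a full warp reading A[threadIdx.x * stride] with 4-byte elements and 32 banks, the conflict degree works out to gcd(stride, 32). Here is a small host-side sketch of that arithmetic in plain C++. The function names are hypothetical, and it assumes stride >= 1 and a 32-thread warp.

```cpp
// Greatest common divisor by Euclid's algorithm.
unsigned greatestCommonDivisor(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned remainder = a % b;
        a = b;
        b = remainder;
    }
    return a;
}

// Modeled conflict degree when each of 32 threads reads A[threadIdx.x * stride].
// Assumes 32 banks of 4-byte words and stride >= 1.
unsigned stridedConflictDegree(unsigned stride) {
    // Thread t lands in bank (t * stride) % 32, so gcd(stride, 32) threads
    // share each bank that is used, and they all read different words.
    return greatestCommonDivisor(stride, 32u);
}
```

Since 32 is a power of two, any odd stride gives a degree of 1 in this model, which is why padding by a single element is usually enough.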

    So to answer the question, if I knew that my access patterns would be like this for example:

    my = A[threadIdx.x*32]; 
    

    Then I might want to pad my data storage with one dummy/pad location after every 32 data elements, so that A[32] is a pad location, as are A[65], A[98], etc. Then I could fetch the same data like this:

    my = A[threadIdx.x*33]; 
    

    And get my data with no bank conflicts.
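If the rest of the code still thinks in unpadded indices, the padding can be hidden behind a small index mapping. The sketch below is host-side C++ that only models the arithmetic. The helper names paddedIndex and bankOfPadded are made up for this example, and it assumes one pad slot after every 32 data elements and 32 banks of 4-byte words.

```cpp
#include <cstddef>

// Map a logical (unpadded) element index to its position in storage that has
// one unused pad slot after every 32 data elements.
std::size_t paddedIndex(std::size_t logicalIndex) {
    return logicalIndex + logicalIndex / 32;
}

// Bank that holds the padded position, assuming 32 banks of 4-byte words.
std::size_t bankOfPadded(std::size_t logicalIndex) {
    return paddedIndex(logicalIndex) % 32;
}
```

With this mapping, logical index threadIdx.x*32 lands at threadIdx.x*33, matching the access shown above. The padded array needs one extra slot per 32 elements, so 2048 logical ints would take 2048 + 2048/32 = 2112 slots.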

    Hope this helps.
