I've always worked with linear shared memory (load, store, access neighbours), but I made a simple test in 2D to study bank conflicts, and its results have confused me.
Yes, shared memory is arranged in row-major order, as you expected. So your [16][16] array is stored row-wise, something like this:
         bank 0 .... bank 15
row  0 [    0  ....    15 ]
     1 [   16  ....    31 ]
     2 [   32  ....    47 ]
     3 [   48  ....    63 ]
     4 [   64  ....    79 ]
     5 [   80  ....    95 ]
     6 [   96  ....   111 ]
     7 [  112  ....   127 ]
     8 [  128  ....   143 ]
     9 [  144  ....   159 ]
    10 [  160  ....   175 ]
    11 [  176  ....   191 ]
    12 [  192  ....   207 ]
    13 [  208  ....   223 ]
    14 [  224  ....   239 ]
    15 [  240  ....   255 ]
         col 0 .... col 15
Because there are 16 32-bit shared memory banks on pre-Fermi hardware, every integer entry in a given column maps onto the same shared memory bank. So how does that interact with your choice of indexing scheme?
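To make the mapping concrete, here is a small host-side sketch (the loop and names are mine, not from your code) that computes the bank an element falls in: element [row][col] sits at 32-bit word offset row*16 + col, and with 16 banks the bank index is that offset modulo 16, which is just col. Every entry of a column therefore hits the same bank:

#include <stdio.h>

int main(void)
{
    const int NUM_BANKS = 16;  /* pre-Fermi: 16 banks of 32-bit words */
    for (int row = 0; row < 16; row++) {
        /* word offset of [row][0] in a [16][16] int array, modulo the bank count */
        int bank = (row * 16 + 0) % NUM_BANKS;
        printf("element [%2d][0] -> bank %d\n", row, bank);  /* prints bank 0 every time */
    }
    return 0;
}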
The thing to keep in mind is that threads within a block are numbered in the equivalent of column-major order (technically, the x dimension of the structure is the fastest varying, followed by y, then z). So when you use this indexing scheme:
shData[threadIdx.x][threadIdx.y]
threads within a half-warp will be reading from the same column, which implies reading from the same shared memory bank, and bank conflicts will occur. When you use the opposite scheme:
shData[threadIdx.y][threadIdx.x]
threads within the same half-warp will be reading from the same row, which means reading from each of the 16 different shared memory banks, so no conflicts occur.
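A minimal kernel pair makes the contrast easy to profile. This is a sketch under the assumptions above (the kernel and buffer names are mine, not from your test): launched with a 16x16 block on pre-Fermi hardware, the first kernel exhibits 16-way conflicts per half-warp, while the second is conflict-free.

__global__ void conflictedRead(int *out)
{
    __shared__ int shData[16][16];
    shData[threadIdx.y][threadIdx.x] = threadIdx.x;  /* conflict-free fill */
    __syncthreads();
    /* Threads 0..15 of a half-warp share threadIdx.y == 0 and vary
       threadIdx.x, so they walk down a column: 16 reads, one bank. */
    out[threadIdx.y * blockDim.x + threadIdx.x] = shData[threadIdx.x][threadIdx.y];
}

__global__ void conflictFreeRead(int *out)
{
    __shared__ int shData[16][16];
    shData[threadIdx.y][threadIdx.x] = threadIdx.x;
    __syncthreads();
    /* The same half-warp now walks across a row: 16 reads land in
       16 different banks, one word each. */
    out[threadIdx.y * blockDim.x + threadIdx.x] = shData[threadIdx.y][threadIdx.x];
}

/* Launch with a 16x16 block so each half-warp spans one row of threads,
   with dOut pointing at 256 ints of device memory:
       conflictedRead<<<1, dim3(16, 16)>>>(dOut);
       conflictFreeRead<<<1, dim3(16, 16)>>>(dOut);  */

If you run both under the visual profiler on pre-Fermi hardware, the warp serialize counter should show the difference between the two access patterns.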