问题
Lately I've been trying to get into programming for GPUs in Python using the Numba library. I have been reading up on it on their website using the tutorial there and currently I'm stuck on their example, which can be found here: https://numba.pydata.org/numba-doc/latest/cuda/examples.html. I'm attempting to generalize the example for the fast matrix multiplication a bit (which is of the form A*B=C). When testing I noticed that matrices with dimensions that are not perfectly divisible by the number of threads per block (TPB) do not yield a correct answer.
I copied the code below from the example at https://numba.pydata.org/numba-doc/latest/cuda/examples.html and created a very small test case with 4 by 4 matrices. If I choose TPB=2 everything is fine, but when I set TPB=3 it goes wrong. I understand that the code goes out of the bounds of the matrices, but I am unable to prevent this from happening (I tried some if statements on ty + i * TPB and tx + i * TPB, but these did not work.
from numba import cuda, float32
import numpy as np
import math
@cuda.jit
def fast_matmul(A, B, C):
# Define an array in the shared memory
# The size and type of the arrays must be known at compile time
sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
x, y = cuda.grid(2)
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
bpg = cuda.gridDim.x # blocks per grid
if x >= C.shape[0] and y >= C.shape[1]:
# Quit if (x, y) is outside of valid C boundary
return
# Each thread computes one element in the result matrix.
# The dot product is chunked into dot products of TPB-long vectors.
tmp = 0.
for i in range(bpg):
# Preload data into shared memory
sA[tx, ty] = A[x, ty + i * TPB]
sB[tx, ty] = B[tx + i * TPB, y]
# Wait until all threads finish preloading
cuda.syncthreads()
# Computes partial product on the shared memory
for j in range(TPB):
tmp += sA[tx, j] * sB[j, ty]
# Wait until all threads finish computing
cuda.syncthreads()
C[x, y] = tmp
#%%
x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
I would like to write some code that is not dependent on the matrices A, B, and C having dimensions that are perfectly divisible by the TPB, as these are sometimes out of my control. I understand that GPUs are only faster with matrix multiplication for very large matrices, but I wanted to use small examples to be able to check whether the answer is correct before applying it to actual data.
回答1:
There are arguably at least two errors in that posted code:
This can't possibly be a correct range check:
if x >= C.shape[0] and y >= C.shape[1]:
In order for us to decide that a particular thread in the grid not do any loading activity, we require either that
x
is out of range or thaty
is out of range. Theand
should have been anor
.It is illegal to use
cuda.syncthreads()
in conditional code, if all the threads in the block cannot participate in that statement. The previousreturn
statement in item 1 above (even if corrected fromand
toor
) pretty much guarantees this illegal behavior for problem sizes not whole-number-divisible by the threadblock size.
Therefore, to fix these issues, we cannot use just a simple return
statement for an out-of-bounds thread. Instead, at the point of load, we must only allow threads to load from global to shared memory, if the computed global load indices (for A
or B
) are in-bounds (the shared indices are in-bounds, by definition). Furthermore, when writing a result, we must only write computed results that are in-bounds for C
.
The following code has those items fixed. It seems to work correctly for your given test case:
$ cat t49.py
from numba import cuda, float32
import numpy as np
import math
@cuda.jit
def fast_matmul(A, B, C):
# Define an array in the shared memory
# The size and type of the arrays must be known at compile time
sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
x, y = cuda.grid(2)
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
bpg = cuda.gridDim.x # blocks per grid
# Each thread computes one element in the result matrix.
# The dot product is chunked into dot products of TPB-long vectors.
tmp = 0.
for i in range(bpg):
# Preload data into shared memory
sA[tx, ty] = 0
sB[tx, ty] = 0
if x < A.shape[0] and (ty+i*TPB) < A.shape[1]:
sA[tx, ty] = A[x, ty + i * TPB]
if y < B.shape[1] and (tx+i*TPB) < B.shape[0]:
sB[tx, ty] = B[tx + i * TPB, y]
# Wait until all threads finish preloading
cuda.syncthreads()
# Computes partial product on the shared memory
for j in range(TPB):
tmp += sA[tx, j] * sB[j, ty]
# Wait until all threads finish computing
cuda.syncthreads()
if x < C.shape[0] and y < C.shape[1]:
C[x, y] = tmp
#%%
x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 6. 6. 6. 6.]
[22. 22. 22. 22.]
[38. 38. 38. 38.]
[54. 54. 54. 54.]]
[[ 6. 6. 6. 6.]
[22. 22. 22. 22.]
[38. 38. 38. 38.]
[54. 54. 54. 54.]]
========= ERROR SUMMARY: 0 errors
$
(Note the use of and
here in the bounds tests is correct. Testing whether a set of indices are in-bound is different in a boolean sense compared to testing whether as set of indices is out-of-bounds. In the in-bounds test, we require both to be in-bounds. In the out-of-bounds test, either index out-of-bounds is disqualifying).
I'm not suggesting the above code is defect-free or suitable for any particular purpose. It is offered to demonstrate possible fixes for the issues I identified. Getting a shared-memory tiled matrix multiply to work in every imaginable configuration is non-trivial, as you have discovered, and I've not tested it beyond what is shown here. (For example, if you decided to make TPB larger than 32, you would run into other problems. Also, the original posted code is advertised only for square matrix multiplication, and this will not work in the general non-square case.)
As noted above, the posted code and the above code with "fixes" will not correctly handle the general non-square case. I believe some straightforward modifications will allow us to handle the non-square case. In a nutshell, we must size the grid large enough to handle the dimensions of both input matrices, while still only writing results for the in-bounds values of the output matrix. Here is a lightly tested example:
$ cat t49.py
from numba import cuda, float32
import numpy as np
import math
@cuda.jit
def fast_matmul(A, B, C):
# Define an array in the shared memory
# The size and type of the arrays must be known at compile time
sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
x, y = cuda.grid(2)
tx = cuda.threadIdx.x
ty = cuda.threadIdx.y
bpg = cuda.gridDim.x # blocks per grid
# Each thread computes one element in the result matrix.
# The dot product is chunked into dot products of TPB-long vectors.
tmp = 0.
for i in range(bpg):
# Preload data into shared memory
sA[ty, tx] = 0
sB[ty, tx] = 0
if y < A.shape[0] and (tx+i*TPB) < A.shape[1]:
sA[ty, tx] = A[y, tx + i * TPB]
if x < B.shape[1] and (ty+i*TPB) < B.shape[0]:
sB[ty, tx] = B[ty + i * TPB, x]
# Wait until all threads finish preloading
cuda.syncthreads()
# Computes partial product on the shared memory
for j in range(TPB):
tmp += sA[ty, j] * sB[j, tx]
# Wait until all threads finish computing
cuda.syncthreads()
if y < C.shape[0] and x < C.shape[1]:
C[y, x] = tmp
#%%
x_h = np.arange(115).reshape([5,23])
y_h = np.ones([23,7])
z_h = np.zeros([5,7])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
#TPB must be an integer between 1 and 32
TPB = 32
threadsperblock = (TPB, TPB)
grid_y_max = max(x_h.shape[0],y_h.shape[0])
grid_x_max = max(x_h.shape[1],y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 253. 253. 253. 253. 253. 253. 253.]
[ 782. 782. 782. 782. 782. 782. 782.]
[1311. 1311. 1311. 1311. 1311. 1311. 1311.]
[1840. 1840. 1840. 1840. 1840. 1840. 1840.]
[2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
[[ 253. 253. 253. 253. 253. 253. 253.]
[ 782. 782. 782. 782. 782. 782. 782.]
[1311. 1311. 1311. 1311. 1311. 1311. 1311.]
[1840. 1840. 1840. 1840. 1840. 1840. 1840.]
[2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
========= ERROR SUMMARY: 0 errors
$
I've also reordered the sense of x
and y
(and usage of tx
and ty
) to fix a performance issue in the above code.
Again, no claims of defect free. Furthermore I'm sure "more optimal" code could be arrived at. However optimizing matrix multiplication is an exercise that should fairly quickly lead to using a library implementation. Using cupy here for the GPU approach should be a fairly straightforward way to tap into a high-quality matrix multiply routine on the GPU.
来源:https://stackoverflow.com/questions/64197780/how-to-generalize-fast-matrix-multiplication-on-gpu-using-numba