Dynamic matrix multiplication with CUDA

一整个雨季 2020-12-21 15:08

The idea of the simple program I've been trying to write is to take input from the user to determine how large a matrix to multiply.

I am looking to take the input

1 Answer
  • 2020-12-21 15:34

    This isn't a very clear question, so this answer is something of a guess based on what you have previously asked in several rather similar earlier questions.

    A good starting point to understanding how to do this sort of operation is to go back to the beginning and think about the matrix-matrix multiplication problem from first principles. You are interested in code to calculate the product of two matrices, C = AB. The restriction you have is that the kernel you are using can only compute products of matrices which are round multiples of some internal block size. So what can you do?

    One way to look at the problem is to imagine that the A and B matrices are block matrices. The matrix multiply can then be written like this:

    $$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$

    and the resulting matrix C can then be formed from combinations of the products of the eight submatrices of A and B:

    $$C = AB = \begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}$$

    It might not be immediately obvious how this helps solve the problem, but let's consider a concrete example:

    1. You have an optimal matrix multiplication kernel which uses an internal block size of 32, and which is only correct when the matrix dimensions are round multiples of that block size.
    2. You have a pair of 1000x1000 square matrices to multiply.

    The first fact implies that your kernel can only correctly compute either a 1024x1024 product or a 992x992 product, but not the 1000x1000 operation you need.
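    As a small illustration of where those two candidate sizes come from (the block size of 32 and the function names are just the values assumed in this example), they are the dimension rounded down and up to the nearest multiple of the kernel's internal block size:

        // Round a matrix dimension to the nearest multiple of the kernel's
        // internal block size (32 in this example).
        const int BLOCK = 32;

        int round_down(int n) { return (n / BLOCK) * BLOCK; }               // 1000 -> 992
        int round_up(int n)   { return ((n + BLOCK - 1) / BLOCK) * BLOCK; } // 1000 -> 1024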

    If you decide to use a 1024x1024 product, you can use the block decomposition idea to formulate the problem like this:

    $$\begin{pmatrix} A & O \\ O & O \end{pmatrix} \begin{pmatrix} B & O \\ O & O \end{pmatrix}$$

    where each O denotes a suitably sized matrix of zeros. Now you have a pair of 1024x1024 matrices, and their product will result in

    $$\begin{pmatrix} AB & O \\ O & O \end{pmatrix}$$

    i.e. the upper left block is a 1000x1000 matrix containing AB. This is effectively zero padding to achieve the correct result. In this example, it means that about 7% more computation is performed than is strictly required (1024³/1000³ ≈ 1.07). Whether that is important or not is probably application specific.
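    A sketch of what the padding strategy might look like on the host side is below. It assumes a row-major float kernel launched with one thread per output element in 32x32 blocks; matmul_kernel and padded_multiply are illustrative stand-ins rather than your actual code, and error checking is omitted:

        #include <cuda_runtime.h>

        // Stand-in for the block-size-restricted kernel: one thread per output
        // element, only correct when N is a multiple of the 32x32 launch block.
        __global__ void matmul_kernel(const float* A, const float* B, float* C, int N)
        {
            int row = blockIdx.y * blockDim.y + threadIdx.y;
            int col = blockIdx.x * blockDim.x + threadIdx.x;
            float acc = 0.0f;
            for (int k = 0; k < N; ++k)
                acc += A[row * N + k] * B[k * N + col];
            C[row * N + col] = acc;
        }

        // Zero-padding strategy: embed the n x n inputs in the top-left corner of
        // zeroed padded x padded buffers, multiply at the padded size, then copy
        // back only the n x n upper-left block of the result.
        void padded_multiply(const float* hA, const float* hB, float* hC, int n)
        {
            const int block  = 32;
            const int padded = ((n + block - 1) / block) * block;   // 1000 -> 1024
            const size_t bytes = size_t(padded) * padded * sizeof(float);

            float *dA, *dB, *dC;
            cudaMalloc(&dA, bytes);  cudaMalloc(&dB, bytes);  cudaMalloc(&dC, bytes);
            cudaMemset(dA, 0, bytes);                       // zero the padding region
            cudaMemset(dB, 0, bytes);

            // Copy the n x n host matrices into the padded device buffers row by row.
            cudaMemcpy2D(dA, padded * sizeof(float), hA, n * sizeof(float),
                         n * sizeof(float), n, cudaMemcpyHostToDevice);
            cudaMemcpy2D(dB, padded * sizeof(float), hB, n * sizeof(float),
                         n * sizeof(float), n, cudaMemcpyHostToDevice);

            dim3 threads(block, block);
            dim3 grid(padded / block, padded / block);
            matmul_kernel<<<grid, threads>>>(dA, dB, dC, padded);

            // Only the upper-left n x n block of the padded product is wanted.
            cudaMemcpy2D(hC, n * sizeof(float), dC, padded * sizeof(float),
                         n * sizeof(float), n, cudaMemcpyDeviceToHost);

            cudaFree(dA);  cudaFree(dB);  cudaFree(dC);
        }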

    The second approach would be to use the basic kernel to compute a 992x992 product, then work out a strategy to deal with the other seven products in the block decomposed version of the calculation, something like this:

    $$C = \begin{pmatrix} A_{11}B_{11} & O \\ O & O \end{pmatrix} + \begin{pmatrix} A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}$$

    with A11 and B11 being 992x992 matrices, and the O blocks zero matrices as before. At first inspection this doesn't look very helpful, but it is worth remembering that the calculations needed to form the right-hand matrix amount to only about 2% of the total computation required for the full product, because every one of those seven terms involves at least one thin 8-element-wide strip. They could easily be done on the host CPU while the GPU is doing the main calculation, and then added to the GPU result to form the final matrix. Because CUDA kernel launches are asynchronous, most of that host calculation can be completely hidden and is effectively free.
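    A rough sketch of that hybrid scheme is shown below, reusing the matmul_kernel stand-in from the previous snippet. The kernel launch returns immediately, so the host loop that accumulates the seven small border products overlaps with the GPU work; hybrid_multiply, the row-major layout, and the serial host loops are illustrative assumptions, not a tuned implementation:

        #include <cuda_runtime.h>
        #include <vector>

        // Hybrid strategy: the GPU computes the 992 x 992 product A11*B11 while
        // the host computes every term involving the thin border strips, then the
        // two partial results are combined.
        void hybrid_multiply(const float* hA, const float* hB, float* hC, int n)
        {
            const int block = 32;
            const int m = (n / block) * block;        // 1000 -> 992 (size of A11, B11)
            const size_t bytes = size_t(m) * m * sizeof(float);

            float *dA, *dB, *dC;
            cudaMalloc(&dA, bytes);  cudaMalloc(&dB, bytes);  cudaMalloc(&dC, bytes);

            // Copy the leading m x m blocks of A and B (host rows have length n).
            cudaMemcpy2D(dA, m * sizeof(float), hA, n * sizeof(float),
                         m * sizeof(float), m, cudaMemcpyHostToDevice);
            cudaMemcpy2D(dB, m * sizeof(float), hB, n * sizeof(float),
                         m * sizeof(float), m, cudaMemcpyHostToDevice);

            dim3 threads(block, block), grid(m / block, m / block);
            matmul_kernel<<<grid, threads>>>(dA, dB, dC, m);   // asynchronous launch

            // Meanwhile, the host accumulates all contributions from the border
            // strips: full sums for border rows/columns, and only the k >= m tail
            // of the sum for elements inside the m x m interior.
            std::vector<float> border(size_t(n) * n, 0.0f);
            for (int i = 0; i < n; ++i)
                for (int j = 0; j < n; ++j) {
                    const int k0 = (i < m && j < m) ? m : 0;
                    float acc = 0.0f;
                    for (int k = k0; k < n; ++k)
                        acc += hA[i * n + k] * hB[k * n + j];
                    border[i * n + j] = acc;
                }

            // Fetch the GPU result into the top-left corner of C (this copy waits
            // for the kernel), then fold in the host-computed terms.
            cudaMemcpy2D(hC, n * sizeof(float), dC, m * sizeof(float),
                         m * sizeof(float), m, cudaMemcpyDeviceToHost);
            for (int i = 0; i < n; ++i)
                for (int j = 0; j < n; ++j) {
                    if (i < m && j < m) hC[i * n + j] += border[i * n + j];
                    else                hC[i * n + j]  = border[i * n + j];
                }

            cudaFree(dA);  cudaFree(dB);  cudaFree(dC);
        }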

    This answer contains two strategies for doing what you are asking for without changing more than a single line of your current kernel code. There is obviously a third way, which is to modify the kernel itself more radically, but that is something you should try yourself first, and then ask for help if your solution doesn't work.
