问题
I'm trying to learn how to make GPU optimalized OpenCL kernells, I took example of matrix multiplication using square tiles in local memory. However I got at best case just ~10-times speedup ( ~50 Gflops ) in comparison to numpy.dot() ( 5 Gflops , it is using BLAS).
I found studies where they got speedup >200x ( >1000 Gflops ). ftp://ftp.u-aizu.ac.jp/u-aizu/doc/Tech-Report/2012/2012-002.pdf I don't know what I'm doing wrong, or if it is just because of my GPU ( nvidia GTX 275 ). Or if it is because of some pyOpenCl overhead. But I meassured also how long does take just to copy result from GPU to RAM and it is just ~10% of the matrix multiplication time.
#define BLOCK_SIZE 22
__kernel void matrixMul(
__global float* Cij,
__global float* Aik,
__global float* Bkj,
__const int ni,
__const int nj,
__const int nk
){
// WARRNING : interchange of i and j dimension lower the performance >2x on my nV GT275 GPU
int gj = get_global_id(0); int gi = get_global_id(1);
int bj = get_group_id(0); int bi = get_group_id(1); // Block index
int tj = get_local_id(0); int ti = get_local_id(1); // Thread index
int oj = bi*BLOCK_SIZE; int oi = bj*BLOCK_SIZE;
float Csub =0;
__local float As [BLOCK_SIZE][BLOCK_SIZE];
__local float Bs [BLOCK_SIZE][BLOCK_SIZE];
for (int ok = 0; ok < nk; ok += BLOCK_SIZE ) {
As[ti][tj] = Aik[ nk*(gi ) + tj + ok ]; // A[i][k]
Bs[ti][tj] = Bkj[ nj*(ti+ok) + gj ]; // B[k][j]
barrier(CLK_LOCAL_MEM_FENCE);
for (int k = 0; k < BLOCK_SIZE; ++k) Csub += As[ti][k] * Bs[k][tj];
barrier(CLK_LOCAL_MEM_FENCE);
}
Cij[ nj * ( gi ) + gj ] = Csub;
}
NOTE - the strange BLOCK_SIZE=22 is the maximum BLOCK_SIZE which does fit to max work_group_size which is 512 on my GPU. In this code must hold condition BLOCK_SIZE^2 < max work_group_size. 22=int(sqrt(512)). I tried also BLOCK_SIZE=16 or 8 but it was slower tan with 22.
I also tried simple matrixMul (without using local memory) but it was even 10-times slower than numpy.dot(). I copied the code here http://gpgpu-computing4.blogspot.cz/2009/10/matrix-multiplication-3-opencl.html they say that even the simple version (without local memory) should run 200x faster than CPU? I don't undrestand that.
the dependecne of performance in my case is:
N = 220 numpy 3.680 [Gflops] GPU 16.428 [Gflops] speedUp 4.464
N = 330 numpy 4.752 [Gflops] GPU 29.487 [Gflops] speedUp 6.205
N = 440 numpy 4.914 [Gflops] GPU 37.096 [Gflops] speedUp 7.548
N = 550 numpy 3.849 [Gflops] GPU 47.019 [Gflops] speedUp 12.217
N = 660 numpy 5.251 [Gflops] GPU 49.999 [Gflops] speedUp 9.522
N = 770 numpy 4.565 [Gflops] GPU 48.567 [Gflops] speedUp 10.638
N = 880 numpy 5.452 [Gflops] GPU 44.444 [Gflops] speedUp 8.152
N = 990 numpy 4.976 [Gflops] GPU 42.187 [Gflops] speedUp 8.478
N = 1100 numpy 5.324 [Gflops] GPU 83.187 [Gflops] speedUp 15.625
N = 1210 numpy 5.401 [Gflops] GPU 57.147 [Gflops] speedUp 10.581
N = 1320 numpy 5.450 [Gflops] GPU 48.936 [Gflops] speedUp 8.979
NOTE - the "Gflops" number is obtained as N^3/time and it does include time required to copy results from GPU to main memory, but this time is just few percent of total time especially for N>1000
maybe more pictorial is time in secons:
N = 220 numpy 0.003 [s] GPU 0.001 [s] load 0.001 [s] speedUp 5.000
N = 330 numpy 0.008 [s] GPU 0.001 [s] load 0.001 [s] speedUp 7.683
N = 440 numpy 0.017 [s] GPU 0.002 [s] load 0.001 [s] speedUp 7.565
N = 550 numpy 0.043 [s] GPU 0.004 [s] load 0.001 [s] speedUp 11.957
N = 660 numpy 0.055 [s] GPU 0.006 [s] load 0.002 [s] speedUp 9.298
N = 770 numpy 0.100 [s] GPU 0.009 [s] load 0.003 [s] speedUp 10.638
N = 880 numpy 0.125 [s] GPU 0.010 [s] load 0.000 [s] speedUp 12.097
N = 990 numpy 0.195 [s] GPU 0.015 [s] load 0.000 [s] speedUp 12.581
N = 1100 numpy 0.250 [s] GPU 0.031 [s] load 0.000 [s] speedUp 8.065
N = 1210 numpy 0.328 [s] GPU 0.031 [s] load 0.000 [s] speedUp 10.581
N = 1320 numpy 0.422 [s] GPU 0.047 [s] load 0.000 [s] speedUp 8.979
I was thinking that maybe some speed improvement can be obtained using async_work_group_copy or even read_imageui to copy blocks to local memory. But I don't understand why I have so big difference when I'm using basically the same code as people who say they have 200x speedup?????
回答1:
Without even looking at your code let me make some comments about your benchmarks. Let's ignore numpy and compare the maximum SP FLOPs/s and DP FLOPs/s of an Intel CPU versus Nvidia and AMD GPUs.
A Intel 2600K at 4 GHz can do 4 GHz * (8 AVX) * (2 ILP) * ( 4 cores) = 256 SP GFLOPs/s. For DP it's half: 128 DP GFLOPs/s. Haswell which comes out in a few weeks will double both of those. The Intel MKL library gets better than 80% efficiency in GEMM. My own GEMM code gets 70% on my i7-2700 so the 5 GFlops/s you quote with numpy is tiny and not fair to compare with.
I don't know what the GTX 275 is capable of but I would guess it's much more than 50 GFLOPs/s.
The article you reference compares a AMD 7970. They get 848 (90% efficiency) DP GFlops/s and 2646 (70% efficiency) SP GFlops/s. That's closer to 10x the performance of the CPU not 200x!
Edit: Your calculations of FLOPs is wrong it should be 2.0*n^3. That's still approximate but it's asymptotically true. Let me explain.
Consider a 3D dot product. It's x1*x2+y1*y2+z1*z2. That's 3 multiplications and two additions. So a N-dimensional dot product is n multiplications and (n-1) additions. A matrix product is equivalent to nxn dot products, i.e. n*n*n multiplications and n*n*(n-1) additions. That's approximately 2.0*n^3 FLOPS. So you should double all your Gflops/s numbers.
Edit: You might want to consider the kernel time. It's been awhile since I used OpenCL but using the C++ bindings I did something like this
queue = cl::CommandQueue(context, devices[device], CL_QUEUE_PROFILING_ENABLE|CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
//other code...run kernel
time_end = clevent.getProfilingInfo<CL_PROFILING_COMMAND_END>();
time_start = clevent.getProfilingInfo<CL_PROFILING_COMMAND_START>();
回答2:
A good GPU matrix-multiply does not just use local memory, it stores blocks of A, B, and/or C in registers (which results in higher register usage and lower occupancy but is much faster in the end). This is because GPUs have more registers than local memory (128-256KB vs 48KB for NVIDIA), and registers offer as much bandwidth as the ALUs can handle.
来源:https://stackoverflow.com/questions/16748604/opencl-matrix-multiplication-should-be-faster