OpenCL matrix multiplication should be faster?

安稳与你 提交于 2019-12-05 14:14:17

Without even looking at your code let me make some comments about your benchmarks. Let's ignore numpy and compare the maximum SP FLOPs/s and DP FLOPs/s of an Intel CPU versus Nvidia and AMD GPUs.

A Intel 2600K at 4 GHz can do 4 GHz * (8 AVX) * (2 ILP) * ( 4 cores) = 256 SP GFLOPs/s. For DP it's half: 128 DP GFLOPs/s. Haswell which comes out in a few weeks will double both of those. The Intel MKL library gets better than 80% efficiency in GEMM. My own GEMM code gets 70% on my i7-2700 so the 5 GFlops/s you quote with numpy is tiny and not fair to compare with.

I don't know what the GTX 275 is capable of but I would guess it's much more than 50 GFLOPs/s.

The article you reference compares a AMD 7970. They get 848 (90% efficiency) DP GFlops/s and 2646 (70% efficiency) SP GFlops/s. That's closer to 10x the performance of the CPU not 200x!

Edit: Your calculations of FLOPs is wrong it should be 2.0*n^3. That's still approximate but it's asymptotically true. Let me explain.

Consider a 3D dot product. It's x1*x2+y1*y2+z1*z2. That's 3 multiplications and two additions. So a N-dimensional dot product is n multiplications and (n-1) additions. A matrix product is equivalent to nxn dot products, i.e. n*n*n multiplications and n*n*(n-1) additions. That's approximately 2.0*n^3 FLOPS. So you should double all your Gflops/s numbers.

Edit: You might want to consider the kernel time. It's been awhile since I used OpenCL but using the C++ bindings I did something like this

queue = cl::CommandQueue(context, devices[device], CL_QUEUE_PROFILING_ENABLE|CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
//other code...run kernel

time_end = clevent.getProfilingInfo<CL_PROFILING_COMMAND_END>();  
time_start = clevent.getProfilingInfo<CL_PROFILING_COMMAND_START>();

A good GPU matrix-multiply does not just use local memory, it stores blocks of A, B, and/or C in registers (which results in higher register usage and lower occupancy but is much faster in the end). This is because GPUs have more registers than local memory (128-256KB vs 48KB for NVIDIA), and registers offer as much bandwidth as the ALUs can handle.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!