Can someone help me rewrite this one function (the doTheMath function) to do the calculations on the GPU? I have spent a few good days now trying to get my head around it.
Introduction and solution code
Well, you asked for it! So, listed in this post is an implementation with PyCUDA, which provides lightweight wrappers exposing most of CUDA's capabilities from within the Python environment. We will use its SourceModule functionality, which lets us write and compile CUDA kernels while staying in Python.
Getting to the problem at hand, among the computations involved we have a sliding maximum and minimum, a few differences, divisions and comparisons. For the maximum and minimum parts, which involve block max/min finding (for each sliding window), we will use the reduction technique as discussed in some detail here. This is done at the block level. For the upper-level iteration across sliding windows, we use grid-level indexing into the CUDA resources. For more info on this block and grid format, please refer to page-18. PyCUDA also supports builtins for computing reductions like max and min, but with those we lose control; specifically, we intend to use specialized memory like shared and constant memory to leverage the GPU close to its optimum level.
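To make the reduction idea concrete before diving into the kernel, here is a minimal NumPy sketch of the same halving pattern (window_max_min is a made-up name for illustration only; the kernel below does this in shared memory, one thread per slot):

import numpy as np

def window_max_min(b, c):
    """Tree-reduction sketch: max of b and min of c for one sliding window."""
    n = 1 << (len(b).bit_length() - 1)   # largest power of 2 <= window length
    bmax, cmin = b[:n].copy(), c[:n].copy()
    # First fold: bring the tail elements (beyond n) into the leading slots
    bmax[:len(b)-n] = np.maximum(bmax[:len(b)-n], b[n:])
    cmin[:len(c)-n] = np.minimum(cmin[:len(c)-n], c[n:])
    # Halving passes: after log2(n) steps, slot 0 holds the window's result
    while n > 1:
        n //= 2
        bmax[:n] = np.maximum(bmax[:n], bmax[n:2*n])
        cmin[:n] = np.minimum(cmin[:n], cmin[n:2*n])
    return bmax[0], cmin[0]

On the GPU, each slot of bmax/cmin becomes one thread's shared-memory cell, each halving pass is one loop iteration separated by __syncthreads(), and one whole call corresponds to one block, i.e. one sliding window.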
Listing out the PyCUDA-NumPy solution code -
1] PyCUDA part -
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule
mod = SourceModule("""
#define TBP 1024 // THREADS_PER_BLOCK

// Block-level reduction: out[0] = max of d1, out[1] = min of d2
// over the window of length L starting at offset
__device__ void get_Bmax_Cmin(float* out, float *d1, float *d2, int L, int offset)
{
    int tid = threadIdx.x;
    int inv = TBP;
    __shared__ float dS[TBP][2];

    dS[tid][0] = d1[tid+offset];
    dS[tid][1] = d2[tid+offset];
    __syncthreads();

    if(tid<L-TBP)      // fold in the window elements beyond the first TBP
    {
        dS[tid][0] = fmaxf(dS[tid][0], d1[tid+inv+offset]);
        dS[tid][1] = fminf(dS[tid][1], d2[tid+inv+offset]);
    }
    __syncthreads();
    inv = inv/2;

    while(inv!=0)      // tree reduction in shared memory
    {
        if(tid<inv)
        {
            dS[tid][0] = fmaxf(dS[tid][0], dS[tid+inv][0]);
            dS[tid][1] = fminf(dS[tid][1], dS[tid+inv][1]);
        }
        __syncthreads();
        inv = inv/2;
    }

    if(tid==0)
    {
        out[0] = dS[0][0];
        out[1] = dS[0][1];
    }
    __syncthreads();
}

// One block per sliding window; blockIdx.x is the window offset
__global__ void main1(float* out, float *A, float *B, float *C, float *D,
                      float *lowL, float *highL, int *NNN)
{
    int ID = blockIdx.x, tid = threadIdx.x, inv = TBP;
    int L = NNN[0];                        // batchSize
    __shared__ float steps[2];
    __shared__ int dS[TBP];

    get_Bmax_Cmin(steps, B, C, L, ID);     // steps = {Bmax, Cmin} for this window
    float Cmin = steps[1];
    float dif = steps[0] - Cmin;

    // Normalized average of A,B,C,D, flagged 1 if it lies within [lowL, highL]
    float t = (A[tid+ID] + B[tid+ID] + C[tid+ID] + D[tid+ID] - 4.0f*Cmin)/(4.0f*dif);
    dS[tid] = ((t >= lowL[tid]) & (t <= highL[tid])) ? 1 : 0;
    if(tid<L-TBP)      // fold in the window elements beyond the first TBP
    {
        t = (A[tid+inv+ID] + B[tid+inv+ID] + C[tid+inv+ID] + D[tid+inv+ID] - 4.0f*Cmin)/(4.0f*dif);
        dS[tid] += ((t >= lowL[tid+inv]) & (t <= highL[tid+inv])) ? 1 : 0;
    }
    __syncthreads();
    inv = inv/2;

    while(inv!=0)      // sum-reduce the 0/1 flags to count matches in this window
    {
        if(tid<inv)
            dS[tid] += dS[tid+inv];
        __syncthreads();
        inv = inv/2;
    }

    if(tid==0)
        out[ID] = dS[0];
}
""")
Please note that THREADS_PER_BLOCK, i.e. TBP, is to be set based on batchSize. The rule of thumb here is to assign to TBP the power-of-2 value just below batchSize. Thus, for batchSize = 2000, we need TBP as 1024.
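For reference, here is one way to compute that value from batchSize in Python (pick_tbp is a hypothetical helper, not part of the solution above, which simply hard-codes 1024):

def pick_tbp(batch_size):
    # Largest power of 2 that does not exceed batch_size, capped at the
    # 1024-thread hardware limit on threads per block (illustration only)
    return min(1 << (batch_size.bit_length() - 1), 1024)

print(pick_tbp(2000))  # 1024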
2] NumPy part -
def gpu_app_v1(A, B, C, D, batchSize, minimumLimit):
    func1 = mod.get_function("main1")
    outlen = len(A)-batchSize+1

    # Set block and grid sizes
    BSZ = (1024,1,1)
    GSZ = (outlen,1)

    dest = np.zeros(outlen).astype(np.float32)
    N = np.int32(batchSize)
    func1(drv.Out(dest), drv.In(A), drv.In(B), drv.In(C), drv.In(D), \
                         drv.In(data2b), drv.In(data2a),\
                         drv.In(N), block=BSZ, grid=GSZ)
    idx = np.flatnonzero(dest >= minimumLimit)
    return idx, dest[idx]
Benchmarking
I have tested this on a GTX 960M. Please note that PyCUDA expects the arrays to be contiguous, so we need to slice out the columns and make copies. I am expecting/assuming that the data could be read from the files such that it is spread along rows rather than columns; thus, those copies are kept out of the benchmarking function for now.
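As a side note, the same effect can be had with np.ascontiguousarray if the explicit slice-and-copy feels verbose (this assumes the data1 layout used in the benchmark below):

# Contiguous per-column copies for PyCUDA, equivalent to slicing + .copy()
A = np.ascontiguousarray(data1[:, 0])
B = np.ascontiguousarray(data1[:, 1])
C = np.ascontiguousarray(data1[:, 2])
D = np.ascontiguousarray(data1[:, 3])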
Original approach -
def org_app(data1, batchSize, minimumLimit):
    resultArray = []
    for rowNr in range(data1.shape[0]-batchSize+1):
        tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
        result = doTheMath(tmp_df, data2a, data2b)
        if (result >= minimumLimit):
            resultArray.append([rowNr , result])
    return resultArray
Timings and verification -
In [2]: #Declare variables
...: batchSize = 2000
...: sampleSize = 50000
...: resultArray = []
...: minimumLimit = 490 #use 400 on the real sample data
...:
...: #Create Random Sample Data
...: data1 = np.random.uniform(1, 100000, (sampleSize + batchSize, 4)).astype(np.float32)
...: data2b = np.random.uniform(0, 1, (batchSize)).astype(np.float32)
...: data2a = data2b + np.random.uniform(0, 1, (batchSize)).astype(np.float32)
...:
...: # Make column copies
...: A = data1[:,0].copy()
...: B = data1[:,1].copy()
...: C = data1[:,2].copy()
...: D = data1[:,3].copy()
...:
...: gpu_out1,gpu_out2 = gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
...: cpu_out1,cpu_out2 = np.array(org_app(data1, batchSize, minimumLimit)).T
...: print(np.allclose(gpu_out1, cpu_out1))
...: print(np.allclose(gpu_out2, cpu_out2))
...:
True
False
So, there are some differences between the CPU and GPU counts. Let's investigate them -
In [7]: idx = np.flatnonzero(~np.isclose(gpu_out2, cpu_out2))
In [8]: idx
Out[8]: array([12776, 15208, 17620, 18326])
In [9]: gpu_out2[idx] - cpu_out2[idx]
Out[9]: array([-1., -1., 1., 1.])
There are four instances of non-matching counts, and they are off by at most 1. Upon research, I came across some information on this. Basically, since we are using math intrinsics for the max and min computations, those, I think, cause the last binary bit of the floating point representation to differ from the CPU counterpart. This is termed ULP error and has been discussed in detail here and here.
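To see how a one-ULP difference can shift a count by 1, here is a small synthetic illustration (the numbers are made up and are not taken from the actual mismatching windows):

import numpy as np

lowL = np.float32(0.6)                         # a data2b-style lower limit
cpu_val = lowL                                 # CPU result lands exactly on the limit
gpu_val = np.nextafter(lowL, np.float32(0))    # GPU result one ULP below it

print(cpu_val >= lowL)   # True  -> counted on the CPU
print(gpu_val >= lowL)   # False -> missed on the GPU, so the count differs by 1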
Finally, putting the issue aside, let's get to the most important bit, the performance -
In [10]: %timeit org_app(data1, batchSize, minimumLimit)
1 loops, best of 3: 2.18 s per loop
In [11]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
10 loops, best of 3: 82.5 ms per loop
In [12]: 2180.0/82.5
Out[12]: 26.424242424242426
Let's try with bigger datasets. With sampleSize = 500000, we get -
In [14]: %timeit org_app(data1, batchSize, minimumLimit)
1 loops, best of 3: 23.2 s per loop
In [15]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
1 loops, best of 3: 821 ms per loop
In [16]: 23200.0/821
Out[16]: 28.25822168087698
So, the speedup stays fairly constant at around 27x.
Limitations :
1) We are using float32 numbers, as GPUs work best with those. Double precision, especially on non-server GPUs, isn't popular when it comes to performance, and since you are working with such a GPU, I tested with float32.
Further improvement :
1) We could use faster constant memory to feed in data2a and data2b, rather than global memory.
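A rough sketch of what that could look like, assuming the same setup as above (main2, lowL_c and highL_c are made-up names; the kernel body would be the same as main1, just reading from the __constant__ arrays instead of the global-memory lowL/highL arguments):

# Sketch only: stage data2b/data2a in __constant__ memory instead of passing
# them as global-memory pointers (array names and sizes are illustrative).
mod_c = SourceModule("""
#define TBP 1024
__constant__ float lowL_c[2*TBP];    // would hold data2b (lower limits)
__constant__ float highL_c[2*TBP];   // would hold data2a (upper limits)

__global__ void main2(float* out /*, A, B, C, D, NNN as before */)
{
    // ... same body as main1, but indexing lowL_c[...] / highL_c[...]
    //     instead of the lowL / highL pointer arguments ...
}
""")

# Copy the limit arrays into constant memory once, before the kernel launches
lowL_ptr, _  = mod_c.get_global("lowL_c")
highL_ptr, _ = mod_c.get_global("highL_c")
drv.memcpy_htod(lowL_ptr,  data2b)
drv.memcpy_htod(highL_ptr, data2a)

Since each of data2a and data2b is only batchSize floats (8 KB at batchSize = 2000), both fit comfortably in the 64 KB constant memory budget, and every thread in a block reads the same index pattern, which is the access pattern constant memory is cached for.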