Python: rewrite a looping numpy math function to run on GPU

前端 未结 5 394
既然无缘
既然无缘 2021-01-30 23:31

Can someone help me rewrite this one function (the doTheMath function) to do the calculations on the GPU? I used a few good days now trying to get my head

5条回答
  •  醉梦人生
    2021-01-30 23:38

    Introduction and solution code

    Well, you asked for it! So, listed in this post is an implementation with PyCUDA that uses lightweight wrappers extending most of CUDA's capabilities within Python environment. We will its SourceModule functionality that lets us write and compile CUDA kernels staying in Python environment.

    Getting to the problem at hand, among the computations involved, we have sliding maximum and minimum, few differences and divisions and comparisons. For the maximum and minimum parts, that involves block max finding (for each sliding window), we will use reduction-technique as discussed in some detail here. This would be done at block level. For the upper level iterations across sliding windows, we would use the grid level indexing into CUDA resources. For more info on this block and grid format, please refer to page-18. PyCUDA also supports builtins for computing reductions like max and min, but we lose control, specifically we intend to use specialized memory like shared and constant memory for leveraging GPU at its near to optimum level.

    Listing out the PyCUDA-NumPy solution code -

    1] PyCUDA part -

    import pycuda.autoinit
    import pycuda.driver as drv
    import numpy as np
    from pycuda.compiler import SourceModule
    
    mod = SourceModule("""
    #define TBP 1024 // THREADS_PER_BLOCK
    
    __device__ void get_Bmax_Cmin(float* out, float *d1, float *d2, int L, int offset)
    {
        int tid = threadIdx.x;
        int inv = TBP;
        __shared__ float dS[TBP][2];
    
        dS[tid][0] = d1[tid+offset];  
        dS[tid][1] = d2[tid+offset];         
        __syncthreads();
    
        if(tid= lowL[tid]) & (dS[tid] <= highL[tid])) ? 1 : 0;
         __syncthreads();
    
         if(tid= lowL[tid+inv]) & (dS[tid+inv] <= highL[tid+inv])) ? 1 : 0;
         __syncthreads();
    
        inv = inv/2;
        while(inv!=0)   
        {
            if(tid

    Please note that THREADS_PER_BLOCK, TBP is to be set based on the batchSize. The rule of thumb here is to assign power of 2 value to TBP that is just lesser than batchSize. Thus, for batchSize = 2000, we needed TBP as 1024.

    2] NumPy part -

    def gpu_app_v1(A, B, C, D, batchSize, minimumLimit):
        func1 = mod.get_function("main1")
        outlen = len(A)-batchSize+1
    
        # Set block and grid sizes
        BSZ = (1024,1,1)
        GSZ = (outlen,1)
    
        dest = np.zeros(outlen).astype(np.float32)
        N = np.int32(batchSize)
        func1(drv.Out(dest), drv.In(A), drv.In(B), drv.In(C), drv.In(D), \
                         drv.In(data2b), drv.In(data2a),\
                         drv.In(N), block=BSZ, grid=GSZ)
        idx = np.flatnonzero(dest >= minimumLimit)
        return idx, dest[idx]
    

    Benchmarking

    I have tested on GTX 960M. Please note that PyCUDA expects arrays to be of contiguous order. So, we need to slice the columns and make copies. I am expecting/assuming that the data could be read from the files such that the data is spread along rows instead of being as columns. Thus, keeping those out of the benchmarking function for now.

    Original approach -

    def org_app(data1, batchSize, minimumLimit):
        resultArray = []
        for rowNr in  range(data1.shape[0]-batchSize+1):
            tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
            result = doTheMath(tmp_df, data2a, data2b)
            if (result >= minimumLimit):
                resultArray.append([rowNr , result]) 
        return resultArray
    

    Timings and verification -

    In [2]: #Declare variables
       ...: batchSize = 2000
       ...: sampleSize = 50000
       ...: resultArray = []
       ...: minimumLimit = 490 #use 400 on the real sample data
       ...: 
       ...: #Create Random Sample Data
       ...: data1 = np.random.uniform(1, 100000, (sampleSize + batchSize, 4)).astype(np.float32)
       ...: data2b = np.random.uniform(0, 1, (batchSize)).astype(np.float32)
       ...: data2a = data2b + np.random.uniform(0, 1, (batchSize)).astype(np.float32)
       ...: 
       ...: # Make column copies
       ...: A = data1[:,0].copy()
       ...: B = data1[:,1].copy()
       ...: C = data1[:,2].copy()
       ...: D = data1[:,3].copy()
       ...: 
       ...: gpu_out1,gpu_out2 = gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
       ...: cpu_out1,cpu_out2 = np.array(org_app(data1, batchSize, minimumLimit)).T
       ...: print(np.allclose(gpu_out1, cpu_out1))
       ...: print(np.allclose(gpu_out2, cpu_out2))
       ...: 
    True
    False
    

    So, there's some differences between CPU and GPU countings. Let's investigate them -

    In [7]: idx = np.flatnonzero(~np.isclose(gpu_out2, cpu_out2))
    
    In [8]: idx
    Out[8]: array([12776, 15208, 17620, 18326])
    
    In [9]: gpu_out2[idx] - cpu_out2[idx]
    Out[9]: array([-1., -1.,  1.,  1.])
    

    There are four instances of non-matching counts. These are off at max by 1. Upon research, I came across some information on this. Basically, since we are using math intrinsics for max and min computations and those I think are causing the last binary bit in the floating pt representation to be diferent than the CPU counterpart. This is termed as ULP error and has been discused in detail here and here.

    Finally, puting the issue aside, let's get to the most important bit, the performance -

    In [10]: %timeit org_app(data1, batchSize, minimumLimit)
    1 loops, best of 3: 2.18 s per loop
    
    In [11]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
    10 loops, best of 3: 82.5 ms per loop
    
    In [12]: 2180.0/82.5
    Out[12]: 26.424242424242426
    

    Let's try with bigger datasets. With sampleSize = 500000, we get -

    In [14]: %timeit org_app(data1, batchSize, minimumLimit)
    1 loops, best of 3: 23.2 s per loop
    
    In [15]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
    1 loops, best of 3: 821 ms per loop
    
    In [16]: 23200.0/821
    Out[16]: 28.25822168087698
    

    So, the speedup stays constant at around 27.

    Limitations :

    1) We are using float32 numbers, as GPUs work best with those. Double precision specially on non-server GPUs aren't popular when it comes to performance and since you are working with such a GPU, I tested with float32.

    Further improvement :

    1) We could use faster constant memory to feed in data2a and data2b, rather than use global memory.

提交回复
热议问题