Python: rewrite a looping numpy math function to run on GPU

既然无缘 2021-01-30 23:31

Can someone help me rewrite this one function (the doTheMath function) to do the calculations on the GPU? I have spent a good few days now trying to get my head around it.

5 Answers
  •  心在旅途
    2021-01-30 23:48

    Before you start tweaking the target (GPU) or reaching for anything else (e.g. parallel execution), you might want to consider how to improve the existing code. You used the numba tag, so I'll use numba to improve the code. First, we operate on arrays, not on matrices:

    data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
    data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
    data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit
    

    Each time you call doTheMath you expect an integer back, yet the calculation uses a lot of arrays and creates a lot of intermediate arrays:

    abcd = ((((A  - Cmin) / dif) + ((B  - Cmin) / dif) + ((C   - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
    return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
    

    This creates an intermediate array at each step:

    • tmp1 = A-Cmin,
    • tmp2 = tmp1 / dif,
    • tmp3 = B - Cmin,
    • tmp4 = tmp3 / dif
    • ... you get the gist.

    However, this is a reduction (array -> integer), so all those intermediate arrays are unnecessary weight; just calculate the values "on the fly".

    import numba as nb
    
    @nb.njit
    def doTheMathNumba(tmpData, data2a, data2b):
        Bmax = np.max(tmpData[:, 1])
        Cmin = np.min(tmpData[:, 2])
        diff = (Bmax - Cmin)
        idiff = 1 / diff
        sum_ = 0
        for i in range(tmpData.shape[0]):
            val = (tmpData[i, 0] + tmpData[i, 1] + tmpData[i, 2] + tmpData[i, 3]) / 4 * idiff - Cmin * idiff
            if val <= data2a[i] and val >= data2b[i]:
                sum_ += 1
        return sum_
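    As a sanity check that the single-pass formula matches the original vectorized expression, the two variants can be compared directly. The vectorized body below is reconstructed from the snippet quoted above (the column layout A, B, C, D = columns 0-3 is an assumption about the original doTheMath), and the loop repeats doTheMathNumba's arithmetic without the decorator so it runs even without numba installed:

```python
import numpy as np

def do_the_math_vectorized(tmpData, data2a, data2b):
    # Reconstruction of the original doTheMath (assumed column layout A, B, C, D)
    A, B, C, D = tmpData[:, 0], tmpData[:, 1], tmpData[:, 2], tmpData[:, 3]
    Cmin = C.min()
    dif = B.max() - Cmin
    abcd = (((A - Cmin) / dif) + ((B - Cmin) / dif)
            + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4
    return np.where((abcd <= data2a) & (abcd >= data2b), 1, 0).sum()

def do_the_math_loop(tmpData, data2a, data2b):
    # Same arithmetic as doTheMathNumba, minus the @nb.njit decorator
    Cmin = tmpData[:, 2].min()
    idiff = 1.0 / (tmpData[:, 1].max() - Cmin)
    total = 0
    for i in range(tmpData.shape[0]):
        val = (tmpData[i, 0] + tmpData[i, 1] + tmpData[i, 2] + tmpData[i, 3]) / 4 * idiff - Cmin * idiff
        if data2b[i] <= val <= data2a[i]:
            total += 1
    return total

rng = np.random.default_rng(42)
tmp = rng.uniform(1, 100, (2000, 4))
a = rng.uniform(0, 1, 2000)
b = rng.uniform(0, 1, 2000)
print(do_the_math_vectorized(tmp, a, b), do_the_math_loop(tmp, a, b))
```

    Both return the same count, since the loop's `sum/4 * idiff - Cmin * idiff` is algebraically the mean of the four normalized columns.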
    

    I also rearranged the arithmetic to avoid repeated operations:

    (((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4
    = ((A - Cmin + B - Cmin + C - Cmin + D - Cmin) / dif) / 4
    = (A + B + C + D - 4 * Cmin) / (4 * dif)
    = (A + B + C + D) / (4 * dif) - (Cmin / dif)
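    A quick numerical check of this rearrangement (the values of Cmin and dif here are arbitrary constants chosen just for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C, D = rng.uniform(1, 100, (4, 1000))
Cmin, dif = 5.0, 40.0  # arbitrary constants for the check

lhs = (((A - Cmin) / dif) + ((B - Cmin) / dif)
       + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4
rhs = (A + B + C + D) / (4 * dif) - Cmin / dif
print(bool(np.allclose(lhs, rhs)))  # True
```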
    

    This actually cuts down the execution time by almost a factor of 10 on my computer:

    %timeit doTheMath(tmp_df, data2a, data2b)       # 1000 loops, best of 3: 446 µs per loop
    %timeit doTheMathNumba(tmp_df, data2a, data2b)  # 10000 loops, best of 3: 59 µs per loop
    

    There are certainly other possible improvements, for example using a rolling min/max to calculate Bmax and Cmin. That would make at least part of the calculation run in O(sampleSize) instead of O(sampleSize * batchSize). It would also make it possible to reuse some of the (A + B + C + D) / (4 * dif) - (Cmin / dif) calculations, because if Cmin and Bmax don't change for the next sample, these values don't differ. It's a bit complicated to do because the comparisons differ, but definitely possible! See here:

    import time
    import numpy as np
    import numba as nb
    
    @nb.njit
    def doTheMathNumba(abcd, data2a, data2b, Bmax, Cmin):
        diff = (Bmax - Cmin)
        idiff = 1 / diff
        quarter_idiff = 0.25 * idiff
        sum_ = 0
        for i in range(abcd.shape[0]):
            val = abcd[i] * quarter_idiff - Cmin * idiff
            if val <= data2a[i] and val >= data2b[i]:
                sum_ += 1
        return sum_
    
    @nb.njit
    def doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, resultArray):
        found = 0
        for rowNr in range(data1.shape[0]):
            if(abcd[rowNr:rowNr + batchSize].shape[0] == batchSize):
                result = doTheMathNumba(abcd[rowNr:rowNr + batchSize], 
                                        data2a, data2b, Bmaxs[rowNr], Cmins[rowNr])
                if (result >= minimumLimit):
                    resultArray[found, 0] = rowNr
                    resultArray[found, 1] = result
                    found += 1
        return resultArray[:found]
    
    #Declare variables
    batchSize = 2000
    sampleSize = 50000
    minimumLimit = 490  # use 400 on the real sample data
    
    data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
    data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
    data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit
    
    from scipy import ndimage
    t0 = time.time()
    abcd = np.sum(data1, axis=1)
    Bmaxs = ndimage.maximum_filter1d(data1[:, 1], 
                                     size=batchSize, 
                                     origin=-((batchSize-1)//2-1))  # correction for even shapes
    Cmins = ndimage.minimum_filter1d(data1[:, 2], 
                                     size=batchSize, 
                                     origin=-((batchSize-1)//2-1))
    
    result = np.zeros((sampleSize, 2), dtype=np.int64)
    doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, result)
    print('Runtime:', time.time() - t0)
    

    This gives me a Runtime: 0.759593152999878 (after numba compiled the functions!), while your original had Runtime: 24.68975639343262. Now we're roughly 30 times faster!

    With your sample size it still takes Runtime: 60.187848806381226 but that's not too bad, right?
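    The rolling min/max mentioned above doesn't strictly need scipy either: a monotonic deque yields each window maximum in amortized O(1), i.e. O(sampleSize) overall. A minimal pure-Python sketch (the function name and interface are mine, for illustration only; the scipy filters in the code above do the same job):

```python
from collections import deque

def rolling_max(x, w):
    """Maxima of all full windows x[i:i+w], in O(len(x)) total."""
    dq = deque()   # indices whose values are in decreasing order
    out = []
    for i, v in enumerate(x):
        while dq and x[dq[-1]] <= v:   # drop values dominated by v
            dq.pop()
        dq.append(i)
        if dq[0] <= i - w:             # drop indices that left the window
            dq.popleft()
        if i >= w - 1:                 # first full window ends at i = w - 1
            out.append(x[dq[0]])
    return out

print(rolling_max([3, 1, 4, 1, 5, 9, 2, 6], 3))  # [4, 4, 5, 9, 9, 9]
```

    The rolling minimum for Cmin is the same routine with the comparison flipped.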

    And although I haven't done it myself, the numba documentation describes writing "Numba for CUDA GPUs", and it doesn't seem too complicated.
