Python: rewrite a looping numpy math function to run on GPU

既然无缘 2021-01-30 23:31

Can someone help me rewrite this one function (the doTheMath function) to do the calculations on the GPU? I have spent a good few days now trying to get my head around it.

5 Answers
  •  心在旅途
    2021-01-30 23:48

    Before you start tweaking the target (GPU) or reaching for anything else (e.g. parallel execution), you might want to consider how to improve the existing code. You used the numba tag, so I'll use numba to improve the code. First, we operate on arrays, not on matrices:

    data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
    data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
    data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit
    

    Each time you call doTheMath you expect an integer back, yet the calculation uses a lot of arrays and creates a lot of intermediate arrays:

    abcd = ((((A  - Cmin) / dif) + ((B  - Cmin) / dif) + ((C   - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
    return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
    

    This creates an intermediate array at each step:

    • tmp1 = A-Cmin,
    • tmp2 = tmp1 / dif,
    • tmp3 = B - Cmin,
    • tmp4 = tmp3 / dif
    • ... you get the gist.

    However, this is a reduction (array -> integer), so all those intermediate arrays are unnecessary weight; just calculate the values "on the fly".

    import numba as nb
    
    @nb.njit
    def doTheMathNumba(tmpData, data2a, data2b):
        Bmax = np.max(tmpData[:, 1])
        Cmin = np.min(tmpData[:, 2])
        diff = (Bmax - Cmin)
        idiff = 1 / diff
        sum_ = 0
        for i in range(tmpData.shape[0]):
            val = (tmpData[i, 0] + tmpData[i, 1] + tmpData[i, 2] + tmpData[i, 3]) / 4 * idiff - Cmin * idiff
            if val <= data2a[i] and val >= data2b[i]:
                sum_ += 1
        return sum_
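    As a sanity check that the single-pass formula matches the original vectorized expression, the two variants can be compared directly. The vectorized body below is reconstructed from the snippet quoted above (the column layout A, B, C, D = columns 0-3 is an assumption about the original doTheMath), and the loop repeats doTheMathNumba's arithmetic without the decorator so it runs even without numba installed:

```python
import numpy as np

def do_the_math_vectorized(tmpData, data2a, data2b):
    # Reconstruction of the original doTheMath (assumed column layout A, B, C, D)
    A, B, C, D = tmpData[:, 0], tmpData[:, 1], tmpData[:, 2], tmpData[:, 3]
    Cmin = C.min()
    dif = B.max() - Cmin
    abcd = (((A - Cmin) / dif) + ((B - Cmin) / dif)
            + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4
    return np.where((abcd <= data2a) & (abcd >= data2b), 1, 0).sum()

def do_the_math_loop(tmpData, data2a, data2b):
    # Same arithmetic as doTheMathNumba, minus the @nb.njit decorator
    Cmin = tmpData[:, 2].min()
    idiff = 1.0 / (tmpData[:, 1].max() - Cmin)
    total = 0
    for i in range(tmpData.shape[0]):
        val = (tmpData[i, 0] + tmpData[i, 1] + tmpData[i, 2] + tmpData[i, 3]) / 4 * idiff - Cmin * idiff
        if data2b[i] <= val <= data2a[i]:
            total += 1
    return total

rng = np.random.default_rng(42)
tmp = rng.uniform(1, 100, (2000, 4))
a = rng.uniform(0, 1, 2000)
b = rng.uniform(0, 1, 2000)
print(do_the_math_vectorized(tmp, a, b), do_the_math_loop(tmp, a, b))
```

    Both return the same count, since the loop's `sum/4 * idiff - Cmin * idiff` is algebraically the mean of the four normalized columns.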
    

    I also rearranged the arithmetic to avoid repeated operations:

    (((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4
    = ((A - Cmin + B - Cmin + C - Cmin + D - Cmin) / dif) / 4
    = (A + B + C + D - 4 * Cmin) / (4 * dif)
    = (A + B + C + D) / (4 * dif) - (Cmin / dif)
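    A quick numerical check of this rearrangement (the values of Cmin and dif here are arbitrary constants chosen just for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C, D = rng.uniform(1, 100, (4, 1000))
Cmin, dif = 5.0, 40.0  # arbitrary constants for the check

lhs = (((A - Cmin) / dif) + ((B - Cmin) / dif)
       + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4
rhs = (A + B + C + D) / (4 * dif) - Cmin / dif
print(bool(np.allclose(lhs, rhs)))  # True
```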
    

    This actually cuts down the execution time by almost a factor of 10 on my computer:

    %timeit doTheMath(tmp_df, data2a, data2b)       # 1000 loops, best of 3: 446 µs per loop
    %timeit doTheMathNumba(tmp_df, data2a, data2b)  # 10000 loops, best of 3: 59 µs per loop
    

    There are certainly other possible improvements, for example using a rolling min/max to calculate Bmax and Cmin. That would make at least part of the calculation run in O(sampleSize) instead of O(sampleSize * batchSize). It would also make it possible to reuse some of the (A + B + C + D) / (4 * dif) - (Cmin / dif) calculations, because if Cmin and Bmax don't change for the next sample, these values don't differ. It's a bit complicated to do because the comparisons differ, but definitely possible! See here:

    import time
    import numpy as np
    import numba as nb
    
    @nb.njit
    def doTheMathNumba(abcd, data2a, data2b, Bmax, Cmin):
        diff = (Bmax - Cmin)
        idiff = 1 / diff
        quarter_idiff = 0.25 * idiff
        sum_ = 0
        for i in range(abcd.shape[0]):
            val = abcd[i] * quarter_idiff - Cmin * idiff
            if val <= data2a[i] and val >= data2b[i]:
                sum_ += 1
        return sum_
    
    @nb.njit
    def doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, resultArray):
        found = 0
        for rowNr in range(data1.shape[0]):
            if(abcd[rowNr:rowNr + batchSize].shape[0] == batchSize):
                result = doTheMathNumba(abcd[rowNr:rowNr + batchSize], 
                                        data2a, data2b, Bmaxs[rowNr], Cmins[rowNr])
                if (result >= minimumLimit):
                    resultArray[found, 0] = rowNr
                    resultArray[found, 1] = result
                    found += 1
        return resultArray[:found]
    
    #Declare variables
    batchSize = 2000
    sampleSize = 50000
    minimumLimit = 490  # use 400 on the real sample data
    
    data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
    data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
    data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit
    
    from scipy import ndimage
    t0 = time.time()
    abcd = np.sum(data1, axis=1)
    Bmaxs = ndimage.maximum_filter1d(data1[:, 1], 
                                     size=batchSize, 
                                     origin=-((batchSize-1)//2-1))  # correction for even shapes
    Cmins = ndimage.minimum_filter1d(data1[:, 2], 
                                     size=batchSize, 
                                     origin=-((batchSize-1)//2-1))
    
    result = np.zeros((sampleSize, 2), dtype=np.int64)
    doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, result)
    print('Runtime:', time.time() - t0)
    

    This gives me a Runtime: 0.759593152999878 (after numba compiled the functions!), while your original had Runtime: 24.68975639343262. Now we're roughly 30 times faster!

    With your sample size it still takes Runtime: 60.187848806381226 but that's not too bad, right?
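    The rolling min/max mentioned above doesn't strictly need scipy either: a monotonic deque yields each window maximum in amortized O(1), i.e. O(sampleSize) overall. A minimal pure-Python sketch (the function name and interface are mine, for illustration only; the scipy filters in the code above do the same job):

```python
from collections import deque

def rolling_max(x, w):
    """Maxima of all full windows x[i:i+w], in O(len(x)) total."""
    dq = deque()   # indices whose values are in decreasing order
    out = []
    for i, v in enumerate(x):
        while dq and x[dq[-1]] <= v:   # drop values dominated by v
            dq.pop()
        dq.append(i)
        if dq[0] <= i - w:             # drop indices that left the window
            dq.popleft()
        if i >= w - 1:                 # first full window ends at i = w - 1
            out.append(x[dq[0]])
    return out

print(rolling_max([3, 1, 4, 1, 5, 9, 2, 6], 3))  # [4, 4, 5, 9, 9, 9]
```

    The rolling minimum for Cmin is the same routine with the comparison flipped.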

    And although I haven't done it myself, the numba documentation describes writing "Numba for CUDA GPUs", and it doesn't seem too complicated.
