Python: rewrite a looping numpy math function to run on GPU


Can someone help me rewrite this one function (the doTheMath function) to do the calculations on the GPU? I have spent a good few days now trying to get my head …

5 Answers
  • 2021-01-30 23:38

    Introduction and solution code

    Well, you asked for it! So, listed in this post is an implementation with PyCUDA, which provides lightweight wrappers exposing most of CUDA's capabilities within the Python environment. We will use its SourceModule functionality, which lets us write and compile CUDA kernels without leaving Python.

    Getting to the problem at hand, the computations involved are a sliding maximum and minimum, a few differences, divisions and comparisons. For the maximum and minimum parts, which involve finding the block max/min for each sliding window, we will use the reduction technique discussed in some detail here. This is done at the block level. For the upper-level iteration across sliding windows, we use grid-level indexing into CUDA resources. For more info on this block and grid format, please refer to page-18. PyCUDA also supports built-ins for computing reductions like max and min, but then we lose control; specifically, we intend to use specialized memory such as shared and constant memory to leverage the GPU close to its optimum.
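    As a tiny NumPy analogue of the per-block tree reduction the kernel below performs (purely illustrative, not part of the solution code): repeatedly fold the upper half of a power-of-2-sized buffer into the lower half until one value remains.

    import numpy as np

    def block_reduce_max(x):
        # x plays the role of the TBP-sized shared-memory buffer (power-of-2 length)
        x = x.copy()
        inv = len(x) // 2
        while inv:
            x[:inv] = np.maximum(x[:inv], x[inv:2 * inv])
            inv //= 2
        return x[0]

    vals = np.random.rand(1024).astype(np.float32)
    assert block_reduce_max(vals) == vals.max()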

    Listing out the PyCUDA-NumPy solution code -

    1] PyCUDA part -

    import pycuda.autoinit
    import pycuda.driver as drv
    import numpy as np
    from pycuda.compiler import SourceModule
    
    mod = SourceModule("""
    #define TBP 1024 // THREADS_PER_BLOCK
    
    // Per-window reduction: computes the max over d1 and the min over d2 for one sliding window
    __device__ void get_Bmax_Cmin(float* out, float *d1, float *d2, int L, int offset)
    {
        int tid = threadIdx.x;
        int inv = TBP;
        __shared__ float dS[TBP][2];
    
        dS[tid][0] = d1[tid+offset];  
        dS[tid][1] = d2[tid+offset];         
        __syncthreads();
    
        if(tid<L-TBP)  
        {
            dS[tid][0] = fmaxf(dS[tid][0] , d1[tid+inv+offset]);
            dS[tid][1] = fminf(dS[tid][1] , d2[tid+inv+offset]);
        }
        __syncthreads();
        inv = inv/2;
    
        while(inv!=0)   // tree reduction: halve the number of active threads each step
        {
            if(tid<inv)
            {
                dS[tid][0] = fmaxf(dS[tid][0] , dS[tid+inv][0]);
                dS[tid][1] = fminf(dS[tid][1] , dS[tid+inv][1]);
            }
            __syncthreads();
            inv = inv/2;
        }
        __syncthreads();
    
        if(tid==0)
        {
            out[0] = dS[0][0];
            out[1] = dS[0][1];
        }   
        __syncthreads();
    }
    
    // One block per sliding window (blockIdx.x); TBP threads cover up to 2*TBP window elements
    __global__ void main1(float* out, float *d0, float *d1, float *d2, float *d3, float *lowL, float *highL, int *BLOCKLEN)
    {
        int L = BLOCKLEN[0];
        int tid = threadIdx.x;
        int iterID = blockIdx.x;
        float Bmax_Cmin[2];
        int inv;
        float Cmin, dif;   
        __shared__ float dS[TBP*2];   
    
        get_Bmax_Cmin(Bmax_Cmin, d1, d2, L, iterID);  
        Cmin = Bmax_Cmin[1];
        dif = (Bmax_Cmin[0] - Cmin);
    
        inv = TBP;
    
        dS[tid] = (d0[tid+iterID] + d1[tid+iterID] + d2[tid+iterID] + d3[tid+iterID] - 4.0*Cmin) / (4.0*dif);
        __syncthreads();
    
        if(tid<L-TBP)  
            dS[tid+inv] = (d0[tid+inv+iterID] + d1[tid+inv+iterID] + d2[tid+inv+iterID] + d3[tid+inv+iterID] - 4.0*Cmin) / (4.0*dif);                   
    
         dS[tid] = ((dS[tid] >= lowL[tid]) & (dS[tid] <= highL[tid])) ? 1 : 0;
         __syncthreads();
    
         if(tid<L-TBP)
             dS[tid] += ((dS[tid+inv] >= lowL[tid+inv]) & (dS[tid+inv] <= highL[tid+inv])) ? 1 : 0;
         __syncthreads();
    
        inv = inv/2;
        while(inv!=0)   // sum-reduce the per-element flags into the count for this window
        {
            if(tid<inv)
                dS[tid] += dS[tid+inv];
            __syncthreads();
            inv = inv/2;
        }
    
        if(tid==0)
            out[iterID] = dS[0];
        __syncthreads();
    
    }
    """)
    

    Please note that THREADS_PER_BLOCK (TBP) is to be set based on batchSize. The rule of thumb here is to assign TBP the largest power-of-2 value that is just less than batchSize. Thus, for batchSize = 2000, we need TBP = 1024.
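    As a small helper (my own sketch of the rule above, not part of the original code), TBP can be picked programmatically:

    def pick_tbp(batchSize):
        """Largest power of 2 not exceeding batchSize."""
        tbp = 1
        while tbp * 2 <= batchSize:
            tbp *= 2
        return tbp

    pick_tbp(2000)  # -> 1024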

    2] NumPy part -

    def gpu_app_v1(A, B, C, D, batchSize, minimumLimit):
        func1 = mod.get_function("main1")
        outlen = len(A)-batchSize+1
    
        # Set block and grid sizes
        BSZ = (1024,1,1)
        GSZ = (outlen,1)
    
        dest = np.zeros(outlen).astype(np.float32)
        N = np.int32(batchSize)
        # data2a / data2b are the global upper / lower limit arrays (highL / lowL in the kernel)
        func1(drv.Out(dest), drv.In(A), drv.In(B), drv.In(C), drv.In(D), \
                         drv.In(data2b), drv.In(data2a),\
                         drv.In(N), block=BSZ, grid=GSZ)
        idx = np.flatnonzero(dest >= minimumLimit)
        return idx, dest[idx]
    

    Benchmarking

    I have tested on a GTX 960M. Please note that PyCUDA expects arrays to be in contiguous order, so we need to slice out the columns and make copies. I am expecting/assuming that the data could be read from the files such that it is spread along rows instead of columns. Thus, I am keeping those copies out of the benchmarking function for now.
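    For reference, a minimal sketch of such a column copy (np.ascontiguousarray, used here as an assumption, is equivalent to the .copy() calls in the benchmark below):

    # data1[:, 0] is a strided view; PyCUDA needs a contiguous float32 buffer
    A = np.ascontiguousarray(data1[:, 0], dtype=np.float32)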

    Original approach -

    def org_app(data1, batchSize, minimumLimit):
        resultArray = []
        for rowNr in  range(data1.shape[0]-batchSize+1):
            tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
            result = doTheMath(tmp_df, data2a, data2b)
            if (result >= minimumLimit):
                resultArray.append([rowNr , result]) 
        return resultArray
    

    Timings and verification -

    In [2]: #Declare variables
       ...: batchSize = 2000
       ...: sampleSize = 50000
       ...: resultArray = []
       ...: minimumLimit = 490 #use 400 on the real sample data
       ...: 
       ...: #Create Random Sample Data
       ...: data1 = np.random.uniform(1, 100000, (sampleSize + batchSize, 4)).astype(np.float32)
       ...: data2b = np.random.uniform(0, 1, (batchSize)).astype(np.float32)
       ...: data2a = data2b + np.random.uniform(0, 1, (batchSize)).astype(np.float32)
       ...: 
       ...: # Make column copies
       ...: A = data1[:,0].copy()
       ...: B = data1[:,1].copy()
       ...: C = data1[:,2].copy()
       ...: D = data1[:,3].copy()
       ...: 
       ...: gpu_out1,gpu_out2 = gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
       ...: cpu_out1,cpu_out2 = np.array(org_app(data1, batchSize, minimumLimit)).T
       ...: print(np.allclose(gpu_out1, cpu_out1))
       ...: print(np.allclose(gpu_out2, cpu_out2))
       ...: 
    True
    False
    

    So, there are some differences between the CPU and GPU counts. Let's investigate them -

    In [7]: idx = np.flatnonzero(~np.isclose(gpu_out2, cpu_out2))
    
    In [8]: idx
    Out[8]: array([12776, 15208, 17620, 18326])
    
    In [9]: gpu_out2[idx] - cpu_out2[idx]
    Out[9]: array([-1., -1.,  1.,  1.])
    

    There are four instances of non-matching counts, off by at most 1. Upon research, I came across some information on this. Basically, since we are using math intrinsics for the max and min computations, these, I think, cause the last binary bit in the floating-point representation to differ from the CPU counterpart. This is termed ULP error and has been discussed in detail here and here.
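    As a quick sanity check (a sketch using the gpu_out2 / cpu_out2 and idx computed above), one can confirm that every mismatch is off by at most one count, which is what a single sample flipping sides of a boundary would produce:

    assert np.abs(gpu_out2[idx] - cpu_out2[idx]).max() <= 1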

    Finally, putting that issue aside, let's get to the most important bit, the performance -

    In [10]: %timeit org_app(data1, batchSize, minimumLimit)
    1 loops, best of 3: 2.18 s per loop
    
    In [11]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
    10 loops, best of 3: 82.5 ms per loop
    
    In [12]: 2180.0/82.5
    Out[12]: 26.424242424242426
    

    Let's try with bigger datasets. With sampleSize = 500000, we get -

    In [14]: %timeit org_app(data1, batchSize, minimumLimit)
    1 loops, best of 3: 23.2 s per loop
    
    In [15]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
    1 loops, best of 3: 821 ms per loop
    
    In [16]: 23200.0/821
    Out[16]: 28.25822168087698
    

    So, the speedup stays roughly constant at around 27x.

    Limitations :

    1) We are using float32 numbers, as GPUs work best with those. Double precision, especially on non-server GPUs, isn't popular when it comes to performance, and since you are working with such a GPU, I tested with float32.

    Further improvement :

    1) We could use faster constant memory to feed in data2a and data2b, rather than global memory.
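    A minimal sketch of that idea with PyCUDA (illustrative only; the constant-array names and the hard-coded size of 2000 are my assumptions):

    # inside the SourceModule string:
    #     __constant__ float c_lowL[2000];
    #     __constant__ float c_highL[2000];
    # and from Python, copy the limits once before launching the kernel:
    c_lowL_ptr, _ = mod.get_global("c_lowL")
    c_highL_ptr, _ = mod.get_global("c_highL")
    drv.memcpy_htod(c_lowL_ptr, data2b)   # lower limits
    drv.memcpy_htod(c_highL_ptr, data2a)  # upper limits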

  • 2021-01-30 23:39

    Tweak #1

    It's usually advised to vectorize things when working with NumPy arrays. But with very large arrays, I think you are out of options there. So, to boost performance, a minor tweak is possible at the last step, the summing.

    We could replace the step that makes an array of 1s and 0s and does the summing:

    np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
    

    with np.count_nonzero that works efficiently to count True values in a boolean array, instead of converting to 1s and 0s -

    np.count_nonzero((abcd <= data2a) & (abcd >= data2b))
    

    Runtime test -

    In [45]: abcd = np.random.randint(11,99,(10000))
    
    In [46]: data2a = np.random.randint(11,99,(10000))
    
    In [47]: data2b = np.random.randint(11,99,(10000))
    
    In [48]: %timeit np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
    10000 loops, best of 3: 81.8 µs per loop
    
    In [49]: %timeit np.count_nonzero((abcd <= data2a) & (abcd >= data2b))
    10000 loops, best of 3: 28.8 µs per loop
    

    Tweak #2

    Use a pre-computed reciprocal when dealing with cases that undergo implicit broadcasting. Some more info here. Thus, store the reciprocal of dif and use it instead at this step:

    ((((A  - Cmin) / dif) + ((B  - Cmin) / dif) + ...
    

    Sample test -

    In [52]: A = np.random.rand(10000)
    
    In [53]: dif = 0.5
    
    In [54]: %timeit A/dif
    10000 loops, best of 3: 25.8 µs per loop
    
    In [55]: %timeit A*(1.0/dif)
    100000 loops, best of 3: 7.94 µs per loop
    

    You have four places using division by dif, so hopefully this would bring a noticeable boost there too!
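    For reference, a sketch of the rewritten step with a pre-computed reciprocal (same variable names as in the original doTheMath):

    rdif = 1.0 / dif   # computed once per window
    abcd = (((A - Cmin) * rdif) + ((B - Cmin) * rdif) + ((C - Cmin) * rdif) + ((D - Cmin) * rdif)) / 4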

  • 2021-01-30 23:48

    Before you start tweaking the target (GPU) or using anything else (i.e. parallel execution), you might want to consider how to improve the already existing code. You used the numba tag, so I'll use numba to improve the code. First, we operate on arrays, not on matrices:

    data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
    data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
    data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit
    

    Each time you call doTheMath you expect an integer back; however, you use a lot of array operations and create a lot of intermediate arrays:

    abcd = ((((A  - Cmin) / dif) + ((B  - Cmin) / dif) + ((C   - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
    return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
    

    This creates an intermediate array each step:

    • tmp1 = A-Cmin,
    • tmp2 = tmp1 / dif,
    • tmp3 = B - Cmin,
    • tmp4 = tmp3 / dif
    • ... you get the gist.

    However, this is a reduce function (array -> integer), so having a lot of intermediate arrays is unnecessary weight; just calculate the values on the fly.

    import numba as nb
    
    @nb.njit
    def doTheMathNumba(tmpData, data2a, data2b):
        Bmax = np.max(tmpData[:, 1])
        Cmin = np.min(tmpData[:, 2])
        diff = (Bmax - Cmin)
        idiff = 1 / diff
        sum_ = 0
        for i in range(tmpData.shape[0]):
            val = (tmpData[i, 0] + tmpData[i, 1] + tmpData[i, 2] + tmpData[i, 3]) / 4 * idiff - Cmin * idiff
            if val <= data2a[i] and val >= data2b[i]:
                sum_ += 1
        return sum_
    

    I did something else here to avoid multiple operations:

    (((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4
    = ((A - Cmin + B - Cmin + C - Cmin + D - Cmin) / dif) / 4
    = (A + B + C + D - 4 * Cmin) / (4 * dif)
    = (A + B + C + D) / (4 * dif) - (Cmin / dif)
    

    This actually cuts down the execution time by almost a factor of 10 on my computer:

    %timeit doTheMath(tmp_df, data2a, data2b)       # 1000 loops, best of 3: 446 µs per loop
    %timeit doTheMathNumba(tmp_df, data2a, data2b)  # 10000 loops, best of 3: 59 µs per loop
    

    There are certainly other improvements as well, for example using a rolling min/max to calculate Bmax and Cmin. That would make at least part of the calculation run in O(sampleSize) instead of O(sampleSize * batchSize). It would also make it possible to reuse some of the (A + B + C + D) / (4 * dif) - (Cmin / dif) calculations, because if Cmin and Bmax don't change for the next sample these values don't differ. It's a bit complicated to do because the comparisons differ, but definitely possible! See here:

    import time
    import numpy as np
    import numba as nb
    
    @nb.njit
    def doTheMathNumba(abcd, data2a, data2b, Bmax, Cmin):
        diff = (Bmax - Cmin)
        idiff = 1 / diff
        quarter_idiff = 0.25 * idiff
        sum_ = 0
        for i in range(abcd.shape[0]):
            val = abcd[i] * quarter_idiff - Cmin * idiff
            if val <= data2a[i] and val >= data2b[i]:
                sum_ += 1
        return sum_
    
    @nb.njit
    def doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, resultArray):
        found = 0
        for rowNr in range(data1.shape[0]):
            if(abcd[rowNr:rowNr + batchSize].shape[0] == batchSize):
                result = doTheMathNumba(abcd[rowNr:rowNr + batchSize], 
                                        data2a, data2b, Bmaxs[rowNr], Cmins[rowNr])
                if (result >= minimumLimit):
                    resultArray[found, 0] = rowNr
                    resultArray[found, 1] = result
                    found += 1
        return resultArray[:found]
    
    #Declare variables
    batchSize = 2000
    sampleSize = 50000
    resultArray = []
    minimumLimit = 490 #use 400 on the real sample data 
    
    data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
    data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
    data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit
    
    from scipy import ndimage
    t0 = time.time()
    abcd = np.sum(data1, axis=1)
    Bmaxs = ndimage.maximum_filter1d(data1[:, 1], 
                                     size=batchSize, 
                                     origin=-((batchSize-1)//2-1))  # correction for even shapes
    Cmins = ndimage.minimum_filter1d(data1[:, 2], 
                                     size=batchSize, 
                                     origin=-((batchSize-1)//2-1))
    
    result = np.zeros((sampleSize, 2), dtype=np.int64)
    doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, result)
    print('Runtime:', time.time() - t0)
    

    This gives me a Runtime: 0.759593152999878 (after numba compiled the functions!), while your original had Runtime: 24.68975639343262. Now we're 30 times faster!

    With your sample size it still takes Runtime: 60.187848806381226 but that's not too bad, right?

    And even though I haven't done this myself, numba says that it's possible to write "Numba for CUDA GPUs" and it doesn't seem too complicated.
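    For completeness, an untested, minimal sketch of what the inner count could look like with numba.cuda, one thread per sliding window (all names here are illustrative, not code from this answer):

    from numba import cuda

    @cuda.jit
    def count_in_bounds(abcd, data2a, data2b, Bmaxs, Cmins, batchSize, out):
        row = cuda.grid(1)                     # one thread handles one window start
        if row < out.size:
            idiff = 1.0 / (Bmaxs[row] - Cmins[row])
            cnt = 0
            for i in range(batchSize):
                val = abcd[row + i] * 0.25 * idiff - Cmins[row] * idiff
                if val <= data2a[i] and val >= data2b[i]:
                    cnt += 1
            out[row] = cnt

    # hypothetical launch configuration:
    # threads = 256
    # blocks = (len(out) + threads - 1) // threads
    # count_in_bounds[blocks, threads](abcd, data2a, data2b, Bmaxs, Cmins, batchSize, out)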

  • 2021-01-30 23:52

    Here is some code to demonstrate what is possible by just tweaking the algorithm. It's pure numpy but on the sample data you posted gives a roughly 35x speedup over the original version (~1,000,000 samples in ~2.5sec on my rather modest machine):

    >>> result_dict = master('run')
    [('load', 0.82578349113464355), ('precomp', 0.028138399124145508), ('max/min', 0.24333405494689941), ('ABCD', 0.015314102172851562), ('main', 1.3356468677520752)]
    TOTAL 2.44821691513
    

    Tweaks used:

    • A+B+C+D, see my other answer
    • running min/max, including avoiding computing (A+B+C+D - 4Cmin)/(4dif) multiple times with the same Cmin/dif.

    These are more or less routine. That leaves the comparison with data2a/b, which is expensive: O(NK), where N is the number of samples and K is the size of the window. Here one can take advantage of the relatively well-behaved data. Using the running min/max, one can create variants of data2a/b that test a whole range of window offsets at a time: if the test fails, all these offsets can be ruled out immediately; otherwise the range is bisected.

    import numpy as np
    import time
    
    # global variables; they will hold the precomputed pre-screening filters
    preA, preB = {}, {}
    CHUNK_SIZES = None
    
    def sliding_argmax(data, K=2000):
        """compute the argmax of data over a sliding window of width K
    
        returns:
            indices  -- indices into data
            switches -- window offsets at which the maximum changes
                        (strictly speaking: where the index of the maximum changes)
                        excludes 0 but includes maximum offset (len(data)-K+1)
    
        see last line of compute_pre_screening_filter for a recipe to convert
        this representation to the vector of maxima
        """
        N = len(data)
        last = np.argmax(data[:K])
        indices = [last]
        while indices[-1] <= N - 1:
            ge = np.where(data[last + 1 : last + K + 1] > data[last])[0]
            if len(ge) == 0:
                if last + K >= N:
                    break
                last += 1 + np.argmax(data[last + 1 : last + K + 1])
                indices.append(last)
            else:
                last += 1 + ge[0]
                indices.append(last)
        indices = np.array(indices)
        switches = np.where(data[indices[1:]] > data[indices[:-1]],
                            indices[1:] + (1-K), indices[:-1] + 1)
        return indices, np.r_[switches, [len(data)-K+1]]
    
    
    def compute_pre_screening_filter(bound, n_offs):
        """compute pre-screening filter for point-wise upper bound
    
        given a K-vector of upper bounds B and K+n_offs-1-vector data
        compute K+n_offs-1-vector filter such that for each index j
        if for any offset 0 <= o < n_offs and index 0 <= i < K such that
        o + i = j, the inequality B_i >= data_j holds then filter_j >= data_j
    
        therefore the number of data points below filter is an upper bound for
        the maximum number of points below bound in any K-window in data
        """
        pad_l, pad_r = np.min(bound[:n_offs-1]), np.min(bound[1-n_offs:])
        padded = np.r_[pad_l+np.zeros(n_offs-1,), bound, pad_r+np.zeros(n_offs-1,)]
        indices, switches = sliding_argmax(padded, n_offs)
        return padded[indices].repeat(np.diff(np.r_[[0], switches]))
    
    
    def compute_all_pre_screening_filters(upper, lower, min_chnk=5, dyads=6):
        """compute upper and lower pre-screening filters for data blocks of
        sizes K+n_offs-1 where
        n_offs = min_chnk, 2min_chnk, ..., 2^(dyads-1)min_chnk
    
        the result is stored in global variables preA and preB
        """
        global CHUNK_SIZES
    
        CHUNK_SIZES = min_chnk * 2**np.arange(dyads)
        preA[1] = upper
        preB[1] = lower
        for n in CHUNK_SIZES:
            preA[n] = compute_pre_screening_filter(upper, n)
            preB[n] = -compute_pre_screening_filter(-lower, n)
    
    
    def test_bounds(block, counts, threshold=400):
        """test whether the windows fitting in the data block 'block' fall
        within the bounds using pre-screening for efficient bulk rejection
    
        array 'counts' will be overwritten with the counts of compliant samples
        note that accurate counts will only be returned for above threshold
        windows, because the analysis of bulk rejected windows is short-circuited
    
        also note that bulk rejection only works for 'well behaved' data and
        for example not on random numbers
        """
        N = len(counts)
        K = len(preA[1])
        r = N % CHUNK_SIZES[0]
        # chop up N into as large as possible chunks with matching pre computed
        # filters
        # start with small and work upwards
        counts[:r] = [np.count_nonzero((block[l:l+K] <= preA[1]) &
                                       (block[l:l+K] >= preB[1]))
                      for l in range(r)]
    
        def bisect(block, counts):
            M = len(counts)
            cnts = np.count_nonzero((block <= preA[M]) & (block >= preB[M]))
            if cnts < threshold:
                counts[:] = cnts
                return
            elif M == CHUNK_SIZES[0]:
                counts[:] = [np.count_nonzero((block[l:l+K] <= preA[1]) &
                                              (block[l:l+K] >= preB[1]))
                             for l in range(M)]
            else:
                M //= 2
                bisect(block[:-M], counts[:M])
                bisect(block[M:], counts[M:])
    
        N = N // CHUNK_SIZES[0]
        for M in CHUNK_SIZES:
            if N % 2:
                bisect(block[r:r+M+K-1], counts[r:r+M])
                r += M
            elif N == 0:
                return
            N //= 2
        else:
            for j in range(2*N):
                bisect(block[r:r+M+K-1], counts[r:r+M])
                r += M
    
    
    def analyse(data, use_pre_screening=True, min_chnk=5, dyads=6,
                threshold=400):
        samples, upper, lower = data
        N, K = samples.shape[0], upper.shape[0]
        times = [time.time()]
        if use_pre_screening:
            compute_all_pre_screening_filters(upper, lower, min_chnk, dyads)
        times.append(time.time())
        # compute switching points of max and min for running normalisation
        upper_inds, upper_swp = sliding_argmax(samples[:, 1], K)
        lower_inds, lower_swp = sliding_argmax(-samples[:, 2], K)
        times.append(time.time())
        # sum columns
        ABCD = samples.sum(axis=-1)
        times.append(time.time())
        counts = np.empty((N-K+1,), dtype=int)
        # main loop
        # loop variables:
        offs = 0
        u_ind, u_scale, u_swp = 0, samples[upper_inds[0], 1], upper_swp[0]
        l_ind, l_scale, l_swp = 0, samples[lower_inds[0], 2], lower_swp[0]
        while True:
            # check which is switching next, min(C) or max(B)
            if u_swp > l_swp:
                # greedily take the largest block possible such that dif and Cmin
                # do not change
                block = (ABCD[offs:l_swp+K-1] - 4*l_scale) \
                        * (0.25 / (u_scale-l_scale))
                if use_pre_screening:
                    test_bounds(block, counts[offs:l_swp], threshold=threshold)
                else:
                    counts[offs:l_swp] = [
                        np.count_nonzero((block[l:l+K] <= upper) &
                                         (block[l:l+K] >= lower))
                        for l in range(l_swp - offs)]
                # book keeping
                l_ind += 1
                offs = l_swp
                l_swp = lower_swp[l_ind]
                l_scale = samples[lower_inds[l_ind], 2]
            else:
                block = (ABCD[offs:u_swp+K-1] - 4*l_scale) \
                        * (0.25 / (u_scale-l_scale))
                if use_pre_screening:
                    test_bounds(block, counts[offs:u_swp], threshold=threshold)
                else:
                    counts[offs:u_swp] = [
                        np.count_nonzero((block[l:l+K] <= upper) &
                                         (block[l:l+K] >= lower))
                        for l in range(u_swp - offs)]
                u_ind += 1
                if u_ind == len(upper_inds):
                    assert u_swp == N-K+1
                    break
                offs = u_swp
                u_swp = upper_swp[u_ind]
                u_scale = samples[upper_inds[u_ind], 1]
        times.append(time.time())
        return {'counts': counts, 'valid': np.where(counts >= 400)[0],
                'timings': np.diff(times)}
    
    
    def master(mode='calibrate', data='fake', use_pre_screening=True, nrep=3,
               min_chnk=None, dyads=None):
        t = time.time()
        if data in ('fake', 'load'):
            data1 = np.loadtxt('data1.csv', delimiter=';', skiprows=1,
                               usecols=[1,2,3,4])
            data2a = np.loadtxt('data2a.csv', delimiter=';', skiprows=1,
                                usecols=[1])
            data2b = np.loadtxt('data2b.csv', delimiter=';', skiprows=1,
                                usecols=[1])
            if data == 'fake':
                data1 = np.tile(data1, (10, 1))
            threshold = 400
        elif data == 'random':
            data1 = np.random.random((102000, 4))
            data2b = np.random.random(2000)
            data2a = np.random.random(2000)
            threshold = 490
            if use_pre_screening or mode == 'calibrate':
                print('WARNING: pre-screening not efficient on artificial data')
        else:
            raise ValueError("data mode {} not recognised".format(data))
        data = data1, data2a, data2b
        t_load = time.time() - t
        if mode == 'calibrate':
            min_chnk = (2, 3, 4, 5, 6) if min_chnk is None else min_chnk
            dyads = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) if dyads is None else dyads
            timings = np.zeros((len(min_chnk), len(dyads)))
            print('max bisect  ' + ' '.join([
                '   n.a.' if dy == 0 else '{:7d}'.format(dy) for dy in dyads]),
                  end='')
            for i, mc in enumerate(min_chnk):
                print('\nmin chunk {}'.format(mc), end=' ')
                for j, dy in enumerate(dyads):
                    for k in range(nrep):
                        if dy == 0: # no pre-screening
                            timings[i, j] += analyse(
                                data, False, mc, dy, threshold)['timings'][3]
                        else:
                            timings[i, j] += analyse(
                                data, True, mc, dy, threshold)['timings'][3]
                    timings[i, j] /= nrep
                    print('{:7.3f}'.format(timings[i, j]), end=' ', flush=True)
            best_mc, best_dy = np.unravel_index(np.argmin(timings.ravel()),
                                                timings.shape)
            print('\nbest', min_chnk[best_mc], dyads[best_dy])
            return timings, min_chnk[best_mc], dyads[best_dy]
        if mode == 'run':
            min_chnk = 2 if min_chnk is None else min_chnk
            dyads = 5 if dyads is None else dyads
            res = analyse(data, use_pre_screening, min_chnk, dyads, threshold)
            times = np.r_[[t_load], res['timings']]
            print(list(zip(('load', 'precomp', 'max/min', 'ABCD', 'main'), times)))
            print('TOTAL', times.sum())
            return res
    
  • 2021-01-30 23:52

    This is technically off-topic (not GPU) but I'm sure you'll be interested.

    There is one obvious and rather large saving:

    Precompute A + B + C + D (not in the loop, on the whole data: data1.sum(axis=-1)), because abcd = ((A+B+C+D) - 4Cmin) / (4dif). This will save quite a few ops.

    Surprised nobody spotted that one before ;-)
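    A minimal sketch of that saving (the helper name is mine; Bmax and Cmin still come from the rolling window as before):

    import numpy as np

    ABCD = data1.sum(axis=-1)   # computed once, outside the loop

    def doTheMath_presummed(ABCD_win, Bmax, Cmin, data2a, data2b):
        dif = Bmax - Cmin
        abcd = (ABCD_win - 4 * Cmin) / (4 * dif)
        return np.count_nonzero((abcd <= data2a) & (abcd >= data2b))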

    Edit:

    There is another thing, though I suspect that's only in your example, not in your real data:

    As it stands, roughly half of data2a will be smaller than data2b. In these places your conditions on abcd cannot both be True, so you needn't even compute abcd there.
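    A sketch of that shortcut (feasible is a name I'm introducing here, not from the answer):

    feasible = data2a >= data2b                      # positions where both conditions can hold
    data2a_f, data2b_f = data2a[feasible], data2b[feasible]
    # ...then only compare abcd[feasible] against data2a_f / data2b_f inside the loop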

    Edit:

    One more tweak I used below but forgot to mention: suppose you compute the max (or min) over a moving window. When you move one point to the right, how likely is the max to change? There are only two things that can change it: the new point on the right is larger (this happens roughly once in window-length steps, and even when it happens, you immediately know the new max), or the old max falls off the window on the left (also roughly once in window-length steps). Only in this last case do you have to search the entire window for the new max.
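    A sketch of that incremental update for the maximum (illustrative helper, not code used below):

    import numpy as np

    def update_window_max(B, max_idx, start, K):
        """Index of the max of B[start:start+K], given max_idx of the previous window."""
        right = start + K - 1
        if max_idx < start:                       # the old max fell off the left edge: rescan
            max_idx = start + np.argmax(B[start:start + K])
        elif B[right] >= B[max_idx]:              # the new right-hand point takes over
            max_idx = right
        return max_idx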

    Edit:

    Couldn't resist giving it a try in tensorflow. I don't have a GPU, so you will have to test it for speed yourself. Put "gpu" in place of "cpu" on the marked line.

    On cpu it is about half as fast as your original implementation (i.e. without Divakar's tweaks). Note that I've taken the liberty of changing the inputs from matrix to plain array. Currently tensorflow is a bit of a moving target, so make sure you have the right version. I used Python 3.6 and tf 0.12.1. If you do a pip3 install tensorflow-gpu today, it might work.

    import numpy as np
    import time
    import tensorflow as tf
    
    # currently the max/min code is sequential
    # thus
    parallel_iterations = 1
    # but you can put this in a separate loop, precompute and then try and run
    # the remainder of doTheMathTF with a larger parallel_iterations
    
    # tensorflow is quite capricious about its data types
    ddf = tf.float64
    ddi = tf.int32
    
    def worker(data1, data2a, data2b):
        ###################################
        # CHANGE cpu to gpu in next line! #
        ###################################
        with tf.device('/cpu:0'):
            g = tf.Graph()
            with g.as_default():
                ABCD = tf.constant(data1.sum(axis=-1), dtype=ddf)
                B = tf.constant(data1[:, 1], dtype=ddf)
                C = tf.constant(data1[:, 2], dtype=ddf)
                window = tf.constant(len(data2a))
                N = tf.constant(data1.shape[0] - len(data2a) + 1, dtype=ddi)
                data2a = tf.constant(data2a, dtype=ddf)
                data2b = tf.constant(data2b, dtype=ddf)
                def doTheMathTF(i, Bmax, Bmaxind, Cmin, Cminind, out):
                    # most of the time we can keep the old max/min
                    Bmaxind = tf.cond(Bmaxind<i,
                                      lambda: i + tf.to_int32(
                                          tf.argmax(B[i:i+window], axis=0)),
                                      lambda: tf.cond(Bmax>B[i+window-1], 
                                                      lambda: Bmaxind, 
                                                      lambda: i+window-1))
                    Cminind = tf.cond(Cminind<i,
                                      lambda: i + tf.to_int32(
                                          tf.argmin(C[i:i+window], axis=0)),
                                      lambda: tf.cond(Cmin<C[i+window-1],
                                                      lambda: Cminind,
                                                      lambda: i+window-1))
                    Bmax = B[Bmaxind]
                    Cmin = C[Cminind]
                    abcd = (ABCD[i:i+window] - 4 * Cmin) * (1 / (4 * (Bmax-Cmin)))
                    out = out.write(i, tf.to_int32(
                        tf.count_nonzero(tf.logical_and(abcd <= data2a,
                                                        abcd >= data2b))))
                    return i + 1, Bmax, Bmaxind, Cmin, Cminind, out
                with tf.Session(graph=g) as sess:
                    i, Bmaxind, Bmax, Cminind, Cmin, out = tf.while_loop(
                        lambda i, _1, _2, _3, _4, _5: i<N, doTheMathTF,
                        (tf.Variable(0, dtype=ddi), tf.Variable(0.0, dtype=ddf),
                         tf.Variable(-1, dtype=ddi),
                         tf.Variable(0.0, dtype=ddf), tf.Variable(-1, dtype=ddi),
                         tf.TensorArray(ddi, size=N)),
                        shape_invariants=None,
                        parallel_iterations=parallel_iterations,
                        back_prop=False)
                    out = out.pack()
                    sess.run(tf.initialize_all_variables())
                    out, = sess.run((out,))
        return out
    
    #Declare variables
    batchSize = 2000
    sampleSize = 50000#00
    resultArray = []
    
    #Create Sample Data
    data1 = np.random.uniform(1, 100, (sampleSize + batchSize, 4))
    data2a = np.random.uniform(0, 1, (batchSize,))
    data2b = np.random.uniform(0, 1, (batchSize,))
    
    t0 = time.time()
    out = worker(data1, data2a, data2b)
    print('Runtime (tensorflow):', time.time() - t0)
    
    
    good_indices, = np.where(out >= 490)
    res_tf = np.c_[good_indices, out[good_indices]]
    
    def doTheMath(tmpData1, data2a, data2b):
        A = tmpData1[:, 0]
        B  = tmpData1[:,1]
        C   = tmpData1[:,2]
        D = tmpData1[:,3]
        Bmax = B.max()
        Cmin  = C.min()
        dif = (Bmax - Cmin)
        abcd = ((((A  - Cmin) / dif) + ((B  - Cmin) / dif) + ((C   - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
        return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
    
    #Loop through the data
    t0 = time.time()
    for rowNr in  range(sampleSize+1):
        tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
        result = doTheMath(tmp_df, data2a, data2b)
        if (result >= 490):
            resultArray.append([rowNr , result])
    print('Runtime (original):', time.time() - t0)
    print(np.alltrue(np.array(resultArray)==res_tf))
    