Can someone help me rewrite this one function (the doTheMath
function) to do the calculations on the GPU? I used a few good days now trying to get my head
Introduction and solution code
Well, you asked for it! So, listed in this post is an implementation with PyCUDA that uses lightweight wrappers extending most of CUDA's capabilities within Python environment. We will its SourceModule
functionality that lets us write and compile CUDA kernels staying in Python environment.
Getting to the problem at hand, among the computations involved, we have sliding maximum and minimum, few differences and divisions and comparisons. For the maximum and minimum parts, that involves block max finding (for each sliding window), we will use reduction-technique as discussed in some detail here. This would be done at block level. For the upper level iterations across sliding windows, we would use the grid level indexing into CUDA resources. For more info on this block and grid format, please refer to page-18. PyCUDA also supports builtins for computing reductions like max and min, but we lose control, specifically we intend to use specialized memory like shared and constant memory for leveraging GPU at its near to optimum level.
Listing out the PyCUDA-NumPy solution code -
1] PyCUDA part -
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule
mod = SourceModule("""
#define TBP 1024 // THREADS_PER_BLOCK
__device__ void get_Bmax_Cmin(float* out, float *d1, float *d2, int L, int offset)
{
int tid = threadIdx.x;
int inv = TBP;
__shared__ float dS[TBP][2];
dS[tid][0] = d1[tid+offset];
dS[tid][1] = d2[tid+offset];
__syncthreads();
if(tid<L-TBP)
{
dS[tid][0] = fmaxf(dS[tid][0] , d1[tid+inv+offset]);
dS[tid][1] = fminf(dS[tid][1] , d2[tid+inv+offset]);
}
__syncthreads();
inv = inv/2;
while(inv!=0)
{
if(tid<inv)
{
dS[tid][0] = fmaxf(dS[tid][0] , dS[tid+inv][0]);
dS[tid][1] = fminf(dS[tid][1] , dS[tid+inv][1]);
}
__syncthreads();
inv = inv/2;
}
__syncthreads();
if(tid==0)
{
out[0] = dS[0][0];
out[1] = dS[0][1];
}
__syncthreads();
}
__global__ void main1(float* out, float *d0, float *d1, float *d2, float *d3, float *lowL, float *highL, int *BLOCKLEN)
{
int L = BLOCKLEN[0];
int tid = threadIdx.x;
int iterID = blockIdx.x;
float Bmax_Cmin[2];
int inv;
float Cmin, dif;
__shared__ float dS[TBP*2];
get_Bmax_Cmin(Bmax_Cmin, d1, d2, L, iterID);
Cmin = Bmax_Cmin[1];
dif = (Bmax_Cmin[0] - Cmin);
inv = TBP;
dS[tid] = (d0[tid+iterID] + d1[tid+iterID] + d2[tid+iterID] + d3[tid+iterID] - 4.0*Cmin) / (4.0*dif);
__syncthreads();
if(tid<L-TBP)
dS[tid+inv] = (d0[tid+inv+iterID] + d1[tid+inv+iterID] + d2[tid+inv+iterID] + d3[tid+inv+iterID] - 4.0*Cmin) / (4.0*dif);
dS[tid] = ((dS[tid] >= lowL[tid]) & (dS[tid] <= highL[tid])) ? 1 : 0;
__syncthreads();
if(tid<L-TBP)
dS[tid] += ((dS[tid+inv] >= lowL[tid+inv]) & (dS[tid+inv] <= highL[tid+inv])) ? 1 : 0;
__syncthreads();
inv = inv/2;
while(inv!=0)
{
if(tid<inv)
dS[tid] += dS[tid+inv];
__syncthreads();
inv = inv/2;
}
if(tid==0)
out[iterID] = dS[0];
__syncthreads();
}
""")
Please note that THREADS_PER_BLOCK, TBP
is to be set based on the batchSize
. The rule of thumb here is to assign power of 2 value to TBP
that is just lesser than batchSize
. Thus, for batchSize = 2000
, we needed TBP
as 1024
.
2] NumPy part -
def gpu_app_v1(A, B, C, D, batchSize, minimumLimit):
func1 = mod.get_function("main1")
outlen = len(A)-batchSize+1
# Set block and grid sizes
BSZ = (1024,1,1)
GSZ = (outlen,1)
dest = np.zeros(outlen).astype(np.float32)
N = np.int32(batchSize)
func1(drv.Out(dest), drv.In(A), drv.In(B), drv.In(C), drv.In(D), \
drv.In(data2b), drv.In(data2a),\
drv.In(N), block=BSZ, grid=GSZ)
idx = np.flatnonzero(dest >= minimumLimit)
return idx, dest[idx]
Benchmarking
I have tested on GTX 960M. Please note that PyCUDA expects arrays to be of contiguous order. So, we need to slice the columns and make copies. I am expecting/assuming that the data could be read from the files such that the data is spread along rows instead of being as columns. Thus, keeping those out of the benchmarking function for now.
Original approach -
def org_app(data1, batchSize, minimumLimit):
resultArray = []
for rowNr in range(data1.shape[0]-batchSize+1):
tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
result = doTheMath(tmp_df, data2a, data2b)
if (result >= minimumLimit):
resultArray.append([rowNr , result])
return resultArray
Timings and verification -
In [2]: #Declare variables
...: batchSize = 2000
...: sampleSize = 50000
...: resultArray = []
...: minimumLimit = 490 #use 400 on the real sample data
...:
...: #Create Random Sample Data
...: data1 = np.random.uniform(1, 100000, (sampleSize + batchSize, 4)).astype(np.float32)
...: data2b = np.random.uniform(0, 1, (batchSize)).astype(np.float32)
...: data2a = data2b + np.random.uniform(0, 1, (batchSize)).astype(np.float32)
...:
...: # Make column copies
...: A = data1[:,0].copy()
...: B = data1[:,1].copy()
...: C = data1[:,2].copy()
...: D = data1[:,3].copy()
...:
...: gpu_out1,gpu_out2 = gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
...: cpu_out1,cpu_out2 = np.array(org_app(data1, batchSize, minimumLimit)).T
...: print(np.allclose(gpu_out1, cpu_out1))
...: print(np.allclose(gpu_out2, cpu_out2))
...:
True
False
So, there's some differences between CPU and GPU countings. Let's investigate them -
In [7]: idx = np.flatnonzero(~np.isclose(gpu_out2, cpu_out2))
In [8]: idx
Out[8]: array([12776, 15208, 17620, 18326])
In [9]: gpu_out2[idx] - cpu_out2[idx]
Out[9]: array([-1., -1., 1., 1.])
There are four instances of non-matching counts. These are off at max by 1
. Upon research, I came across some information on this. Basically, since we are using math intrinsics for max and min computations and those I think are causing the last binary bit in the floating pt representation to be diferent than the CPU counterpart. This is termed as ULP error and has been discused in detail here and here.
Finally, puting the issue aside, let's get to the most important bit, the performance -
In [10]: %timeit org_app(data1, batchSize, minimumLimit)
1 loops, best of 3: 2.18 s per loop
In [11]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
10 loops, best of 3: 82.5 ms per loop
In [12]: 2180.0/82.5
Out[12]: 26.424242424242426
Let's try with bigger datasets. With sampleSize = 500000
, we get -
In [14]: %timeit org_app(data1, batchSize, minimumLimit)
1 loops, best of 3: 23.2 s per loop
In [15]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
1 loops, best of 3: 821 ms per loop
In [16]: 23200.0/821
Out[16]: 28.25822168087698
So, the speedup stays constant at around 27
.
Limitations :
1) We are using float32
numbers, as GPUs work best with those. Double precision specially on non-server GPUs aren't popular when it comes to performance and since you are working with such a GPU, I tested with float32.
Further improvement :
1) We could use faster constant memory
to feed in data2a
and data2b
, rather than use global memory
.
Tweak #1
Its usually advised to vectorize things when working with NumPy arrays. But with very large arrays, I think you are out of options there. So, to boost performance, a minor tweak is possible to optimize on the last step of summing.
We could replace the step that makes an array of 1s
and 0s
and does summing :
np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
with np.count_nonzero
that works efficiently to count True
values in a boolean array, instead of converting to 1s
and 0s
-
np.count_nonzero((abcd <= data2a) & (abcd >= data2b))
Runtime test -
In [45]: abcd = np.random.randint(11,99,(10000))
In [46]: data2a = np.random.randint(11,99,(10000))
In [47]: data2b = np.random.randint(11,99,(10000))
In [48]: %timeit np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
10000 loops, best of 3: 81.8 µs per loop
In [49]: %timeit np.count_nonzero((abcd <= data2a) & (abcd >= data2b))
10000 loops, best of 3: 28.8 µs per loop
Tweak #2
Use a pre-computed reciprocal when dealing with cases that undergo implicit broadcasting. Some more info here. Thus, store reciprocal of dif
and use that instead at the step :
((((A - Cmin) / dif) + ((B - Cmin) / dif) + ...
Sample test -
In [52]: A = np.random.rand(10000)
In [53]: dif = 0.5
In [54]: %timeit A/dif
10000 loops, best of 3: 25.8 µs per loop
In [55]: %timeit A*(1.0/dif)
100000 loops, best of 3: 7.94 µs per loop
You have four places using division by dif
. So, hopefully this would bring out noticeable boost there too!
Before you start tweaking the target (GPU) or using anything else (i.e. parallel executions ), you might want to consider how to improve the already existing code. You used the numba-tag so I'll use it to improve the code: First we operate on arrays not on matrices:
data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit
Each time you call doTheMath
you expect an integer back, however you use a lot of arrays and create a lot of intermediate arrays:
abcd = ((((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
This creates an intermediate array each step:
tmp1 = A-Cmin
, tmp2 = tmp1 / dif
, tmp3 = B - Cmin
, tmp4 = tmp3 / dif
However this is a reduce function (array -> integer) so having a lot of intermediate arrays is unnecessary weight, just calculate the value of the "fly".
import numba as nb
@nb.njit
def doTheMathNumba(tmpData, data2a, data2b):
Bmax = np.max(tmpData[:, 1])
Cmin = np.min(tmpData[:, 2])
diff = (Bmax - Cmin)
idiff = 1 / diff
sum_ = 0
for i in range(tmpData.shape[0]):
val = (tmpData[i, 0] + tmpData[i, 1] + tmpData[i, 2] + tmpData[i, 3]) / 4 * idiff - Cmin * idiff
if val <= data2a[i] and val >= data2b[i]:
sum_ += 1
return sum_
I did something else here to avoid multiple operations:
(((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4
= ((A - Cmin + B - Cmin + C - Cmin + D - Cmin) / dif) / 4
= (A + B + C + D - 4 * Cmin) / (4 * dif)
= (A + B + C + D) / (4 * dif) - (Cmin / dif)
This actually cuts down the execution time by almost a factor of 10 on my computer:
%timeit doTheMath(tmp_df, data2a, data2b) # 1000 loops, best of 3: 446 µs per loop
%timeit doTheMathNumba(tmp_df, data2a, data2b) # 10000 loops, best of 3: 59 µs per loop
There are certainly also other improvements, for example using a rolling min/max to calculate Bmax
and Cmin
, that would make at least part of the calculation run in O(sampleSize)
instead of O(samplesize * batchsize)
. This would also make it possible to reuse some of the (A + B + C + D) / (4 * dif) - (Cmin / dif)
calculations because if Cmin
and Bmax
don't change for the next sample these values don't differ. It's a bit complicated to do because the comparisons differ. But definitely possible! See here:
import time
import numpy as np
import numba as nb
@nb.njit
def doTheMathNumba(abcd, data2a, data2b, Bmax, Cmin):
diff = (Bmax - Cmin)
idiff = 1 / diff
quarter_idiff = 0.25 * idiff
sum_ = 0
for i in range(abcd.shape[0]):
val = abcd[i] * quarter_idiff - Cmin * idiff
if val <= data2a[i] and val >= data2b[i]:
sum_ += 1
return sum_
@nb.njit
def doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, resultArray):
found = 0
for rowNr in range(data1.shape[0]):
if(abcd[rowNr:rowNr + batchSize].shape[0] == batchSize):
result = doTheMathNumba(abcd[rowNr:rowNr + batchSize],
data2a, data2b, Bmaxs[rowNr], Cmins[rowNr])
if (result >= minimumLimit):
resultArray[found, 0] = rowNr
resultArray[found, 1] = result
found += 1
return resultArray[:found]
#Declare variables
batchSize = 2000
sampleSize = 50000
resultArray = []
minimumLimit = 490 #use 400 on the real sample data
data1 = np.array(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
data2a = np.array(np.random.uniform(0, 1, batchSize)) #upper limit
data2b = np.array(np.random.uniform(0, 1, batchSize)) #lower limit
from scipy import ndimage
t0 = time.time()
abcd = np.sum(data1, axis=1)
Bmaxs = ndimage.maximum_filter1d(data1[:, 1],
size=batchSize,
origin=-((batchSize-1)//2-1)) # correction for even shapes
Cmins = ndimage.minimum_filter1d(data1[:, 2],
size=batchSize,
origin=-((batchSize-1)//2-1))
result = np.zeros((sampleSize, 2), dtype=np.int64)
doloop(data1, data2a, data2b, abcd, Bmaxs, Cmins, batchSize, sampleSize, minimumLimit, result)
print('Runtime:', time.time() - t0)
This gives me a Runtime: 0.759593152999878
(after numba compiled the functions!), while your original took had Runtime: 24.68975639343262
. Now we're 30 times faster!
With your sample size it still takes Runtime: 60.187848806381226
but that's not too bad, right?
And even if I haven't done this myself, numba says that it's possible to write "Numba for CUDA GPUs" and it doesn't seem to complicated.
Here is some code to demonstrate what is possible by just tweaking the algorithm. It's pure numpy but on the sample data you posted gives a roughly 35x speedup over the original version (~1,000,000 samples in ~2.5sec on my rather modest machine):
>>> result_dict = master('run')
[('load', 0.82578349113464355), ('precomp', 0.028138399124145508), ('max/min', 0.24333405494689941), ('ABCD', 0.015314102172851562), ('main', 1.3356468677520752)]
TOTAL 2.44821691513
Tweaks used:
These are more or less routine. That leaves the comparison with data2a/b which is expensive O(NK) where N is the number of samples and K is the size of the window. Here one can take advantage of the relatively well-behaved data. Using the running min/max one can create variants of data2a/b that can be used to test a range of window offsets at a time, if the test fails all these offsets can be ruled out immediately, otherwise the range is bisected.
import numpy as np
import time
# global variables; they will hold the precomputed pre-screening filters
preA, preB = {}, {}
CHUNK_SIZES = None
def sliding_argmax(data, K=2000):
"""compute the argmax of data over a sliding window of width K
returns:
indices -- indices into data
switches -- window offsets at which the maximum changes
(strictly speaking: where the index of the maximum changes)
excludes 0 but includes maximum offset (len(data)-K+1)
see last line of compute_pre_screening_filter for a recipe to convert
this representation to the vector of maxima
"""
N = len(data)
last = np.argmax(data[:K])
indices = [last]
while indices[-1] <= N - 1:
ge = np.where(data[last + 1 : last + K + 1] > data[last])[0]
if len(ge) == 0:
if last + K >= N:
break
last += 1 + np.argmax(data[last + 1 : last + K + 1])
indices.append(last)
else:
last += 1 + ge[0]
indices.append(last)
indices = np.array(indices)
switches = np.where(data[indices[1:]] > data[indices[:-1]],
indices[1:] + (1-K), indices[:-1] + 1)
return indices, np.r_[switches, [len(data)-K+1]]
def compute_pre_screening_filter(bound, n_offs):
"""compute pre-screening filter for point-wise upper bound
given a K-vector of upper bounds B and K+n_offs-1-vector data
compute K+n_offs-1-vector filter such that for each index j
if for any offset 0 <= o < n_offs and index 0 <= i < K such that
o + i = j, the inequality B_i >= data_j holds then filter_j >= data_j
therefore the number of data points below filter is an upper bound for
the maximum number of points below bound in any K-window in data
"""
pad_l, pad_r = np.min(bound[:n_offs-1]), np.min(bound[1-n_offs:])
padded = np.r_[pad_l+np.zeros(n_offs-1,), bound, pad_r+np.zeros(n_offs-1,)]
indices, switches = sliding_argmax(padded, n_offs)
return padded[indices].repeat(np.diff(np.r_[[0], switches]))
def compute_all_pre_screening_filters(upper, lower, min_chnk=5, dyads=6):
"""compute upper and lower pre-screening filters for data blocks of
sizes K+n_offs-1 where
n_offs = min_chnk, 2min_chnk, ..., 2^(dyads-1)min_chnk
the result is stored in global variables preA and preB
"""
global CHUNK_SIZES
CHUNK_SIZES = min_chnk * 2**np.arange(dyads)
preA[1] = upper
preB[1] = lower
for n in CHUNK_SIZES:
preA[n] = compute_pre_screening_filter(upper, n)
preB[n] = -compute_pre_screening_filter(-lower, n)
def test_bounds(block, counts, threshold=400):
"""test whether the windows fitting in the data block 'block' fall
within the bounds using pre-screening for efficient bulk rejection
array 'counts' will be overwritten with the counts of compliant samples
note that accurate counts will only be returned for above threshold
windows, because the analysis of bulk rejected windows is short-circuited
also note that bulk rejection only works for 'well behaved' data and
for example not on random numbers
"""
N = len(counts)
K = len(preA[1])
r = N % CHUNK_SIZES[0]
# chop up N into as large as possible chunks with matching pre computed
# filters
# start with small and work upwards
counts[:r] = [np.count_nonzero((block[l:l+K] <= preA[1]) &
(block[l:l+K] >= preB[1]))
for l in range(r)]
def bisect(block, counts):
M = len(counts)
cnts = np.count_nonzero((block <= preA[M]) & (block >= preB[M]))
if cnts < threshold:
counts[:] = cnts
return
elif M == CHUNK_SIZES[0]:
counts[:] = [np.count_nonzero((block[l:l+K] <= preA[1]) &
(block[l:l+K] >= preB[1]))
for l in range(M)]
else:
M //= 2
bisect(block[:-M], counts[:M])
bisect(block[M:], counts[M:])
N = N // CHUNK_SIZES[0]
for M in CHUNK_SIZES:
if N % 2:
bisect(block[r:r+M+K-1], counts[r:r+M])
r += M
elif N == 0:
return
N //= 2
else:
for j in range(2*N):
bisect(block[r:r+M+K-1], counts[r:r+M])
r += M
def analyse(data, use_pre_screening=True, min_chnk=5, dyads=6,
threshold=400):
samples, upper, lower = data
N, K = samples.shape[0], upper.shape[0]
times = [time.time()]
if use_pre_screening:
compute_all_pre_screening_filters(upper, lower, min_chnk, dyads)
times.append(time.time())
# compute switching points of max and min for running normalisation
upper_inds, upper_swp = sliding_argmax(samples[:, 1], K)
lower_inds, lower_swp = sliding_argmax(-samples[:, 2], K)
times.append(time.time())
# sum columns
ABCD = samples.sum(axis=-1)
times.append(time.time())
counts = np.empty((N-K+1,), dtype=int)
# main loop
# loop variables:
offs = 0
u_ind, u_scale, u_swp = 0, samples[upper_inds[0], 1], upper_swp[0]
l_ind, l_scale, l_swp = 0, samples[lower_inds[0], 2], lower_swp[0]
while True:
# check which is switching next, min(C) or max(B)
if u_swp > l_swp:
# greedily take the largest block possible such that dif and Cmin
# do not change
block = (ABCD[offs:l_swp+K-1] - 4*l_scale) \
* (0.25 / (u_scale-l_scale))
if use_pre_screening:
test_bounds(block, counts[offs:l_swp], threshold=threshold)
else:
counts[offs:l_swp] = [
np.count_nonzero((block[l:l+K] <= upper) &
(block[l:l+K] >= lower))
for l in range(l_swp - offs)]
# book keeping
l_ind += 1
offs = l_swp
l_swp = lower_swp[l_ind]
l_scale = samples[lower_inds[l_ind], 2]
else:
block = (ABCD[offs:u_swp+K-1] - 4*l_scale) \
* (0.25 / (u_scale-l_scale))
if use_pre_screening:
test_bounds(block, counts[offs:u_swp], threshold=threshold)
else:
counts[offs:u_swp] = [
np.count_nonzero((block[l:l+K] <= upper) &
(block[l:l+K] >= lower))
for l in range(u_swp - offs)]
u_ind += 1
if u_ind == len(upper_inds):
assert u_swp == N-K+1
break
offs = u_swp
u_swp = upper_swp[u_ind]
u_scale = samples[upper_inds[u_ind], 1]
times.append(time.time())
return {'counts': counts, 'valid': np.where(counts >= 400)[0],
'timings': np.diff(times)}
def master(mode='calibrate', data='fake', use_pre_screening=True, nrep=3,
min_chnk=None, dyads=None):
t = time.time()
if data in ('fake', 'load'):
data1 = np.loadtxt('data1.csv', delimiter=';', skiprows=1,
usecols=[1,2,3,4])
data2a = np.loadtxt('data2a.csv', delimiter=';', skiprows=1,
usecols=[1])
data2b = np.loadtxt('data2b.csv', delimiter=';', skiprows=1,
usecols=[1])
if data == 'fake':
data1 = np.tile(data1, (10, 1))
threshold = 400
elif data == 'random':
data1 = np.random.random((102000, 4))
data2b = np.random.random(2000)
data2a = np.random.random(2000)
threshold = 490
if use_pre_screening or mode == 'calibrate':
print('WARNING: pre-screening not efficient on artificial data')
else:
raise ValueError("data mode {} not recognised".format(data))
data = data1, data2a, data2b
t_load = time.time() - t
if mode == 'calibrate':
min_chnk = (2, 3, 4, 5, 6) if min_chnk is None else min_chnk
dyads = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) if dyads is None else dyads
timings = np.zeros((len(min_chnk), len(dyads)))
print('max bisect ' + ' '.join([
' n.a.' if dy == 0 else '{:7d}'.format(dy) for dy in dyads]),
end='')
for i, mc in enumerate(min_chnk):
print('\nmin chunk {}'.format(mc), end=' ')
for j, dy in enumerate(dyads):
for k in range(nrep):
if dy == 0: # no pre-screening
timings[i, j] += analyse(
data, False, mc, dy, threshold)['timings'][3]
else:
timings[i, j] += analyse(
data, True, mc, dy, threshold)['timings'][3]
timings[i, j] /= nrep
print('{:7.3f}'.format(timings[i, j]), end=' ', flush=True)
best_mc, best_dy = np.unravel_index(np.argmin(timings.ravel()),
timings.shape)
print('\nbest', min_chnk[best_mc], dyads[best_dy])
return timings, min_chnk[best_mc], dyads[best_dy]
if mode == 'run':
min_chnk = 2 if min_chnk is None else min_chnk
dyads = 5 if dyads is None else dyads
res = analyse(data, use_pre_screening, min_chnk, dyads, threshold)
times = np.r_[[t_load], res['timings']]
print(list(zip(('load', 'precomp', 'max/min', 'ABCD', 'main'), times)))
print('TOTAL', times.sum())
return res
This is technically off-topic (not GPU) but I'm sure you'll be interested.
There is one obvious and rather large saving:
Precompute A + B + C + D
(not in the loop, on the whole data: data1.sum(axis=-1)
), because abcd = ((A+B+C+D) - 4Cmin) / (4dif)
. This will save quite a few ops.
Surprised nobody spotted that one before ;-)
Edit:
There is another thing, though I suspect that's only in your example, not in your real data:
As it stands roughly half of data2a
will be smaller than data2b
. In these places your conditions on abcd
cannot be both True, so you needn't even compute abcd
there.
Edit:
One more tweak I used below but forgot to mention: If you compute the max (or min) over a moving window. When you move one point to the right, say, how likely is the max to change? There are only two things that can change it: the new point on the right is larger (happens roughly once in windowlength times, and even if it happens, you immediately know the new max) or the old max falls off the window on the left (also happens roughly once in windowlength times). Only in this last case you have to search the entire window for the new max.
Edit:
Couldn't resist giving it a try in tensorflow. I don't have a GPU, so you yourself have to test it for speed. Put "gpu" for "cpu" on the marked line.
On cpu it is about half as fast as your original implementation (i.e. without Divakar's tweaks). Note that I've taken the liberty of changing the inputs from matrix to plain array. Currently tensorflow is a bit of a moving target, so make sure you have the right version. I used Python3.6 and tf 0.12.1 If you do a pip3 install tensorflow-gpu today it should might work.
import numpy as np
import time
import tensorflow as tf
# currently the max/min code is sequential
# thus
parallel_iterations = 1
# but you can put this in a separate loop, precompute and then try and run
# the remainder of doTheMathTF with a larger parallel_iterations
# tensorflow is quite capricious about its data types
ddf = tf.float64
ddi = tf.int32
def worker(data1, data2a, data2b):
###################################
# CHANGE cpu to gpu in next line! #
###################################
with tf.device('/cpu:0'):
g = tf.Graph ()
with g.as_default():
ABCD = tf.constant(data1.sum(axis=-1), dtype=ddf)
B = tf.constant(data1[:, 1], dtype=ddf)
C = tf.constant(data1[:, 2], dtype=ddf)
window = tf.constant(len(data2a))
N = tf.constant(data1.shape[0] - len(data2a) + 1, dtype=ddi)
data2a = tf.constant(data2a, dtype=ddf)
data2b = tf.constant(data2b, dtype=ddf)
def doTheMathTF(i, Bmax, Bmaxind, Cmin, Cminind, out):
# most of the time we can keep the old max/min
Bmaxind = tf.cond(Bmaxind<i,
lambda: i + tf.to_int32(
tf.argmax(B[i:i+window], axis=0)),
lambda: tf.cond(Bmax>B[i+window-1],
lambda: Bmaxind,
lambda: i+window-1))
Cminind = tf.cond(Cminind<i,
lambda: i + tf.to_int32(
tf.argmin(C[i:i+window], axis=0)),
lambda: tf.cond(Cmin<C[i+window-1],
lambda: Cminind,
lambda: i+window-1))
Bmax = B[Bmaxind]
Cmin = C[Cminind]
abcd = (ABCD[i:i+window] - 4 * Cmin) * (1 / (4 * (Bmax-Cmin)))
out = out.write(i, tf.to_int32(
tf.count_nonzero(tf.logical_and(abcd <= data2a,
abcd >= data2b))))
return i + 1, Bmax, Bmaxind, Cmin, Cminind, out
with tf.Session(graph=g) as sess:
i, Bmaxind, Bmax, Cminind, Cmin, out = tf.while_loop(
lambda i, _1, _2, _3, _4, _5: i<N, doTheMathTF,
(tf.Variable(0, dtype=ddi), tf.Variable(0.0, dtype=ddf),
tf.Variable(-1, dtype=ddi),
tf.Variable(0.0, dtype=ddf), tf.Variable(-1, dtype=ddi),
tf.TensorArray(ddi, size=N)),
shape_invariants=None,
parallel_iterations=parallel_iterations,
back_prop=False)
out = out.pack()
sess.run(tf.initialize_all_variables())
out, = sess.run((out,))
return out
#Declare variables
batchSize = 2000
sampleSize = 50000#00
resultArray = []
#Create Sample Data
data1 = np.random.uniform(1, 100, (sampleSize + batchSize, 4))
data2a = np.random.uniform(0, 1, (batchSize,))
data2b = np.random.uniform(0, 1, (batchSize,))
t0 = time.time()
out = worker(data1, data2a, data2b)
print('Runtime (tensorflow):', time.time() - t0)
good_indices, = np.where(out >= 490)
res_tf = np.c_[good_indices, out[good_indices]]
def doTheMath(tmpData1, data2a, data2b):
A = tmpData1[:, 0]
B = tmpData1[:,1]
C = tmpData1[:,2]
D = tmpData1[:,3]
Bmax = B.max()
Cmin = C.min()
dif = (Bmax - Cmin)
abcd = ((((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
#Loop through the data
t0 = time.time()
for rowNr in range(sampleSize+1):
tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
result = doTheMath(tmp_df, data2a, data2b)
if (result >= 490):
resultArray.append([rowNr , result])
print('Runtime (original):', time.time() - t0)
print(np.alltrue(np.array(resultArray)==res_tf))