Can someone help me rewrite this one function (the doTheMath
function) to do the calculations on the GPU? I used a few good days now trying to get my head
This is technically off-topic (not GPU) but I'm sure you'll be interested.
There is one obvious and rather large saving:
Precompute A + B + C + D
(not in the loop, on the whole data: data1.sum(axis=-1)
), because abcd = ((A+B+C+D) - 4Cmin) / (4dif)
. This will save quite a few ops.
Surprised nobody spotted that one before ;-)
Edit:
There is another thing, though I suspect that's only in your example, not in your real data:
As it stands roughly half of data2a
will be smaller than data2b
. In these places your conditions on abcd
cannot be both True, so you needn't even compute abcd
there.
Edit:
One more tweak I used below but forgot to mention: If you compute the max (or min) over a moving window. When you move one point to the right, say, how likely is the max to change? There are only two things that can change it: the new point on the right is larger (happens roughly once in windowlength times, and even if it happens, you immediately know the new max) or the old max falls off the window on the left (also happens roughly once in windowlength times). Only in this last case you have to search the entire window for the new max.
Edit:
Couldn't resist giving it a try in tensorflow. I don't have a GPU, so you yourself have to test it for speed. Put "gpu" for "cpu" on the marked line.
On cpu it is about half as fast as your original implementation (i.e. without Divakar's tweaks). Note that I've taken the liberty of changing the inputs from matrix to plain array. Currently tensorflow is a bit of a moving target, so make sure you have the right version. I used Python3.6 and tf 0.12.1 If you do a pip3 install tensorflow-gpu today it should might work.
import numpy as np
import time
import tensorflow as tf
# currently the max/min code is sequential
# thus
parallel_iterations = 1
# but you can put this in a separate loop, precompute and then try and run
# the remainder of doTheMathTF with a larger parallel_iterations
# tensorflow is quite capricious about its data types
ddf = tf.float64
ddi = tf.int32
def worker(data1, data2a, data2b):
###################################
# CHANGE cpu to gpu in next line! #
###################################
with tf.device('/cpu:0'):
g = tf.Graph ()
with g.as_default():
ABCD = tf.constant(data1.sum(axis=-1), dtype=ddf)
B = tf.constant(data1[:, 1], dtype=ddf)
C = tf.constant(data1[:, 2], dtype=ddf)
window = tf.constant(len(data2a))
N = tf.constant(data1.shape[0] - len(data2a) + 1, dtype=ddi)
data2a = tf.constant(data2a, dtype=ddf)
data2b = tf.constant(data2b, dtype=ddf)
def doTheMathTF(i, Bmax, Bmaxind, Cmin, Cminind, out):
# most of the time we can keep the old max/min
Bmaxind = tf.cond(BmaxindB[i+window-1],
lambda: Bmaxind,
lambda: i+window-1))
Cminind = tf.cond(Cminind= data2b))))
return i + 1, Bmax, Bmaxind, Cmin, Cminind, out
with tf.Session(graph=g) as sess:
i, Bmaxind, Bmax, Cminind, Cmin, out = tf.while_loop(
lambda i, _1, _2, _3, _4, _5: i= 490)
res_tf = np.c_[good_indices, out[good_indices]]
def doTheMath(tmpData1, data2a, data2b):
A = tmpData1[:, 0]
B = tmpData1[:,1]
C = tmpData1[:,2]
D = tmpData1[:,3]
Bmax = B.max()
Cmin = C.min()
dif = (Bmax - Cmin)
abcd = ((((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()
#Loop through the data
t0 = time.time()
for rowNr in range(sampleSize+1):
tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
result = doTheMath(tmp_df, data2a, data2b)
if (result >= 490):
resultArray.append([rowNr , result])
print('Runtime (original):', time.time() - t0)
print(np.alltrue(np.array(resultArray)==res_tf))