Question
I am looking for a way to speed up the computation of bincount using the GPU.
Reference code in NumPy:
x_new = numpy.random.randint(0, 1000, 1000000)
%timeit numpy.bincount(x_new)
100 loops, best of 3: 2.33 ms per loop
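For reference, bincount is essentially a scatter-add (histogram) pattern, which is exactly the kind of operation GPUs parallelize with atomic adds. A NumPy sketch of the equivalence, using np.add.at for the unbuffered accumulation:

```python
import numpy as np

x = np.random.randint(0, 1000, 1000000)

# bincount as an unbuffered scatter-add: each element of x increments
# one bucket. np.add.at accumulates repeated indices correctly,
# unlike plain fancy-index assignment (counts[x] += 1).
counts = np.zeros(x.max() + 1, dtype=np.intp)
np.add.at(counts, x, 1)

assert np.array_equal(counts, np.bincount(x))
```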
I want to measure only the speed of the operation, not the time spent passing the array, so I create a shared variable:
x = theano.shared(numpy.random.randint(0, 1000, 1000000))
theano_bincount = theano.function([], T.extra_ops.bincount(x))
This operation is of course highly parallelizable, but in practice this code is several times slower on the GPU than the CPU version:
%timeit theano_bincount()
10 loops, best of 3: 25.7 ms per loop
So my questions are:
- What could be the reason for such low performance?
- Can I write parallel version of bincount using theano?
Answer 1:
I think you cannot speed this operation up on the GPU any further unless you can somehow manually tell Theano to do it in a parallelized manner, which does not seem to be possible. On the GPU, computations that are not parallelized run at the same speed as, or slower than, on the CPU.
Quote from Daniel Renshaw:
To an extent, Theano expects you to focus more on what you want computed rather than on how you want it computed. The idea is that the Theano optimizing compiler will automatically parallelize as much as possible (either on GPU or on CPU using OpenMP).
And another quote:
You need to be able to specify your computation in terms of Theano operations. If those operations can be parallelized on the GPU, they should be parallelized automatically.
Quote from Theano's webpage:
- Indexing, dimension-shuffling and constant-time reshaping will be equally fast on GPU as on CPU.
- Summation over rows/columns of tensors can be a little slower on the GPU than on the CPU.
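One way to express bincount purely in terms of operations that do parallelize well (an elementwise comparison followed by a column sum) is a dense one-hot formulation. It is memory-hungry (an n × k intermediate), but it maps onto GPU-friendly ops; a NumPy sketch (a Theano version would use T.eq and T.sum in the same shape):

```python
import numpy as np

# Smaller n than the question's benchmark: the intermediate is n x k.
x = np.random.randint(0, 1000, 10000)
k = 1000

# Compare every element against every bucket id (n x k one-hot matrix),
# then reduce over rows. Both steps are elementwise/reduction operations
# that a GPU backend can parallelize automatically.
one_hot = (x[:, None] == np.arange(k))
counts = one_hot.sum(axis=0)

assert np.array_equal(counts, np.bincount(x, minlength=k))
```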
I think the only thing you can do is to set the openmp flag to True in your .theanorc file.
Anyway, I tried an idea. It does not work for now, but hopefully someone can help us make it work; if it did, you might be able to parallelize the operation on the GPU. The code below tries to do everything on the GPU through the CUDA API. However, there are two bottlenecks preventing the operation from taking place: 1) currently (as of Jan. 4th, 2016) Theano and CUDA do not support operations on any data type other than float32, and 2) T.extra_ops.bincount() only works with int data types. So this may be why Theano cannot fully parallelize the operation.
import theano.tensor as T
from theano import shared, Out, function
import numpy as np
import theano.sandbox.cuda.basic_ops as sbasic

# Store the data as float32 (the only dtype the old CUDA backend
# supports), borrowing to avoid an extra copy.
shared_var = shared(np.random.randint(0, 1000, 1000000).astype(T.config.floatX), borrow=True)
x = T.vector('x')

# Cast back to an int dtype for bincount and push the variable onto
# the GPU; this cast is where the dtype limitation bites.
computeFunc = T.extra_ops.bincount(sbasic.as_cuda_ndarray_variable(T.cast(x, 'int16')))
func = function([], Out(sbasic.gpu_from_host(computeFunc), borrow=True), givens={x: shared_var})
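As a side note, the int-only restriction on bincount holds in plain NumPy as well, which is what forces the int16 cast in the snippet above; a quick check:

```python
import numpy as np

x_float = np.random.randint(0, 1000, 1000).astype(np.float32)

# bincount rejects float input outright...
try:
    np.bincount(x_float)
    raised = False
except TypeError:
    raised = True
assert raised

# ...so the array must be cast back to an integer dtype first.
counts = np.bincount(x_float.astype(np.int64))
assert counts.sum() == 1000
```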
Sources
1- How do I set many elements in parallel in theano
2- http://deeplearning.net/software/theano/tutorial/using_gpu.html#what-can-be-accelerated-on-the-gpu
3- http://deeplearning.net/software/theano/tutorial/multi_cores.html
Source: https://stackoverflow.com/questions/34520471/how-to-force-theano-to-parallelize-an-operation-on-gpu-test-case-numpy-bincoun