I am trying to train my model on a GPU instead of the CPU on an AWS p2.xlarge instance from my Jupyter Notebook. I am using the tensorflow-gpu backend (only tensorflow-gpu is installed, not plain tensorflow).
That happens because you're using LSTM layers.
TensorFlow's implementation of LSTM layers is not very efficient on GPUs. The likely reason is that recurrent calculations are sequential rather than parallel (each timestep depends on the previous one), while GPUs excel at parallel processing.
My own experiments confirmed this, and this article about using GPUs with TensorFlow reports the same behaviour.
You may try the new CuDNNLSTM layer, which is implemented specifically for NVIDIA GPUs via cuDNN.
I never tested it myself, but you'll most probably get much better performance with it.
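As a minimal sketch (the layer sizes, input shape, and loss are placeholders, not taken from your model), swapping the layer class is usually all that is needed:

from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense  # CuDNNLSTM requires Keras >= 2.0.9 and the tensorflow-gpu backend

# Hypothetical shapes -- replace with the ones from your own data
timesteps, features, n_classes = 100, 8, 3

model = Sequential()
# Drop-in replacement for LSTM; note it does not accept activation or
# recurrent_dropout arguments (it uses the fixed cuDNN implementation)
model.add(CuDNNLSTM(64, input_shape=(timesteps, features)))
model.add(Dense(n_classes, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')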
Another thing that I haven't tested, and I'm not sure it was designed for this reason, but I suspect it was: you can set unroll=True in your LSTM layers. With unrolling, the recurrent loop is expanded into a flat graph, which gives the GPU more operations it can run in parallel, at the cost of more memory and a fixed number of timesteps.
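A small sketch of that idea (the layer size and shapes are again placeholders); because unrolling materializes every timestep in the graph, it only makes sense for fairly short sequences:

from keras.layers import LSTM

timesteps, features = 100, 8  # placeholders -- use the shape of your own data

# unroll=True replaces the symbolic loop over timesteps with a flat graph,
# which can expose more parallelism to the GPU but increases memory use
# and requires the timestep dimension to be fixed.
lstm_layer = LSTM(64, unroll=True, input_shape=(timesteps, features))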
Try a bigger value for batch_size in model.fit, because the default is only 32. Increase it until you reach 100% GPU utilization (you can watch it with nvidia-smi).
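For example, assuming model, x_train and y_train stand for your compiled model and training arrays (256 is just a starting point, not a recommendation from the answer -- keep doubling it while GPU memory allows):

# Default batch_size is 32; larger batches give the GPU more parallel work per step
model.fit(x_train, y_train, epochs=10, batch_size=256, validation_split=0.25)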
Following the suggestion from @dgumo, you can also put your data into /run/shm. This is an in-memory file system, which gives the fastest possible access to the data. Alternatively, you can at least make sure your data resides on an SSD, for example in /tmp.
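A small sketch of that idea in Python (the file names and array keys are hypothetical; the point is just to copy the data once and then always read it from the in-memory path):

import os
import shutil
import numpy as np

src = "data/train.npz"        # hypothetical location of your dataset on disk
dst = "/run/shm/train.npz"    # /run/shm is backed by RAM (tmpfs)

if not os.path.exists(dst):
    shutil.copy(src, dst)

with np.load(dst) as data:    # subsequent reads come from memory, not disk
    X_np, y_np = data["X"], data["y"]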
The bottleneck in your case is transferring data to and from the GPU. The best way to speed up your computation (and maximize your GPU usage) is to load as much of your data as your memory can hold at once. Since you have plenty of memory, you can pass all of your data in a single batch by doing:
model.fit(X_np, y_np, epochs=100, validation_split=0.25, batch_size=X_np.shape[0])
(You should probably also increase the number of epochs when doing this, since each epoch is now only a single gradient update.)
Note, however, that mini-batching has advantages of its own (for example, the gradient noise can help escape poor local minima), so you should probably choose a batch_size somewhere in between; see the sketch below.
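For instance, something like this (the exact value is a judgment call; powers of two that still fit in GPU memory are a common choice):

# A compromise between one huge batch and the default of 32
model.fit(X_np, y_np, epochs=100, validation_split=0.25, batch_size=1024)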