I have a 4 GPU machine on which I run Tensorflow (GPU) with Keras. Some of my classification problems take several hours to complete.
nvidia-smi shows Volatile GPU-Util at around 0% even while training is running.
I had a similar issue - the memory of all the GPUs was allocated by Keras, but Volatile GPU-Util was around 0% and training was taking almost the same amount of time as on the CPU. I was using ImageDataGenerator, which turned out to be the bottleneck. When I increased the number of workers in the fit_generator method from the default value of 1 to the number of available CPU cores, training time dropped dramatically. A minimal sketch of that change is shown below.
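This is a minimal sketch, not my exact code: the directory path, image size, and the compiled `model` are placeholders you would replace with your own. The key change is passing `workers=multiprocessing.cpu_count()` to `fit_generator` so batch preparation runs in parallel instead of on a single process:

    import multiprocessing
    from keras.preprocessing.image import ImageDataGenerator

    train_datagen = ImageDataGenerator(rescale=1. / 255, horizontal_flip=True)
    train_generator = train_datagen.flow_from_directory(
        'data/train',            # placeholder path to your training images
        target_size=(224, 224),  # placeholder image size
        batch_size=32,
        class_mode='categorical')

    # 'model' is assumed to be a compiled Keras model.
    # Using all CPU cores to prepare batches keeps the GPUs fed with data
    # instead of leaving them idle while images are loaded and augmented.
    model.fit_generator(
        train_generator,
        steps_per_epoch=len(train_generator),
        epochs=10,
        workers=multiprocessing.cpu_count())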
You can also load the data into memory and then use the flow method to prepare batches of augmented images, as in the sketch below.
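A self-contained sketch of that approach, using random arrays and a tiny model purely as stand-ins for your own data and network:

    import multiprocessing
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Conv2D, Flatten, Dense
    from keras.preprocessing.image import ImageDataGenerator

    # Dummy in-memory dataset; replace with your real arrays.
    x_train = np.random.rand(256, 64, 64, 3).astype('float32')
    y_train = np.eye(10)[np.random.randint(0, 10, size=256)]

    # Tiny placeholder model just so the example runs end to end.
    model = Sequential([
        Conv2D(8, 3, activation='relu', input_shape=(64, 64, 3)),
        Flatten(),
        Dense(10, activation='softmax')])
    model.compile(optimizer='adam', loss='categorical_crossentropy')

    datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)

    # flow() yields augmented batches straight from the in-memory arrays,
    # so there are no disk reads during training.
    model.fit_generator(
        datagen.flow(x_train, y_train, batch_size=32),
        steps_per_epoch=len(x_train) // 32,
        epochs=5,
        workers=multiprocessing.cpu_count())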