I\'m using keras with tensorflow backend on a computer with a nvidia Tesla K20c GPU. (CUDA 8)
I\'m tranining a relatively simple Convolutional Neural Network, during tra
Could be due to several reasons but most likely you're having a bottleneck when reading the training data. As your GPU has processed a batch it requires more data. Depending on your implementation this can cause the GPU to wait for the CPU to load more data resulting in a lower GPU usage and also a longer training time.
Try loading all data into memory if it fits or use a QueueRunner which will make an input pipeline reading data in the background. This will reduce the time that your GPU is waiting for more data.
The Reading Data Guide on the TensorFlow website contains more information.
Low GPU utilization might be due to the small batch size. Keras has a habit of occupying the whole memory size whether, for example, you use batch size x or batch size 2x. Try using a bigger batch size if possible and see if it changes.
You should find the bottleneck:
On windows use Task-Manager> Performance to monitor how you are using your resources
On Linux use nmon, nvidia-smi, and htop to monitor your resources.
The most possible scenarios are:
If you have a huge dataset, take a look at the disk read/write rates; if you are accessing your hard-disk frequently, most probably you need to change they way you are dealing with the dataset to reduce number of disk access
Use the memory to pre-load everything as much as possible.
If you are using a restful API or any similar services, make sure that you do not wait much for receiving what you need. For restful services, the number of requests per second might be limited (check your network usage via nmon/Task manager)
Make sure you do not use swap space in any case!
Reduce the overhead of preprocessing by any means (e.g. using cache, faster libraries, etc.)
Play with the bach_size (however, it is said that higher values (>512) for batch size might have negative effects on accuracy)
Measuring GPU performance and utilization is not as straightforward as CPU or Memory. GPU is an extreme parallel processing unit and there are many factors. The GPU utilization number shown by nvidia-smi means what percentage of the time at least one gpu multiprocessing group was active. If this number is 0, it is a sign that none of your GPU is being utilized but if this number is 100 does not mean that the GPU is being used at its full potential.
These two articles have lots of interesting information on this topic: https://www.imgtec.com/blog/a-quick-guide-to-writing-opencl-kernels-for-rogue/ https://www.imgtec.com/blog/measuring-gpu-compute-performance/
The reason may be that your network is "relatively simple". I had a MNIST network with 60k training examples.
with 100 neurons in 1 hidden layer, CPU training was faster and GPU utilization on GPU training was around 10%
with 2 hidden layers, 2000 neurons each, GPU was significantly faster(24s vs 452s on CPU) and its utilization was around 39%
I have a pretty old PC (24GB DDR3-1333, i7 3770k) but a modern graphic card(RTX 2070 + SSDs if that matters) so there is a memory-GPU data transfer bottleneck.
I'm not yet sure how much room for improvement is here. I'd have to train a bigger network and compare it with better CPU/memory configuration + same GPU.
I guess that for smaller networks it doesn't matter that much anyway because they are relatively easy for the CPU.