Question
I'm using the pre-built AI Platform Jupyter Notebook instances to train a model with a single Tesla K80 card. The issue is that I don't believe the model is actually training on the GPU.
nvidia-smi
returns the following during training:
No Running Processes Found
Note the "No Running Processes Found", yet "Volatile GPU Usage" is 100%. Something seems strange...
...And the training is excruciatingly slow.
A few days ago, I was having issues with the GPU not being released after each notebook run. When this occurred, I would receive an OOM (out-of-memory) error. This required me to go into the console every time, find the PID of the process holding the GPU, and kill -9 it before re-running the notebook. Today, however, I can't get the GPU to run at all; it never shows a running process.
I've tried two different GCP AI Platform Notebook instances (both of the available TensorFlow version options) with no luck. Am I missing something with these "pre-built" instances?
[Screenshot: Pre-Built AI Platform Notebook section]
Just to clarify, I did not build my own instance and then install Jupyter on it myself. Instead, I used the built-in Notebook instance option under the AI Platform submenu.
Do I still need to configure a setting somewhere or install a library to keep using (or reset) my chosen GPU? I was under the impression that the virtual machine already comes loaded with the NVIDIA stack and should be plug-and-play with GPUs.
Thoughts?
EDIT: Here is a full video of the issue as requested --> https://www.youtube.com/watch?v=N5Zx_ZrrtKE&feature=youtu.be
Answer 1:
Generally speaking, you'll want to try to debug issues like this using the smallest possible bit of code that could reproduce your error. That removes many possible causes for the issue you're seeing.
In this case, you can check if your GPUs are being used by running this code (copied from the TensorFlow 2.0 GPU instructions):
import tensorflow as tf
print("GPU Available: ", tf.test.is_gpu_available())
tf.debugging.set_log_device_placement(True)
# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)
Running it on the same TF 2.0 Notebook gives me the output:
GPU Available: True
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
[49. 64.]], shape=(2, 2), dtype=float32)
That right there shows that it's using the GPU.
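If you want an even stricter check, you can pin the work to the GPU yourself. Here's a minimal sketch (the random matmul is just a stand-in for your own ops); unless soft device placement is enabled, TensorFlow will raise an error if the op can't actually run on GPU:0 instead of silently falling back to the CPU:
import tensorflow as tf
# Pin the computation to the first GPU explicitly.
with tf.device('/GPU:0'):
    x = tf.random.uniform((1000, 1000))
    y = tf.matmul(x, x)
# Should print something like /job:localhost/replica:0/task:0/device:GPU:0
print(y.device)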
Similarly, if you need more evidence, running nvidia-smi gives the output:
jupyter@tf2:~$ nvidia-smi
Tue Jul 30 00:59:58 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    58W / 149W |  10900MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7852      C   /usr/bin/python3                           10887MiB |
+-----------------------------------------------------------------------------+
So why isn't your code using the GPU? You're using a library someone else wrote, probably for tutorial purposes. Most likely those library functions are doing something that causes ops to be placed on the CPU instead of the GPU.
You'll want to debug that code directly.
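For example, a minimal starting point, assuming the tutorial library exposes some training entry point (the run_training() name below is hypothetical), is to turn on device-placement logging before calling into it and confirm TensorFlow sees the K80 at all:
import tensorflow as tf
# Log every op's device placement so CPU-bound ops inside the library
# show up in the notebook output.
tf.debugging.set_log_device_placement(True)
# Sanity check: this should list the Tesla K80.
print(tf.config.experimental.list_physical_devices('GPU'))
# run_training()  # hypothetical: replace with the library's actual training call
Any op that logs a /device:CPU:0 placement in that output is a candidate for why training is so slow.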
Source: https://stackoverflow.com/questions/57140254/google-cloud-ai-platform-notebook-instance-wont-use-gpu-with-jupyter