I'm running tensorflow-gpu on Windows 10 with a simple MNIST neural network program. When it tries to run, it encounters a CUBLAS_STATUS_ALLOC_FAILED error.
There are at least two distinct problems here. The first is when a previously run Python process has not freed its GPU memory before the script is re-run. You can tell this is happening because the Python process grabs a huge amount of GPU memory the instant it starts and then fails when it tries to acquire more; in the attached screen grab, ~6 GB is acquired on startup. Check GPU memory with the Windows Task Manager, in the Dedicated GPU Memory column under the Details tab. If this is the cause, reboot the PC, because the problem is simply running out of GPU memory. TF is designed not to release memory during a session, since doing so would lead to fragmentation, so it looks like the IPython/Python session is holding the TF instance and not freeing the memory from the previous run. In my case, using PyCharm with an IPython session, repeatedly re-running the script eventually leads to all of my GPU memory being grabbed statically on startup, with little left for dynamic growth.
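If you prefer checking from the command line rather than Task Manager, nvidia-smi (installed with the NVIDIA driver) can report per-GPU and per-process memory use. A minimal sketch, assuming nvidia-smi is on your PATH:

import subprocess

# Total vs. used memory per GPU
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv"],
    capture_output=True, text=True).stdout)

# Processes currently holding GPU memory (e.g. a stale python.exe)
print(subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True, text=True).stdout)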
The second problem is when the GPU device is configured incorrectly. Depending on the TF version and how many devices you are using, you may need to set the same GPU memory policy across all devices. The policy is either to allow GPU memory to grow during a session, or to grab as much as possible on startup. Various fixes are listed in the answers below; choose the one that fits the TF version you're using and whether you have more than one device.
For TensorFlow 2.2, none of the other solutions worked when the CUBLAS_STATUS_ALLOC_FAILED error was encountered. I found a solution at https://www.tensorflow.org/guide/gpu:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
I ran this code before any further calculations were made and found that the same code that previously produced the CUBLAS error now worked in the same session. The sample code above is a specific example that sets memory growth across a number of physical GPUs, but it also solves the memory expansion problem.
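As a rough illustration of "before any further calculations", here is a hypothetical end-to-end sketch: memory growth is configured first, and only then is a small MNIST model (just an example, not the code from the question) built and trained.

import tensorflow as tf

# Enable memory growth first, before any op touches the GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Only now load data and build the model (hypothetical MNIST example).
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1)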
I found this solution works:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
    # device_count={'GPU': 1}
)
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
set_session(session)
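If you are on TensorFlow 2.x but want the same ConfigProto-style limits, a sketch using the tf.compat.v1 aliases (assuming tf.keras rather than the standalone keras package):

import tensorflow as tf

config = tf.compat.v1.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.8  # cap at ~80% of GPU memory
config.gpu_options.allow_growth = True                    # grow on demand within that cap
session = tf.compat.v1.Session(config=config)
tf.compat.v1.keras.backend.set_session(session)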
None of these fixes worked for me, as it seems that the structure of the TensorFlow libraries has changed. For TensorFlow 2.0, the only fix that worked for me was the one under Limiting GPU memory growth on this page: https://www.tensorflow.org/guide/gpu
For completeness and future-proofing, here's the solution from the docs. I imagine changing memory_limit may be necessary for some people; 1 GB was fine for my case.
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
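In later TensorFlow 2.x releases the experimental names have stable equivalents; a minimal sketch, assuming TF 2.4 or newer:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Cap the first GPU at roughly 1 GB via a logical device.
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        # Must run before the GPU has been initialized.
        print(e)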
For Keras:
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
set_session(session)
In my case, a stale Python process was consuming memory. I killed it through Task Manager, and things went back to normal.