TensorFlow crashes with CUBLAS_STATUS_ALLOC_FAILED

前端未结


I'm running tensorflow-gpu on Windows 10 using a simple MNIST neural network program. When it tries to run, it encounters a CUBLAS_STATUS_ALLOC_FAILED error. A


10 Answers

既然无缘

2020-12-03 01:24

There are at least two distinct problems here. The first occurs when a Python process is re-run and the GPU memory from the previous run has not been freed. You can tell this is happening because the Python process consumes a huge amount of memory as soon as it appears, and then fails when it tries to acquire more. In the attached screen grab, ~6 GB is acquired on startup. Check the GPU memory in the Windows Task Manager, using the Dedicated GPU Memory column under the Details tab. In this case, reboot the PC, as the problem is caused by running out of GPU memory.

TF is designed not to free memory during a session, because doing so would lead to fragmentation, so it looks like the IPython/Python session is holding the TF instance and not freeing the memory from the last run. In my case, using PyCharm with an IPython session, repeated runs eventually led to all my GPU memory being grabbed statically on startup, with little left for dynamic growth.
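As an alternative to reading the Task Manager column, the same check can be scripted. The following is a minimal sketch, assuming an NVIDIA driver install that puts `nvidia-smi` on the PATH; the helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical, and the parser assumes every field is a plain integer.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows (MiB) from nvidia-smi CSV output into tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...], one entry per GPU.

    Assumes nvidia-smi is installed and on the PATH.
    """
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()` before starting TensorFlow should show whether a previous run is still holding most of the card's memory.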

The second problem occurs when the GPU device is misconfigured. Depending on the TF version and how many devices you are using, you may need to set the same GPU memory policy across all devices. The policy is either to allow the GPU memory to grow during a session, or to grab as much as possible on startup. Various fixes are listed in the other answers; choose the one that fits the TF version you're using and whether you have more than one device.
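One way to select the "grow during a session" policy without touching any TF configuration API is the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable, which the TensorFlow GPU guide describes as platform specific. A minimal sketch, assuming TF 2.x and that the variable is set before TensorFlow is imported:

```python
import os

# Request the "allow growth" policy for every GPU.
# Set this before TensorFlow is imported, so it is in place
# when the GPUs are initialized.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import tensorflow as tf  # import TensorFlow only after the variable is set
```

Because the setting applies process-wide, it avoids having to apply the same policy to each device in code.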

有刺的猬

2020-12-03 01:25
For TensorFlow 2.2, none of the solutions above worked when I hit the CUBLAS_STATUS_ALLOC_FAILED problem. I found a solution at https://www.tensorflow.org/guide/gpu:
```
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
```
I ran this code before any other computation, and the same code that previously produced the CUBLAS error then worked in the same session. The sample above is a specific example that sets memory growth across a number of physical GPUs, but it also solves the memory expansion problem.

爱一瞬间的悲伤

2020-12-03 01:27

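I found that this solution works:

```
# TF 1.x API (tf.ConfigProto / tf.Session) with standalone Keras
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# Cap this process at 80% of the GPU memory
config = tf.ConfigProto(
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
    # device_count = {'GPU': 1}
)
# Allocate memory as needed instead of grabbing it all on startup
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
# Make Keras use the session configured above
set_session(session)
```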

温柔的废话

2020-12-03 01:28

None of these fixes worked for me, as it seems that the structure of the TensorFlow libraries has changed. For TensorFlow 2.0, the only fix that worked for me was the one under "Limiting GPU memory growth" on this page: https://www.tensorflow.org/guide/gpu

For completeness and future-proofing, here's the solution from the docs. I imagine changing memory_limit may be necessary for some people; 1 GB was fine for my case.

```
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
```

渐次进展

2020-12-03 01:28

For Keras:

```
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
set_session(session)
```

自闭症患者

2020-12-03 01:28

In my case, a stale Python process was consuming memory. I killed it through Task Manager, and things went back to normal.
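The same check can be done from a script instead of Task Manager. This is a rough sketch, assuming Windows (it relies on the built-in `tasklist` command) and that the stale interpreter is named `python.exe`; the helper names are hypothetical. Note that the memory figure `tasklist` reports is host memory, not dedicated GPU memory.

```python
import csv
import io
import subprocess


def parse_tasklist(csv_text):
    """Parse `tasklist /FO CSV /NH` output into (pid, memory_text) tuples."""
    processes = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Expected columns: image name, PID, session name, session number, memory usage.
        # Informational lines (e.g. when nothing matches) do not fit this shape.
        if len(row) >= 5 and row[1].isdigit():
            processes.append((int(row[1]), row[4]))
    return processes


def find_python_processes():
    """List running python.exe processes (Windows only)."""
    result = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq python.exe", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tasklist(result.stdout)
```

Any PID returned here that does not belong to the current session is a candidate for the stale process, which can then be ended from Task Manager.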
