i'm training some Music Data on a LSTM-RNN in Tensorflow and encountered some Problem with GPU-Memory-Allocation which i don't understand: I encounter an OOM when there actually seems to be just about enough VRAM still available. Some background: I'm working on Ubuntu Gnome 16.04, using a GTX1060 6GB, Intel Xeon E3-1231V3 and 8GB RAM. So now first the part of the error-message which i can understand, in the and i will add the whole error message in the end again for anyone who might ask for it to help:
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 256 totalling 2.0KiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 44288 totalling 216.2KiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 56064 totalling 273.8KiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 154350080 totalling 588.80MiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 813400064 totalling 2.27GiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1612612352 totalling 1.50GiB I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 4.35GiB I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 5484118016
InUse: 4670717952
MaxInUse: 5484118016
NumAllocs: 29
MaxAllocSize: 1612612352
W tensorflow/core/common_runtime/bfc_allocator.cc:274] *********************___________*__***************************************************xxxxxxxxxxxxxx W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 775.72MiB. See logs for memory state. W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[14525,14000]
So i can read that there is a maximum of 5484118016 bytes to be allocated, 4670717952 bytes are allready in use, and another 777.72MB = 775720000 bytes are to be allocated. 5484118016 bytes - 4670717952 bytes - 775720000 bytes = 37680064 bytes according to my calculator. So there should still be 37MB of free VRAM after allocating the space for the new Tensor he wants to push in there. This seems also to be quite legit to me, as Tensorflow would probably (i guess?) not try to allocate more VRAM than there is still available and just put the rest of the data on hold in RAM or something.
Now i guess there is just some big error in my thinking, but i would be quite gratefull if someone could explain to me, what this error is. The obvious solving-strategy to my problem is to just make my batches a bit smaller, having them each at around 1.5GB probably just is too big. Still i would love to know what the actual problem is there.
edit: I found something telling me to try:
config = tf.ConfigProto() config.gpu_options.allocator_type = 'BFC' with tf.Session(config = config) as s:
which still does not work, but as the tensorflow documentation lacks any explanation of what
gpu_options.allocator_type = 'BFC'
would be, i would love to ask you guys.
Adding the rest of the error message for anyone interested:
Sorry for the long copy/paste, but maybe someone would need/want to see it,
Thank you so much in advance, Leon
(gputensorflow) leon@ljksUbuntu:~/Tensorflow$ python Netzwerk_v0.5.1_gamma.py I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations. W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate (GHz) 1.7335 pciBusID 0000:01:00.0 Total memory: 5.93GiB Free memory: 5.40GiB I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0) I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16384): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (32768): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (65536): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (262144): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8388608): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (33554432): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (134217728): Total Chunks: 1, Chunks in use: 0 147.20MiB allocated for chunks. 147.20MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (268435456): Total Chunks: 1, Chunks in use: 0 628.52MiB allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 775.72MiB was 256.00MiB, Chunk State: I tensorflow/core/common_runtime/bfc_allocator.cc:666] Size: 628.52MiB | Requested Size: 0B | in_use: 0, prev: Size: 147.20MiB | Requested Size: 147.20MiB | in_use: 1, next: Size: 54.8KiB | Requested Size: 54.7KiB | in_use: 1 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000000 of size 1280 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000500 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000600 of size 56064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1020800e100 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1020800e200 of size 44288 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208018f00 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208019000 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208019100 of size 813400064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102387d1100 of size 56064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102387dec00 of size 154350080 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b11e00 of size 44288 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cb00 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cc00 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cd00 of size 154350080 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102722d4d00 of size 56064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b615a00 of size 44288 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620700 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620800 of size 256 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620900 of size 813400064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102abdd8900 of size 813400064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc590900 of size 56064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc59e400 of size 56064 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc5abf00 of size 154350080 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102e58df100 of size 154350080 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec12300 of size 44288 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec1d000 of size 44288 I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec27d00 of size 1612612352 I tensorflow/core/common_runtime/bfc_allocator.cc:687] Free at 0x1024ae4ff00 of size 659049984 I tensorflow/core/common_runtime/bfc_allocator.cc:687] Free at 0x102722e2800 of size 154350080 I tensorflow/core/common_runtime/bfc_allocator.cc:693] Summary of in-use Chunks by size: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 256 totalling 2.0KiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 44288 totalling 216.2KiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 56064 totalling 273.8KiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 154350080 totalling 588.80MiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 813400064 totalling 2.27GiB I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1612612352 totalling 1.50GiB I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 4.35GiB I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats: Limit: 5484118016 InUse: 4670717952 MaxInUse: 5484118016 NumAllocs: 29 MaxAllocSize: 1612612352 W tensorflow/core/common_runtime/bfc_allocator.cc:274] *********************___________*__***************************************************xxxxxxxxxxxxxx W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 775.72MiB. See logs for memory state. W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[14525,14000] Traceback (most recent call last): File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call return fn(*args) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn status, run_metadata) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/contextlib.py", line 66, in __exit__ next(self.gen) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status pywrap_tensorflow.TF_GetCode(status)) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[14525,14000] [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]] During handling of the above exception, another exception occurred: Traceback (most recent call last): File "Netzwerk_v0.5.1_gamma.py", line 171, in session.run(tf.global_variables_initializer()) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run run_metadata_ptr) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 965, in _run feed_dict_string, options, run_metadata) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run target_list, options, run_metadata) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[14525,14000] [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]] Caused by op 'rnn/basic_lstm_cell/weights/Initializer/random_uniform', defined at: File "Netzwerk_v0.5.1_gamma.py", line 94, in initial_state=initial_state, time_major=False) # time_major = FALSE currently File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 545, in dynamic_rnn dtype=dtype) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 712, in _dynamic_rnn_loop swap_memory=swap_memory) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2626, in while_loop result = context.BuildLoop(cond, body, loop_vars, shape_invariants) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2459, in BuildLoop pred, body, original_loop_vars, loop_vars, shape_invariants) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2409, in _BuildLoop body_result = body(*packed_vars_for_body) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 697, in _time_step (output, new_state) = call_cell() File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 683, in call_cell = lambda: cell(input_t, state) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py", line 179, in __call__ concat = _linear([inputs, h], 4 * self._num_units, True, scope=scope) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py", line 747, in _linear "weights", [total_arg_size, output_size], dtype=dtype) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 988, in get_variable custom_getter=custom_getter) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 890, in get_variable custom_getter=custom_getter) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 348, in get_variable validate_shape=validate_shape) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 333, in _true_getter caching_device=caching_device, validate_shape=validate_shape) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 684, in _get_single_variable validate_shape=validate_shape) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 226, in __init__ expected_shape=expected_shape) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 303, in _init_from_args initial_value(), name="initial_value", dtype=dtype) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 673, in shape.as_list(), dtype=dtype, partition_info=partition_info) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/init_ops.py", line 360, in __call__ dtype, seed=self.seed) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/random_ops.py", line 246, in random_uniform return math_ops.add(rnd * (maxval - minval), minval, name=name) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 73, in add result = _op_def_lib.apply_op("Add", x=x, y=y, name=name) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op op_def=op_def) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op original_op=self._default_original_op, op_def=op_def) File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__ self._traceback = _extract_stack() ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[14525,14000] [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]]