问题
I want to have train/evaluate the ssd_mobile_v1_coco on my own dataset at the same time in Object Detection API
.
However, when I simply try to do so, I am faced with GPU memory being nearly full and thus the evaluation script fails to start.
Here are the commands I use for training and then evaluation:
Training script is called in one terminal pane like this :
python3 train.py \
--logtostderr \
--train_dir=training_ssd_mobile_caltech \
--pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_focal_loss_coco.config
That runs fine, training works... then I try to run the evaluation script in the second terminal pane :
python3 eval.py \
--logtostderr \
--checkpoint_dir=training_ssd_mobile_caltech \
--eval_dir=eval_caltech \
--pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_focal_loss_coco.config
It fails with the following error :
python3 eval.py \
--logtostderr \
--checkpoint_dir=training_ssd_mobile_caltech \
--eval_dir=eval_caltech \
--pipeline_config_path=ssd_mobilenet_v1_coco_2017_11_17/ssd_mobilenet_v1_focal_loss_coco.config
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
2018-02-28 18:40:00.302271: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-02-28 18:40:00.412808: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-28 18:40:00.413217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.835
pciBusID: 0000:01:00.0
totalMemory: 7.92GiB freeMemory: 93.00MiB
2018-02-28 18:40:00.413424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-02-28 18:40:00.957090: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 43.00M (45088768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:00.957919: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 38.70M (40580096 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
INFO:tensorflow:Restoring parameters from training_ssd_mobile_caltech/model.ckpt-4775
INFO:tensorflow:Restoring parameters from training_ssd_mobile_caltech/model.ckpt-4775
2018-02-28 18:40:02.274830: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:02.278599: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:12.280515: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:12.281958: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 8.17M (8566528 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-02-28 18:40:12.282082: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.75MiB. Current allocation summary follows.
2018-02-28 18:40:12.282160: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (256): Total Chunks: 190, Chunks in use: 190. 47.5KiB allocated for chunks. 47.5KiB in use in bin. 11.8KiB client-requested in use in bin.
2018-02-28 18:40:12.282251: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (512): Total Chunks: 70, Chunks in use: 70. 35.0KiB allocated for chunks. 35.0KiB in use in bin. 35.0KiB client-requested in use in bin.
[.......................................]2018-02-28 18:40:12.290959: I tensorflow/core/common_runtime/bfc_allocator.cc:684] Sum Total of in-use chunks: 29.83MiB
2018-02-28 18:40:12.290971: I tensorflow/core/common_runtime/bfc_allocator.cc:686] Stats:
Limit: 45088768
InUse: 31284736
MaxInUse: 32368384
NumAllocs: 808
MaxAllocSize: 5796864
2018-02-28 18:40:12.291022: W tensorflow/core/common_runtime/bfc_allocator.cc:277] **********************xx*********xx**_*__****______***********************************************xx
2018-02-28 18:40:12.291044: W tensorflow/core/framework/op_kernel.cc:1198] Resource exhausted: OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
WARNING:root:The following classes have no ground truth examples: 1
/home/mm/models/research/object_detection/utils/metrics.py:144: RuntimeWarning: invalid value encountered in true_divide
num_images_correctly_detected_per_class / num_gt_imgs_per_class)
/home/mm/models/research/object_detection/utils/object_detection_evaluation.py:710: RuntimeWarning: Mean of empty slice
mean_ap = np.nanmean(self.average_precision_per_class)
/home/mm/models/research/object_detection/utils/object_detection_evaluation.py:711: RuntimeWarning: Mean of empty slice
mean_corloc = np.nanmean(self.corloc_per_class)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
status, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Preprocessor/sub, FeatureExtractor/MobilenetV1/Conv2d_0/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1/_469 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1068_Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "eval.py", line 146, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "eval.py", line 142, in main
FLAGS.checkpoint_dir, FLAGS.eval_dir)
File "/home/mm/models/research/object_detection/evaluator.py", line 240, in evaluate
save_graph_dir=(eval_dir if eval_config.save_graph else ''))
File "/home/mm/models/research/object_detection/eval_util.py", line 407, in repeated_checkpoint_run
save_graph_dir)
File "/home/mm/models/research/object_detection/eval_util.py", line 286, in _run_checkpoint_once
result_dict = batch_processor(tensor_dict, sess, batch, counters)
File "/home/mm/models/research/object_detection/evaluator.py", line 183, in _process_batch
result_dict = sess.run(tensor_dict)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1344, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1363, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Preprocessor/sub, FeatureExtractor/MobilenetV1/Conv2d_0/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1/_469 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1068_Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D', defined at:
File "eval.py", line 146, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "eval.py", line 142, in main
FLAGS.checkpoint_dir, FLAGS.eval_dir)
File "/home/mm/models/research/object_detection/evaluator.py", line 161, in evaluate
ignore_groundtruth=eval_config.ignore_groundtruth)
File "/home/mm/models/research/object_detection/evaluator.py", line 72, in _extract_prediction_tensors
prediction_dict = model.predict(preprocessed_image, true_image_shapes)
File "/home/mm/models/research/object_detection/meta_architectures/ssd_meta_arch.py", line 334, in predict
preprocessed_inputs)
File "/home/mm/models/research/object_detection/models/ssd_mobilenet_v1_feature_extractor.py", line 112, in extract_features
scope=scope)
File "/home/mm/models/research/slim/nets/mobilenet_v1.py", line 232, in mobilenet_v1_base
scope=end_point)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution
outputs = layer.apply(inputs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 762, in apply
return self.__call__(inputs, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 652, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/convolutional.py", line 167, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 838, in __call__
return self.conv_op(inp, filter)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 502, in __call__
return self.call(inp, filter)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 190, in __call__
name=self.name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 639, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,32,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Preprocessor/sub, FeatureExtractor/MobilenetV1/Conv2d_0/weights/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1/_469 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1068_Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/ClipToWindow/Gather/Gather_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Prior to initiating the eval.py
TF training has all the GPU memory allocated in advance and therefore I cant figure out how to have both of them running at the same time, or at least have the ODA, run evaluation in specific intervals.
Therefore Is it possible in first place to have evaluation run simultaneously with training? if so How is it done ?
System information
What is the top-level directory of the model you are using: object_detection
Have I written custom code: not yet...
OS Platform and Distribution : Linux Ubuntu 16.04 LTS
TensorFlow installed from (source or binary): pip3 tensorflow-gpu
TensorFlow version (use command below): 1.5.0
CUDA/cuDNN version: 9.0/7.0
GPU model and memory: GTX 1080, 8Gb
回答1:
To force the eval job to run on your CPU (and prevent it from taking precious GPU-memory), create one virtualenv where you install tensorflow-gpu which you use for training (named e.g. virtual_tf_gpu), and another one where you install tensorflow WITHOUT gpu support (e.g. virtual_tf). Activate your two virtualenvs in two separate terminal windows and start training in your GPU-supported environment and evaluation in your CPU-supported environment.
Good luck!!!
回答2:
One simple way to do this is to add CUDA_VISIBILE_DEVICES before your command
CUDA_VISIBLE_DEVICES="" python eval.py --logtostderr --pipeline_config_path=multires.config --checkpoint_dir=/train_dir/ --eval_dir=eval_dir/
which will prevent your evaluation script from seeing any GPU, and it should fall back to CPU automatically.
来源:https://stackoverflow.com/questions/49033008/how-to-train-and-evaluate-simultaneously-in-object-detection-api