Question
I am using TensorFlow's Object Detection API with my own custom dataset. I am currently training "ssd_mobilenet_v1_coco".
Every time I try, training starts but then stops silently at a random point, without any error message. (With the command below, the command prompt shows the step count up to some point and then nothing more.) It seems that the GPU (CUDA) also stops working.
I have already tried changing batch_size (64 gave the best score) and switching to "ssd_mobilenet_v2_coco".
Is this caused by a parameter (such as "sample_1_of_n_eval_examples=1") or by a GPU problem?
OS: Windows 10, TensorFlow: 1.15, Python: 3.6, CPU: i9-9900K, GPU: NVIDIA GeForce RTX 2080
Command I used:
python object_detection/model_main.py --pipeline_config_path="C:\Users\MYPATH\models\model\ssd_mobilenet_v1_coco.config" --model_dir="C:\Users\MYPATH\models\model" --num_train_steps=2000 --sample_1_of_n_eval_examples=1 --alsologtostderr
MESSAGE
INFO:tensorflow:Done calling model_fn.
I0106 17:49:29.545947 15104 estimator.py:1150] Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
I0106 17:49:29.545947 15104 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
I0106 17:49:32.188141 15104 monitored_session.py:240] Graph was finalized.
2020-01-06 17:49:32.200573: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-01-06 17:49:32.205758: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-01-06 17:49:32.229166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2020-01-06 17:49:32.232539: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2020-01-06 17:49:32.236216: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-01-06 17:49:32.239801: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2020-01-06 17:49:32.242368: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2020-01-06 17:49:32.246706: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2020-01-06 17:49:32.250779: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2020-01-06 17:49:32.258807: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-01-06 17:49:32.261581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-01-06 17:49:32.700705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-06 17:49:32.703645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-01-06 17:49:32.705343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-01-06 17:49:32.707345: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6271 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO:tensorflow:Running local_init_op.
I0106 17:49:35.342885 15104 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0106 17:49:35.702204 15104 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into C:\Users\mypath\models\model\model.ckpt.
I0106 17:49:42.856755 15104 basic_session_run_hooks.py:606] Saving checkpoints for 0 into C:\Users\mypath\models\model\model.ckpt.
2020-01-06 17:49:51.489601: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-01-06 17:49:52.410981: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-01-06 17:49:52.445252: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
INFO:tensorflow:loss = 33.134163, step = 0
I0106 17:49:55.059146 15104 basic_session_run_hooks.py:262] loss = 33.134163, step = 0
INFO:tensorflow:global_step/sec: 2.58675
I0106 17:50:33.717694 15104 basic_session_run_hooks.py:692] global_step/sec: 2.58675
INFO:tensorflow:loss = 9.563588, step = 100 (38.659 sec)
I0106 17:50:33.717694 15104 basic_session_run_hooks.py:260] loss = 9.563588, step = 100 (38.659 sec)
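Since the output simply ends with no error, one way to tell whether the process crashed or just stopped making progress is to launch the training command from a small wrapper that records the exit code and the last lines of output. The following is only a minimal sketch: `run_and_report` and its `tail_lines` parameter are hypothetical names chosen for illustration, not part of the Object Detection API, and it uses only the Python standard library (it should also run on Python 3.6).

```python
import subprocess
import sys
from collections import deque


def run_and_report(cmd, tail_lines=20):
    """Run cmd (a list of arguments), echo its output, and return
    (exit_code, last_lines). Hypothetical helper for illustration only."""
    tail = deque(maxlen=tail_lines)  # keeps only the most recent lines
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so log messages are captured too
        universal_newlines=True,   # text mode; works on Python 3.6
    )
    for line in proc.stdout:
        sys.stdout.write(line)     # still show progress on the console
        tail.append(line.rstrip("\n"))
    exit_code = proc.wait()
    return exit_code, list(tail)


# Example usage (assumption: same arguments as the command in the question):
# code, last = run_and_report(
#     ["python", "object_detection/model_main.py", "--alsologtostderr"])
# print("exit code:", code)
```

A nonzero exit code would suggest the process terminated abnormally rather than finishing or hanging. If the wrapper never returns, the process is still alive but stalled.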
Source: https://stackoverflow.com/questions/59609378/tensorflow-object-detection-api-training-fails-silently