Question
First, I should mention that I am new to TensorFlow. I am working on an OCR project using CTC (Connectionist Temporal Classification) and LSTM (Long Short-Term Memory). Training has finished, but when I try to restore the session I get an error that is commonly posted on the internet, with a different analysis given each time.
The error is:
2018-01-10 13:42:43.179534: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-01-10 13:42:43.179939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Quadro M4000
major: 5 minor: 2 memoryClockRate (GHz) 0.7725
pciBusID 0000:00:05.0
Total memory: 7.93GiB
Free memory: 7.56GiB
2018-01-10 13:42:43.179974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2018-01-10 13:42:43.179986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2018-01-10 13:42:43.180002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro M4000, pci bus id: 0000:00:05.0)
2018-01-10 13:42:43.316563: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [48] rhs shape= [5,5,1,48]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_1, save/RestoreV2_1/_25)]]
2018-01-10 13:42:43.319682: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [48] rhs shape= [5,5,1,48]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_1, save/RestoreV2_1/_25)]]
2018-01-10 13:42:43.332996: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [48] rhs shape= [5,5,1,48]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_1, save/RestoreV2_1/_25)]]
2018-01-10 13:42:43.333927: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [48] rhs shape= [5,5,1,48]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_1, save/RestoreV2_1/_25)]]
2018-01-10 13:42:43.334583: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [48] rhs shape= [5,5,1,48]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_1, save/RestoreV2_1/_25)]]
2018-01-10 13:42:43.379830: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [48] rhs shape= [5,5,1,48]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_1, save/RestoreV2_1/_25)]]
2018-01-10 13:42:43.380081: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [48] rhs shape= [5,5,1,48]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_1, save/RestoreV2_1/_25)]]
2018-01-10 13:42:43.380189: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [48] rhs shape= [5,5,1,48]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_1, save/RestoreV2_1/_25)]]
2018-01-10 13:42:43.380188: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [48] rhs shape= [5,5,1,48]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_1, save/RestoreV2_1/_25)]]
2018-01-10 13:42:43.380343: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [48] rhs shape= [5,5,1,48]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_1, save/RestoreV2_1/_25)]]
2018-01-10 13:42:43.380554: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [48] rhs shape= [5,5,1,48]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_1, save/RestoreV2_1/_25)]]
2018-01-10 13:42:43.415117: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [48] rhs shape= [5,5,1,48]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_1, save/RestoreV2_1/_25)]]
Traceback (most recent call last):
File "detect.py", line 62, in <module>
print(detect(test_inputs, test_targets, test_seq_len))
File "detect.py", line 23, in detect
saver.restore(sess,'models/ocr.model-100000')
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1548, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [48] rhs shape= [5,5,1,48]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_1, save/RestoreV2_1/_25)]]
Caused by op u'save/Assign_1', defined at:
File "detect.py", line 62, in <module>
print(detect(test_inputs, test_targets, test_seq_len))
File "detect.py", line 20, in detect
saver = tf.train.Saver()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1139, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1170, in build
restore_sequentially=self._restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 691, in build
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 419, in _AddRestoreOps
assign_ops.append(saveable.restore(tensors, shapes))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 155, in restore
self.op.get_shape().is_fully_defined())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/state_ops.py", line 271, in assign
validate_shape=validate_shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_state_ops.py", line 45, in assign
use_locking=use_locking, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [48] rhs shape= [5,5,1,48]
[[Node: save/Assign_1 = Assign[T=DT_FLOAT, _class=["loc:@Variable_1"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Variable_1, save/RestoreV2_1/_25)]]
I have traced the error to the call saver.restore(sess, 'models/ocr.model-100000').
It could be related to many things. What I have done so far:
I removed all the checkpoints saved from the previous training run and started over, but this was not enough.
I used the print_tensors_in_checkpoint_file function provided by TensorFlow, and the checkpoint looks fine to me.
This is the output:
Variable []
Variable_1 [5, 5, 1, 48]
Variable_1/Momentum [5, 5, 1, 48]
Variable_2 [48]
Variable_2/Momentum [48]
Variable_3 [5, 5, 48, 64]
Variable_3/Momentum [5, 5, 48, 64]
Variable_4 [64]
Variable_4/Momentum [64]
Variable_5 [5, 5, 64, 128]
Variable_5/Momentum [5, 5, 64, 128]
Variable_6 [128]
Variable_6/Momentum [128]
Variable_7 [65536, 256]
Variable_7/Momentum [65536, 256]
Variable_8 [256]
Variable_8/Momentum [256]
W [128, 38]
W/Momentum [128, 38]
b [38]
b/Momentum [38]
rnn/multi_rnn_cell/cell_0/lstm_cell/bias [512]
rnn/multi_rnn_cell/cell_0/lstm_cell/bias/Momentum [512]
rnn/multi_rnn_cell/cell_0/lstm_cell/kernel [129, 512]
rnn/multi_rnn_cell/cell_0/lstm_cell/kernel/Momentum [129, 512]
rnn/multi_rnn_cell/cell_1/lstm_cell/bias [512]
rnn/multi_rnn_cell/cell_1/lstm_cell/bias/Momentum [512]
rnn/multi_rnn_cell/cell_1/lstm_cell/kernel [256, 512]
rnn/multi_rnn_cell/cell_1/lstm_cell/kernel/Momentum [256, 512]
For comparison, these are the variables defined in the graph:
[<tf.Variable 'Variable:0' shape=(5, 5, 1, 48) dtype=float32_ref>, <tf.Variable 'Variable_1:0' shape=(48,) dtype=float32_ref>, <tf.Variable 'Variable_2:0' shape=(5, 5, 48, 64) dtype=float32_ref>, <tf.Variable 'Variable_3:0' shape=(64,) dtype=float32_ref>, <tf.Variable 'Variable_4:0' shape=(5, 5, 64, 128) dtype=float32_ref>, <tf.Variable 'Variable_5:0' shape=(128,) dtype=float32_ref>, <tf.Variable 'Variable_6:0' shape=(65536, 256) dtype=float32_ref>, <tf.Variable 'Variable_7:0' shape=(256,) dtype=float32_ref>, <tf.Variable 'rnn/multi_rnn_cell/cell_0/lstm_cell/kernel:0' shape=(129, 512) dtype=float32_ref>, <tf.Variable 'rnn/multi_rnn_cell/cell_0/lstm_cell/bias:0' shape=(512,) dtype=float32_ref>, <tf.Variable 'rnn/multi_rnn_cell/cell_1/lstm_cell/kernel:0' shape=(256, 512) dtype=float32_ref>, <tf.Variable 'rnn/multi_rnn_cell/cell_1/lstm_cell/bias:0' shape=(512,) dtype=float32_ref>, <tf.Variable 'W:0' shape=(128, 38) dtype=float32_ref>, <tf.Variable 'b:0' shape=(38,) dtype=float32_ref>]
I am curious how the saver gets the shapes, and how to debug this.
Answer 1:
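It looks like the order of variables in your graph definition changed at some point: in the saved checkpoint, [5, 5, 1, 48] is the shape of Variable_1 and [48] is the shape of Variable_2, whereas the graph you are restoring into has Variable_1 with shape [48]. The checkpoint also begins with a scalar Variable of shape [], a name that the restoring graph gives to the [5, 5, 1, 48] tensor, so every auto-generated name is shifted by one.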
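A mismatch like this can be found mechanically by comparing the two shape tables name by name. The sketch below is only an illustration: the helper find_shape_mismatches is hypothetical, and the dictionaries are hard-coded from the first few entries of the question's listings so that it runs without TensorFlow. In TensorFlow 1.x the checkpoint table could presumably be obtained from tf.train.NewCheckpointReader(path).get_variable_to_shape_map(), and the graph table from tf.global_variables().

```python
# Shapes stored in the checkpoint (first entries of the
# print_tensors_in_checkpoint_file output in the question; Momentum slots omitted).
checkpoint_shapes = {
    "Variable": [],
    "Variable_1": [5, 5, 1, 48],
    "Variable_2": [48],
    "Variable_3": [5, 5, 48, 64],
}

# Shapes of the variables built by the restoring graph (first entries of the
# variable list in the question).
graph_shapes = {
    "Variable": [5, 5, 1, 48],
    "Variable_1": [48],
    "Variable_2": [5, 5, 48, 64],
}


def find_shape_mismatches(checkpoint, graph):
    """Hypothetical helper: list every graph variable whose checkpoint entry
    is missing or has a different shape, as (name, graph_shape, saved_shape)."""
    problems = []
    for name, shape in sorted(graph.items()):
        saved = checkpoint.get(name)  # None when the name is not in the checkpoint
        if saved is None or list(saved) != list(shape):
            problems.append((name, list(shape), saved))
    return problems


for name, wanted, saved in find_shape_mismatches(checkpoint_shapes, graph_shapes):
    print("%s: graph expects %s, checkpoint has %s" % (name, wanted, saved))
```

Every reported name is one that saver.restore would fail on; an empty result means the names and shapes line up.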
The naming indicates that you didn't give explicit names to the variables, so they received the default names Variable, Variable_1, Variable_2, ... The suffix is determined by the order in which TensorFlow creates them, so if you swap two variables in the code (or add or remove one), they get different names. After that you can no longer load earlier checkpoints directly, because TensorFlow finds a different tensor under the same name.
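The effect of creation order on default names can be illustrated with a toy model. This is not TensorFlow code: NameScope and build are hypothetical stand-ins that only imitate the "Variable", "Variable_1", ... numbering, and the shapes are taken from the question's listings.

```python
class NameScope(object):
    """Toy stand-in for a graph's unique-name bookkeeping (not TensorFlow)."""

    def __init__(self):
        self._counts = {}

    def unique_name(self, base="Variable"):
        # First use keeps the bare name; later uses get _1, _2, ...
        n = self._counts.get(base, 0)
        self._counts[base] = n + 1
        return base if n == 0 else "%s_%d" % (base, n)


def build(shapes):
    """Create unnamed 'variables' in the given order; return {name: shape}."""
    scope = NameScope()
    return dict((scope.unique_name(), shape) for shape in shapes)


# Graph that wrote the checkpoint: a scalar comes first, then weights and bias.
train = build([[], [5, 5, 1, 48], [48]])
# Graph used for restoring: the scalar is not created first,
# so each remaining variable gets the name of its predecessor.
restore = build([[5, 5, 1, 48], [48]])

print(train)
print(restore)
```

Under this model Variable_1 refers to the [5, 5, 1, 48] weights in one graph and to the [48] bias in the other, which matches the lhs and rhs shapes in the error message.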
The best practice is to specify the name of each variable explicitly via the name argument:
W_conv1 = tf.Variable(..., name='W_conv1')
b_conv1 = tf.Variable(..., name='b_conv1')
...
This way the code will be more robust to small perturbations in the model.
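If retraining with explicit names is not an option, one possible workaround is to tell the saver which checkpoint name belongs to which variable. In TensorFlow 1.x, tf.train.Saver accepts a dictionary as var_list that maps checkpoint names to variable objects. The sketch below only computes such a name mapping in plain Python. build_restore_map is a hypothetical helper, it pairs entries by shape in creation order (an assumption that can go wrong when neighbouring variables share a shape), and the input lists are shortened copies of the listings in the question.

```python
def build_restore_map(checkpoint_vars, graph_vars):
    """Hypothetical helper: map checkpoint names to graph names.

    Both arguments are lists of (name, shape) in creation order. Each graph
    variable is paired with the next unused checkpoint entry of equal shape.
    """
    restore_map = {}
    pos = 0
    for graph_name, shape in graph_vars:
        # Skip checkpoint entries that the graph does not create.
        while pos < len(checkpoint_vars) and checkpoint_vars[pos][1] != shape:
            pos += 1
        if pos == len(checkpoint_vars):
            raise ValueError("no checkpoint entry left for %s" % graph_name)
        restore_map[checkpoint_vars[pos][0]] = graph_name
        pos += 1
    return restore_map


checkpoint_vars = [
    ("Variable", []),
    ("Variable_1", [5, 5, 1, 48]),
    ("Variable_2", [48]),
    ("Variable_3", [5, 5, 48, 64]),
]
graph_vars = [
    ("Variable", [5, 5, 1, 48]),
    ("Variable_1", [48]),
    ("Variable_2", [5, 5, 48, 64]),
]

# With TensorFlow, each value would be replaced by the tf.Variable object of
# that name before the dictionary is passed to tf.train.Saver(var_list=...).
print(build_restore_map(checkpoint_vars, graph_vars))
```

The resulting mapping should be checked by hand before it is used, since a wrong pairing with matching shapes would restore silently.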
Source: https://stackoverflow.com/questions/48186569/invalidargumenterror-in-restore-assign-requires-shapes-of-both-tensors-to-match