Question
I have a TensorFlow model to train with an untrainable embedding layer that is larger than 10GB. I do not want to save this variable to my checkpoint file, because that takes too much time and space.
Is it possible to save the checkpoint without this untrainable variable and still use tf.estimator normally?
When training the model in distributed mode, the parameter server holds this variable, and synchronizing it takes too much time. Is it possible to avoid this problem? The values of this variable never change, so I think the parameter server should not need to do anything with it.
Here is what I have tried:
I tried to use a tf.constant instead of a tf.Variable to hold this embedding, but I cannot build such a constant because of the tensor proto size limitation:
    emb_np = np.load(file_name)
    embedding_table = tf.constant(
        value=emb_np,
        name=embedding_name,
        shape=[vocab_size, embedding_size])
The error message is like this:
"Cannot create a tensor proto whose content is larger than 2GB."
In fact, there is no way for me to create a constant larger than 2GB.
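For context, one workaround I have seen for this limit is to initialize a non-trainable variable through a placeholder, so the array is fed at initialization time and never serialized into the GraphDef. A minimal sketch, assuming a plain TF 1.x session (with an estimator the initializer would have to be run from a hook or a scaffold init_fn); the sizes and file path are placeholders:

    import numpy as np
    import tensorflow as tf

    vocab_size, embedding_size = 1000000, 512  # placeholder sizes
    file_name = "embedding.npy"                # hypothetical path

    # Feed the array at initialization time instead of baking it into
    # the GraphDef, which is what hits the 2GB proto limit.
    embedding_init = tf.placeholder(
        tf.float32, shape=[vocab_size, embedding_size])
    embedding_table = tf.get_variable(
        "embedding_table", initializer=embedding_init, trainable=False)

    emb_np = np.load(file_name)
    with tf.Session() as sess:
        sess.run(embedding_table.initializer,
                 feed_dict={embedding_init: emb_np})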
I also tried to replace the default saver of the estimator:
    saver = tf.train.Saver(
        var_to_save,
        sharded=True,
        max_to_keep=20,
        keep_checkpoint_every_n_hours=100000,
        defer_build=False,
        save_relative_paths=True)
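Here var_to_save is meant to contain every variable except the embedding table; a minimal sketch of how it might be built, assuming the table can be identified by a name prefix (the prefix is an assumption):

    import tensorflow as tf

    EMBEDDING_NAME = "embedding_table"  # hypothetical name of the frozen table

    # Keep every global variable except the frozen embedding table.
    var_to_save = [v for v in tf.global_variables()
                   if not v.name.startswith(EMBEDDING_NAME)]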
I removed this variable from the estimator's default saver, and the variable is initialized from another checkpoint when the model is built. However, the resulting checkpoint cannot be used when resuming training from an existing checkpoint or when evaluating the model. The error messages look like this:
RuntimeError: Init operations did not make model ready for local_init. Init op: group_deps, init fn: None, error: Variables not initialized: dense/kernel/Adagrad, dense/bias/Adagrad, dense_1/kernel/Adagrad, dense_1/bias/Adagrad, dense_2/kernel/Adagrad, dense_2/bias/Adagrad, w1/Adagrad
I think the reason is that the defer_build parameter must be set to False when defining the tf.train.Saver with my own variable list, but I do not know how to build the saver in my code when using an estimator.
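For what it is worth, the usual way I have seen to make an estimator use a custom saver is to pass it through a tf.train.Scaffold in the EstimatorSpec. A minimal sketch (the toy network, the feature key "x", and the name filter are all assumptions):

    import tensorflow as tf

    def model_fn(features, labels, mode, params):
        # Toy network standing in for the real model.
        logits = tf.layers.dense(features["x"], 1)
        loss = tf.losses.mean_squared_error(labels, logits)
        train_op = tf.train.AdagradOptimizer(0.01).minimize(
            loss, global_step=tf.train.get_global_step())

        # Build the saver inside model_fn, after all variables (including
        # the Adagrad slot variables) exist, and exclude the frozen table.
        var_to_save = [v for v in tf.global_variables()
                       if not v.name.startswith("embedding_table")]
        saver = tf.train.Saver(var_to_save, sharded=True, max_to_keep=20,
                               save_relative_paths=True)

        # The estimator then uses this saver instead of its default one.
        return tf.estimator.EstimatorSpec(
            mode=mode, loss=loss, train_op=train_op,
            scaffold=tf.train.Scaffold(saver=saver))

Because the saver is constructed after the train_op, the Adagrad slot variables already exist when it is built, which seems to be exactly the ordering problem behind the error above.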
I thought the parameter server should not need to manage untrainable variables, but according to the memory usage of the PS, untrainable variables are stored there. Are there any options to change this?
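One option I am aware of (an assumption on my part, not something verified at this scale) is to create the frozen table outside the replica_device_setter scope, pinned to the worker, so it is never placed on a PS task; a sketch with placeholder cluster addresses and sizes:

    import tensorflow as tf

    vocab_size, embedding_size = 1000000, 512  # placeholder sizes
    cluster = tf.train.ClusterSpec(
        {"ps": ["ps0:2222"], "worker": ["worker0:2222"]})

    # Pin the read-only table to the worker so the replica_device_setter
    # never assigns it to a PS task; each worker holds its own copy and
    # nothing needs to be synchronized.
    with tf.device("/job:worker/task:0"):
        embedding_table = tf.get_variable(
            "embedding_table", shape=[vocab_size, embedding_size],
            trainable=False)

    # Ordinary trainable variables still go to the PS as usual.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        w1 = tf.get_variable("w1", shape=[128, 64])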
Source: https://stackoverflow.com/questions/55841510/how-to-remove-untrainable-variables-when-saving-checkpoint-with-tensorflow-estim