How to remove untrainable variables when saving checkpoint with tensorflow estimator?


Question


I have a TensorFlow model with an untrainable (frozen) embedding layer larger than 10 GB. I do not want to save this variable to my checkpoint files, because doing so takes too much time and space.

Is it possible to save checkpoints without this untrainable variable and still use tf.estimator normally?

When training in distributed mode, the parameter server stores this variable, and synchronizing it takes too much time. Is it possible to avoid this? Since the variable's values never change, the parameter server should not need to do anything with it.

Here is what I have tried:

  1. I tried to use a tf.constant instead of a tf.Variable to hold this embedding, but I cannot build such a constant because of the proto size limitation:

    import numpy as np
    import tensorflow as tf

    emb_np = np.load(file_name)  # ~10 GB embedding matrix on disk
    embedding_table = tf.constant(
        value=emb_np,
        name=embedding_name,
        shape=[vocab_size, embedding_size])
    

    The error message is:

    "Cannot create a tensor proto whose content is larger than 2GB."
    

In fact, I cannot create a constant larger than 2 GB at all.
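
A common workaround for the 2 GB GraphDef limit is to keep the data out of the graph proto entirely: create a non-trainable variable and feed the array in through a placeholder when initializing it. Below is a minimal sketch reusing the question's names (`file_name`, `vocab_size`, `embedding_size`); the exact wiring of the init step is an assumption:

    import numpy as np
    import tensorflow as tf

    emb_np = np.load(file_name)  # loaded outside the graph

    # The placeholder keeps the 10 GB array out of the GraphDef proto.
    emb_ph = tf.placeholder(tf.float32, shape=[vocab_size, embedding_size])
    embedding_table = tf.get_variable(
        "embedding_table",
        shape=[vocab_size, embedding_size],
        trainable=False)
    embedding_init = embedding_table.assign(emb_ph)

    # Run once after the session is created, e.g. from a Scaffold init_fn:
    # sess.run(embedding_init, feed_dict={emb_ph: emb_np})

Note the variable still ends up in checkpoints unless it is also excluded from the Saver, which is what the next attempt tries.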

  2. Replace the Estimator's default saver.

    # var_to_save holds every variable except the frozen embedding.
    saver = tf.train.Saver(
        var_to_save,
        sharded=True,
        max_to_keep=20,
        keep_checkpoint_every_n_hours=100000,
        defer_build=False,
        save_relative_paths=True)
    

    I removed this variable from the Estimator's default saver and initialize it from a separate checkpoint when building the model. However, the resulting checkpoints cannot be used to resume training or to evaluate the model. The error message is:

    RuntimeError: Init operations did not make model ready for local_init.  Init op: group_deps, init fn: None, error: Variables not initialized: dense/kernel/Adagrad, dense/bias/Adagrad, dense_1/kernel/Adagrad, dense_1/bias/Adagrad, dense_2/kernel/Adagrad, dense_2/bias/Adagrad, w1/Adagrad
    

I think the reason is that the `defer_build` parameter must be set to False when defining the tf.train.Saver with my own variable list, but I do not know where to build the saver in Estimator-based code.
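
For reference, one place to plug a custom Saver into an Estimator is tf.train.Scaffold, returned from the model_fn via the EstimatorSpec. This is only a sketch: `build_model` and `emb_ckpt_path` are hypothetical, and the frozen variable is assumed to be named "embedding_table":

    import tensorflow as tf

    def model_fn(features, labels, mode, params):
        # build_model is a hypothetical helper that creates the graph,
        # including the frozen variable "embedding_table".
        loss, train_op = build_model(features, labels)

        emb_vars = [v for v in tf.global_variables()
                    if "embedding_table" in v.name]
        var_to_save = [v for v in tf.global_variables()
                       if v not in emb_vars]
        # Checkpoint everything except the frozen embedding.
        saver = tf.train.Saver(var_to_save, sharded=True, max_to_keep=20)

        def init_fn(scaffold, sess):
            # Restore the frozen embedding from its own checkpoint
            # (emb_ckpt_path is a hypothetical path).
            tf.train.Saver(emb_vars).restore(sess, emb_ckpt_path)

        scaffold = tf.train.Scaffold(saver=saver, init_fn=init_fn)
        return tf.estimator.EstimatorSpec(
            mode, loss=loss, train_op=train_op, scaffold=scaffold)

One caveat: Scaffold's init_fn only runs when no checkpoint exists in model_dir, so the embedding restore also has to happen when resuming training, which may be what the "Variables not initialized" error above is pointing at.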

  3. I thought the parameter server would not need to manage untrainable variables, but according to the PS memory usage, untrainable variables are stored there. Are there any options to change this (see the sketch below)?
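
For what it's worth, tf.train.replica_device_setter places every variable on a PS task by default, but an inner tf.device spec can override that for a single variable. Below is a sketch under assumed cluster names; whether pinning the frozen embedding to a worker is acceptable depends on the replication setup:

    import tensorflow as tf

    cluster = tf.train.ClusterSpec({
        "ps": ["ps0:2222"],
        "worker": ["worker0:2222", "worker1:2222"]})

    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        # Trainable variables land on the PS as usual.
        w1 = tf.get_variable("w1", shape=[128, 64])

        # The inner device spec wins, so this variable is kept off the PS;
        # here it is pinned to worker 0 (the placement is an assumption).
        with tf.device("/job:worker/task:0"):
            embedding_table = tf.get_variable(
                "embedding_table", shape=[1000, 16], trainable=False)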

Source: https://stackoverflow.com/questions/55841510/how-to-remove-untrainable-variables-when-saving-checkpoint-with-tensorflow-estim
