How to Pause / Resume Training in Tensorflow

鱼传尺愫 2021-01-05 03:15

This question was asked before the documentation for save and restore was available. For now I would consider this question deprecated and advise people to rely on the official

3 Answers
  • 2021-01-05 04:04

    As described by Hamed, the right way to do it in TensorFlow is:

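        import tensorflow as tf  # TF 1.x API

        saver = tf.train.Saver()
        save_path = 'checkpoints/'  # used as the prefix of the checkpoint files
        # While training, you can store the variables with:
        saver.save(sess=session, save_path=save_path)
        # ...and restore them with:
        saver.restore(sess=session, save_path=save_path)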
    

    This loads the model from where you last saved it, and training (if you want to continue) resumes from there.
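As a rough end-to-end sketch of that save/restore cycle: the `train` function, the `weight` counter, and the temporary checkpoint prefix below are all hypothetical, and the code assumes the session-based API is reached through `tf.compat.v1` (plain `tf.train.Saver` on a 1.x install). It is one way to wire the two calls together, not the only one.

```python
import os
import tempfile

import tensorflow as tf

tf1 = tf.compat.v1  # assumption: graph/session API reached via the compat module


def train(ckpt_prefix, num_steps):
    """Run num_steps updates, resuming from ckpt_prefix when a checkpoint exists."""
    with tf.Graph().as_default():
        # Hypothetical stand-in for model weights: a counter bumped once per step.
        weight = tf1.get_variable(
            "weight", shape=[], initializer=tf1.zeros_initializer())
        train_op = tf1.assign_add(weight, 1.0)
        saver = tf1.train.Saver()
        with tf1.Session() as session:
            # The default Saver format writes "<prefix>.index" next to the data files.
            if os.path.exists(ckpt_prefix + ".index"):
                saver.restore(sess=session, save_path=ckpt_prefix)  # resume
            else:
                session.run(tf1.global_variables_initializer())  # fresh start
            for _ in range(num_steps):
                session.run(train_op)
            saver.save(sess=session, save_path=ckpt_prefix)  # "pause" point
            return float(session.run(weight))


ckpt_prefix = os.path.join(tempfile.mkdtemp(), "model")
first = train(ckpt_prefix, num_steps=3)   # starts from scratch
second = train(ckpt_prefix, num_steps=2)  # continues from the saved state
```

The second call does not reinitialize the variable; it restores the saved value first, so the counter keeps growing across "runs".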

  • 2021-01-05 04:06

    Using tf.train.MonitoredTrainingSession() helped me to resume my training when my machine restarted.

    Things to keep in mind:

    1. Make sure you are saving your checkpoints. In tf.train.Saver() you can specify how many checkpoints to keep with max_to_keep.
    2. Specify the directory of the checkpoints in tf.train.MonitoredTrainingSession(checkpoint_dir='dir_path', save_checkpoint_secs=). Based on the save_checkpoint_secs argument, the session keeps saving and updating the checkpoints.
    3. As long as checkpoints keep being saved, tf.train.MonitoredTrainingSession looks for the latest checkpoint in that directory and resumes training from there.
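The three points above can be sketched as follows. This assumes the `tf.compat.v1` API; the `run_steps` helper, the 60-second interval, `max_to_keep=3`, and the "training" op (which only advances the global step) are illustrative choices, not part of the original answer.

```python
import tempfile

import tensorflow as tf

tf1 = tf.compat.v1  # assumption: graph/session API reached via the compat module


def run_steps(checkpoint_dir, num_steps):
    """Train for num_steps more steps, resuming from checkpoint_dir if possible."""
    with tf.Graph().as_default():
        global_step = tf1.train.get_or_create_global_step()
        # Hypothetical "training" op: it only increments the global step.
        train_op = tf1.assign_add(global_step, 1)
        # Point 1: keep at most three checkpoints on disk.
        scaffold = tf1.train.Scaffold(saver=tf1.train.Saver(max_to_keep=3))
        # Points 2 and 3: the session saves into checkpoint_dir and, on startup,
        # restores the latest checkpoint found there.
        with tf1.train.MonitoredTrainingSession(
                checkpoint_dir=checkpoint_dir,
                scaffold=scaffold,
                save_checkpoint_secs=60) as session:
            for _ in range(num_steps):
                session.run(train_op)
            return int(session.run(global_step))


checkpoint_dir = tempfile.mkdtemp()
after_first_run = run_steps(checkpoint_dir, 3)
after_restart = run_steps(checkpoint_dir, 2)  # simulates a machine restart
```

Because the session also writes a checkpoint when it closes normally, the second call starts from the step count the first call ended on rather than from zero.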
  • 2021-01-05 04:19

    TensorFlow uses graph-based computation: nodes are ops, edges are the tensors flowing between them, and variables hold the state. It provides a Saver for those variables.

    Because the computation can be distributed, you can run part of a graph on one machine or processor and the rest on another; meanwhile, you can save the state (the variables) and feed it back in next time to continue your work.

    saver.save(sess, 'my-model', global_step=0)     # ==> filename: 'my-model-0'
    ...
    saver.save(sess, 'my-model', global_step=1000)  # ==> filename: 'my-model-1000'
    

    Later you can call restore on the same kind of Saver instance:

    saver.restore(sess, save_path)
    

    to restore your saved variables.
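A small sketch of how the `global_step` suffix and the restore call fit together, assuming the `tf.compat.v1` API. The `counter` variable and the temporary directory are hypothetical, and `latest_checkpoint` is used here only as one convenient way to find the newest save.

```python
import os
import tempfile

import tensorflow as tf

tf1 = tf.compat.v1  # assumption: graph/session API reached via the compat module

checkpoint_dir = tempfile.mkdtemp()  # hypothetical location for this sketch
prefix = os.path.join(checkpoint_dir, "my-model")

with tf.Graph().as_default():
    counter = tf1.get_variable(
        "counter", shape=[], initializer=tf1.zeros_initializer())
    saver = tf1.train.Saver()
    with tf1.Session() as sess:
        sess.run(tf1.global_variables_initializer())
        first_path = saver.save(sess, prefix, global_step=0)
        sess.run(tf1.assign(counter, 1000.0))  # pretend 1000 steps happened
        last_path = saver.save(sess, prefix, global_step=1000)

# Look up the most recent checkpoint recorded in checkpoint_dir.
latest = tf1.train.latest_checkpoint(checkpoint_dir)

# A fresh graph and session, as after a restart: rebuild the variables, then restore.
with tf.Graph().as_default():
    counter = tf1.get_variable(
        "counter", shape=[], initializer=tf1.zeros_initializer())
    saver = tf1.train.Saver()
    with tf1.Session() as sess:
        saver.restore(sess, latest)
        restored = float(sess.run(counter))
```

Note that restore is called on a Saver instance built in a graph that defines the same variables; the saved values are matched to variables by name.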

    Saver Usage
