How to Pause / Resume Training in Tensorflow

This question was asked before the documentation for save and restore was available. I now consider this question deprecated and advise people to rely on the official

3 Answers

旧巷少年郎

2021-01-05 04:04
As described by Hamed, the right way to do it in TensorFlow is:
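```
import tensorflow as tf  # TF 1.x API

# `session` is assumed to be an open tf.Session with the graph already built
saver = tf.train.Saver()
save_path = 'checkpoints/'
# while training, you can store the variables using
saver.save(sess=session, save_path=save_path)
# and later restore them using
saver.restore(sess=session, save_path=save_path)
```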
This loads the model from the point where you last saved it, and training can continue (if you want) from there.
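To make the save/restore cycle concrete, here is a minimal sketch of a training function that resumes when a checkpoint is already present. It assumes TensorFlow 2.x with the `tf.compat.v1` API. The toy model, the `train` helper, and the `checkpoints/model` prefix are hypothetical and only there for illustration.

```python
import os
import tempfile

# Assumption: TF 2.x is installed and the v1 compatibility API is used.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()


def train(save_path, num_steps):
    """Run num_steps training steps, resuming from save_path if possible."""
    tf.reset_default_graph()
    # Hypothetical toy model: a single weight pulled toward 3.0.
    w = tf.get_variable("w", initializer=0.0)
    step = tf.train.get_or_create_global_step()
    loss = tf.square(w - 3.0)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=step)
    saver = tf.train.Saver()

    ckpt_dir = os.path.dirname(save_path)
    os.makedirs(ckpt_dir, exist_ok=True)
    with tf.Session() as session:
        if tf.train.latest_checkpoint(ckpt_dir):
            # A checkpoint exists: load the saved variables (resume).
            saver.restore(sess=session, save_path=save_path)
        else:
            # Nothing saved yet: start from freshly initialized variables.
            session.run(tf.global_variables_initializer())
        for _ in range(num_steps):
            session.run(train_op)
        saver.save(sess=session, save_path=save_path)
        return int(session.run(step))


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "checkpoints", "model")
    print(train(path, 5))  # first run starts from step 0
    print(train(path, 5))  # second run continues from the saved step
```

Because the global step is a variable too, it is saved and restored along with the weight, so the second call should pick up the step count where the first one stopped.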
我在风中等你

2021-01-05 04:06
Using tf.train.MonitoredTrainingSession() helped me to resume my training when my machine restarted.

Things to keep in mind:
1. Make sure you are saving your checkpoints. In tf.train.Saver() you can specify max_to_keep to limit how many checkpoints are kept.
2. Specify the directory of the checkpoints in tf.train.MonitoredTrainingSession(checkpoint_dir='dir_path',save_checkpoint_secs=). Based on the save_checkpoint_secs argument, the session keeps saving and updating the checkpoints.
3. As long as checkpoints keep being saved, the function above looks for the latest checkpoint when it starts and resumes training from there.
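The points above can be sketched as follows, assuming TensorFlow 2.x with the `tf.compat.v1` API. The toy model, the `train` helper, and the 60-second save interval are illustrative assumptions, not values from the answer.

```python
import tempfile

# Assumption: TF 2.x is installed and the v1 compatibility API is used.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()


def train(checkpoint_dir, num_steps):
    """Run num_steps steps; return the global step at start and at end."""
    tf.reset_default_graph()
    # Hypothetical toy model: a single weight pulled toward 3.0.
    w = tf.get_variable("w", initializer=0.0)
    step = tf.train.get_or_create_global_step()
    loss = tf.square(w - 3.0)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=step)

    # The session restores the latest checkpoint found in checkpoint_dir,
    # if there is one, and otherwise initializes the variables. It saves
    # a checkpoint every save_checkpoint_secs seconds and when it closes.
    with tf.train.MonitoredTrainingSession(
            checkpoint_dir=checkpoint_dir,
            save_checkpoint_secs=60) as sess:
        start = int(sess.run(step))
        for _ in range(num_steps):
            sess.run(train_op)
        end = int(sess.run(step))
    return start, end


if __name__ == "__main__":
    ckpt_dir = tempfile.mkdtemp()  # hypothetical scratch directory
    print(train(ckpt_dir, 5))  # first run starts at step 0
    print(train(ckpt_dir, 3))  # second run starts where the first stopped
```

No explicit Saver is needed in this variant, because the session creates its own and handles both saving and restoring.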
盖世英雄少女心

2021-01-05 04:19
TensorFlow uses graph-based computation, with nodes (ops) and edges (variables, aka state), and it provides a Saver for its variables.

Because the computation is distributed, you can run part of a graph on one machine/processor and the rest on another. Meanwhile, you can save the state (the variables) and feed it back in next time to continue your work.
```
saver.save(sess, 'my-model', global_step=0)     # ==> filename: 'my-model-0'
...
saver.save(sess, 'my-model', global_step=1000)  # ==> filename: 'my-model-1000'
```
Later you can use
```
saver.restore(sess, save_path)
```
to restore your saved variables.
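As a small illustration of the `global_step` suffix, the sketch below saves the same variable at two steps and then restores the earlier one. It assumes TensorFlow 2.x with the `tf.compat.v1` API. The variable `v` and the temporary directory are hypothetical.

```python
import os
import tempfile

# Assumption: TF 2.x is installed and the v1 compatibility API is used.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()
tf.reset_default_graph()

ckpt_dir = tempfile.mkdtemp()  # hypothetical scratch directory
prefix = os.path.join(ckpt_dir, "my-model")

v = tf.get_variable("v", initializer=1.0)  # hypothetical state to save
update = tf.assign(v, 2.0)
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # save() returns the checkpoint path, including the step suffix.
    first = saver.save(sess, prefix, global_step=0)      # v is 1.0 here
    sess.run(update)
    second = saver.save(sess, prefix, global_step=1000)  # v is 2.0 here

    # The most recently written checkpoint in the directory.
    latest = tf.train.latest_checkpoint(ckpt_dir)

    # Restoring the earlier checkpoint brings back the earlier value.
    saver.restore(sess, first)
    restored = float(sess.run(v))

print(os.path.basename(first), os.path.basename(second))
print(os.path.basename(latest), restored)
```

Passing the path returned by `tf.train.latest_checkpoint` to `saver.restore` is a common way to continue from the newest checkpoint without hard-coding the step number.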

Saver Usage