Question
For my thesis I am trying to evaluate the impact of different parameters on my active-learning object detector built with TensorFlow (v1.14).
I am using the standard faster_rcnn_inception_v2_coco config from the model zoo and a fixed random.seed(1).
To make sure I have a working baseline experiment, I ran the object detector twice with the same dataset, learning time, pooling size, and so forth.
However, the two plotted graphs after 20 active-learning cycles are quite different. Is it possible to ensure comparable neural-net performance? If so, how do I set up a scientific experiment to compare the outcomes of parameter changes such as learning rate, learning time (a constraint in our active-learning cycle!), pooling size, ...?
Answer 1:
To achieve determinism when training on CPU, the following should be sufficient:
1. SET ALL SEEDS
import os
import random
import numpy as np
import tensorflow as tf

SEED = 123
os.environ['PYTHONHASHSEED'] = str(SEED)  # only effective if set before the interpreter starts
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)
2. LIMIT CPU THREADS TO ONE
session_config = tf.ConfigProto()
session_config.intra_op_parallelism_threads = 1
session_config.inter_op_parallelism_threads = 1
sess = tf.Session(config=session_config)
3. DATASET SHARDING
If you are using tf.data.Dataset, make sure the number of shards is limited to one.
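For intuition on why this matters: sharding splits the record stream across readers, and when more than one shard is read in parallel, the merged record order can differ between runs. A minimal pure-Python sketch of the idea (the shard helper here is hypothetical, mirroring the semantics of tf.data.Dataset.shard, not a TensorFlow API):

```python
def shard(records, num_shards, index):
    # Keep every num_shards-th record, starting at position `index`,
    # as tf.data.Dataset.shard(num_shards, index) does.
    return records[index::num_shards]

records = list(range(10))

# With a single shard, every run sees the full stream in a fixed order.
print(shard(records, 1, 0))   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# With two shards, each reader sees a different half; reading the
# shards in parallel can merge them in a different order each run.
print(shard(records, 2, 0))   # [0, 2, 4, 6, 8]
print(shard(records, 2, 1))   # [1, 3, 5, 7, 9]
```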
4. GRADIENT GATING
For deterministic functionality, some types of models may require gate_gradients=tf.train.Optimizer.GATE_OP. Note that this is an argument to the optimizer's minimize (or compute_gradients) call, not to the session config:
train_op = optimizer.minimize(loss, gate_gradients=tf.train.Optimizer.GATE_OP)
5. HOROVOD
If you are doing multi-GPU training with Horovod, disable tensor fusion:
os.environ['HOROVOD_FUSION_THRESHOLD'] = '0'
To more clearly check for determinism between runs, I recommend the method I have documented here. I also recommend using this approach to confirm that the initial weights (before step one of training) are exactly the same between runs.
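One simple way to check that weights match bit-for-bit between runs (a sketch of the general idea, not the specific method from the link above): snapshot the weights as NumPy arrays and compare a digest over them. In TF 1.x the arrays would come from sess.run(tf.trainable_variables()); here two identically seeded draws stand in for two training runs.

```python
import hashlib
import numpy as np

def weights_digest(weights):
    # Hash a list of weight arrays into one hex digest; identical
    # digests across runs mean bit-identical weights.
    h = hashlib.sha256()
    for w in weights:
        h.update(np.ascontiguousarray(w).tobytes())
    return h.hexdigest()

# Simulate two runs with identical seeding.
np.random.seed(123)
run1 = [np.random.randn(3, 3), np.random.randn(3)]
np.random.seed(123)
run2 = [np.random.randn(3, 3), np.random.randn(3)]

print(weights_digest(run1) == weights_digest(run2))  # True
```

Comparing digests before step one of training confirms identical initialization; comparing them after each step localizes where determinism first breaks.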
For the latest information on determinism in TensorFlow (with a focus on determinism when using GPUs), please take a look at the tensorflow-determinism project, which NVIDIA is kindly paying me to drive.
Source: https://stackoverflow.com/questions/59032574/how-to-ensure-neural-net-performance-comparability