Intermediate layer makes TensorFlow optimizer stop working

时光取名叫无心 2020-12-05 06:08

This graph trains a simple signal identity encoder, and in fact shows that the weights are being evolved by the optimizer:

import tensorflow as tf
import num         


        
2 answers
  • 2020-12-05 06:16

    TL;DR: the deeper a neural network becomes, the more attention you should pay to the gradient flow (see this discussion of "vanishing gradients"). One particular case is variable initialization.


    Problem analysis

    I've added tensorboard summaries for the variables and gradients into both of your scripts and got the following:

    2-layer network

    3-layer network

    The charts show the distribution of the W:0 variable (the first layer) and how it changes from epoch 0 to epoch 1000. Indeed, the rate of change is much higher in the 2-layer network. But note the gradient distribution, which is much closer to 0 in the 3-layer network: the variance is around 0.005 in the first case and around 0.000002 in the second, i.e. roughly 2500 times smaller. This is the vanishing gradient problem.
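To see why one extra layer can shrink the first-layer gradient so much, here is a minimal sketch, assuming a chain of scalar linear layers (no biases, no activations) with hypothetical weights of 0.05. It is not the network from the question; it only illustrates how the chain rule multiplies one more small factor into the gradient for each additional layer.

```python
def first_layer_gradient(weights, x):
    """Gradient of output = w_n * ... * w_2 * w_1 * x with respect to w_1.

    By the chain rule this is x multiplied by every weight after the first.
    """
    grad = x
    for w in weights[1:]:
        grad *= w
    return grad


# Hypothetical small weights, similar in spirit to a small random init.
x = 1.0
grad_2_layers = first_layer_gradient([0.05, 0.05], x)
grad_3_layers = first_layer_gradient([0.05, 0.05, 0.05], x)

print(grad_2_layers)
print(grad_3_layers)
print(grad_2_layers / grad_3_layers)
```

Every additional layer whose weight magnitude is below 1 scales the first-layer gradient down by that factor again, which matches the kind of shrinkage visible in the histograms.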

    Here's the helper code for the summaries, if you're interested:

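    # grads_and_vars: list of (gradient, variable) pairs,
    # presumably obtained from optimizer.compute_gradients(loss)
    for g, v in grads_and_vars:
      tf.summary.histogram(v.name, v)            # variable values
      tf.summary.histogram(v.name + '_grad', g)  # gradient values
    
    # Merge all summaries into one op and open a writer for the log directory
    merged = tf.summary.merge_all()
    writer = tf.summary.FileWriter('train_log_layer2', tf.get_default_graph())
    
    ...
    
    # Inside the training loop: run the train op together with the summaries
    _, summary = sess.run([train_op, merged], feed_dict={I: 2*np.random.rand(1, 1)-1})
    if i % 10 == 0:
      # Log every 10th step to keep the event file small
      writer.add_summary(summary, global_step=i)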
    

    Solution

    All deep networks suffer from this to some extent and there is no universal solution that will auto-magically fix any network. But there are some techniques that can push it in the right direction. Initialization is one of them.

    I replaced your normal initialization with:

    W_init = tf.contrib.layers.xavier_initializer()
    b_init = tf.constant_initializer(0.1)
    

    There are lots of tutorials on Xavier init, you can take a look at this one, for example. Note that I set the bias init to be slightly positive to make sure that the ReLU outputs are positive for most neurons, at least in the beginning.
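As a rough illustration of what Xavier (Glorot) initialization does, here is a minimal pure-Python sketch of the uniform variant, which draws weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The function name `xavier_uniform` and the layer sizes are hypothetical; this is not the TensorFlow implementation, only the scaling rule behind it.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Return a fan_in x fan_out weight matrix drawn from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)   # fixed seed so the sketch is reproducible
fan_in, fan_out = 2, 4   # hypothetical layer sizes
limit = math.sqrt(6.0 / (fan_in + fan_out))
W = xavier_uniform(fan_in, fan_out, rng)

print(limit)
print(all(-limit <= w <= limit for row in W for w in row))
```

The limit shrinks as the layers get wider, so the weight variance, 2 / (fan_in + fan_out), is tied to the layer size instead of being a fixed constant.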

    The new initialization changed the picture immediately:

    The weights are still not moving quite as fast as in the 2-layer network, but they are moving (note the scale of the W:0 values), and the gradient distribution is much less peaked at 0, which is much better.

    Of course, this is not the end. To improve it further, you should implement the full autoencoder, because currently the loss depends only on the reconstruction of the [0,0] element, so most outputs aren't used in the optimization. You can also experiment with different optimizers (Adam would be my choice) and learning rates.
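To illustrate the point about the loss, here is a minimal sketch comparing a loss on the first element alone with a mean squared error over all outputs. The values and function names are hypothetical, and plain Python lists stand in for tensors.

```python
def single_element_grad(outputs, targets):
    """Gradient of (outputs[0] - targets[0])**2 with respect to each output."""
    grads = [0.0] * len(outputs)
    grads[0] = 2.0 * (outputs[0] - targets[0])
    return grads


def full_mse_grad(outputs, targets):
    """Gradient of mean((outputs - targets)**2) with respect to each output."""
    n = len(outputs)
    return [2.0 * (o - t) / n for o, t in zip(outputs, targets)]


outputs = [0.5, 0.0, 1.0, -1.0]   # hypothetical decoder outputs
targets = [1.0, 1.0, 0.0, 0.0]    # hypothetical inputs to reconstruct

print(single_element_grad(outputs, targets))
print(full_mse_grad(outputs, targets))
```

With the single-element loss only the first output receives a gradient, while with the full loss every output contributes. In TensorFlow the full loss would typically be written as `tf.reduce_mean(tf.square(O - I))`, assuming `O` is the output tensor (a hypothetical name here) and `I` is the input placeholder.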

  • 2020-12-05 06:37

    That looks very exciting. Where exactly does this code belong? I've only recently discovered TensorBoard.

    Does this go in a callback somehow:

    for g, v in grads_and_vars:
      tf.summary.histogram(v.name, v)
      tf.summary.histogram(v.name + '_grad', g)
    
    merged = tf.summary.merge_all()
    writer = tf.summary.FileWriter('train_log_layer2', tf.get_default_graph())
    

    And does this go after fitting:

    _, summary = sess.run([train_op, merged], feed_dict={I: 2*np.random.rand(1, 1)-1})
    if i % 10 == 0:
      writer.add_summary(summary, global_step=i)
    