How to update model parameters with accumulated gradients?

孤独总比滥情好 2020-12-05 05:14

I'm using TensorFlow to build a deep learning model, and I'm new to TensorFlow.

Due to some reason, my model can only use a limited batch size, and this limited batch size makes the gradients noisy. How can I accumulate the gradients over several batches and then use them to update the model parameters?

6 Answers
  • 2020-12-05 05:56

    I had the same problem and just figured it out.

    First get symbolic gradients, then define accumulated gradients as tf.Variables. (It seems that tf.global_variables_initializer() has to be run before defining grads_accum. I got errors otherwise, not sure why.)

    tvars = tf.trainable_variables()
    optimizer = tf.train.GradientDescentOptimizer(lr)
    # Symbolic gradients of the cost w.r.t. the trainable variables
    grads = tf.gradients(cost, tvars)

    # initialize
    tf.local_variables_initializer().run()
    tf.global_variables_initializer().run()

    # One accumulator variable per gradient
    grads_accum = [tf.Variable(tf.zeros_like(v)) for v in grads]
    # apply_gradients treats the accumulator variables as the gradients to apply
    update_op = optimizer.apply_gradients(zip(grads_accum, tvars))
    

    In training you can accumulate the gradients (saved in the Python list gradients_accum) at each batch, and update the model after running the 64th batch (a sketch of the accumulation loop follows the snippet):

    feed_dict = dict()
    for i, _grads in enumerate(gradients_accum):
        feed_dict[grads_accum[i]] = _grads
    sess.run(fetches=[update_op], feed_dict=feed_dict) 
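
    The snippet above assumes gradients_accum already holds the summed gradient values. Here is a minimal sketch of the accumulation loop that could produce it, assuming sess and grads from above; the placeholders x and y_ and the helper next_batch() are hypothetical:

    gradients_accum = None
    for step in range(64):
        batch_x, batch_y = next_batch()  # hypothetical batch-fetching helper
        batch_grads = sess.run(grads, feed_dict={x: batch_x, y_: batch_y})
        if gradients_accum is None:
            gradients_accum = batch_grads
        else:
            # Element-wise sum of this batch's gradients into the running totals
            gradients_accum = [a + g for a, g in zip(gradients_accum, batch_grads)]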
    

    You can refer to tensorflow/tensorflow/python/training/optimizer_test.py for example usage, particularly this function: testGradientsAsVariables().

    Hope it helps.

  • 2020-12-05 06:05

    The previous solutions do not compute the average of the accumulated gradients, which may lead to instability in training. I've modified the above code, which should solve this problem.

    # Fetch a list of our network's trainable parameters.
    trainable_vars = tf.trainable_variables()
    
    # Create variables to store accumulated gradients
    accumulators = [
        tf.Variable(
            tf.zeros_like(tv.initialized_value()),
            trainable=False
        ) for tv in trainable_vars
    ]
    
    # Create a variable for counting the number of accumulations
    accumulation_counter = tf.Variable(0.0, trainable=False)
    
    # Compute gradients; grad_pairs contains (gradient, variable) pairs
    grad_pairs = optimizer.compute_gradients(loss, trainable_vars)
    
    # Create operations which add a variable's gradient to its accumulator.
    accumulate_ops = [
        accumulator.assign_add(
            grad
        ) for (accumulator, (grad, var)) in zip(accumulators, grad_pairs)
    ]
    
    # The final accumulation operation is to increment the counter
    accumulate_ops.append(accumulation_counter.assign_add(1.0))
    
    # Update trainable variables by applying the accumulated gradients
    # divided by the counter. Note: apply_gradients takes in a list of 
    # (grad, var) pairs
    train_step = optimizer.apply_gradients(
        [(accumulator / accumulation_counter, var) \
            for (accumulator, (grad, var)) in zip(accumulators, grad_pairs)]
    )
    
    # Accumulators must be zeroed once the accumulated gradient is applied.
    zero_ops = [
        accumulator.assign(
            tf.zeros_like(tv)
        ) for (accumulator, tv) in zip(accumulators, trainable_vars)
    ]
    
    # Add one last op for zeroing the counter
    zero_ops.append(accumulation_counter.assign(0.0))
    

    This code is used in the same manner as that provided by @weixsong.
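
    For reference, a minimal usage sketch (assuming a sess and some way to build each mini-batch's feed_dict; the names n_minibatches and get_feed are hypothetical):

    sess.run(zero_ops)
    for i in range(n_minibatches):
        # get_feed(i) is a hypothetical helper returning the i-th mini-batch feed
        sess.run(accumulate_ops, feed_dict=get_feed(i))
    # A single update with the averaged accumulated gradients
    sess.run(train_step)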

  • 2020-12-05 06:09

    Tensorflow 2.0 Compatible Answer: In line with weixsong's answer above and the explanation provided on the TensorFlow website, below is the code for accumulating gradients in TensorFlow 2.0:

    # Create the accumulators once so they persist across batches
    tvs = mnist_model.trainable_variables
    accum_vars = [tf.Variable(tf.zeros_like(tv), trainable=False) for tv in tvs]

    def train(epochs):
      for epoch in range(epochs):
        # Zero the accumulators at the start of every epoch
        for av in accum_vars:
          av.assign(tf.zeros_like(av))
        for (batch, (images, labels)) in enumerate(dataset):
          with tf.GradientTape() as tape:
            logits = mnist_model(images, training=True)
            loss_value = loss_object(labels, logits)

          loss_history.append(loss_value.numpy().mean())
          grads = tape.gradient(loss_value, tvs)
          # Add this batch's gradients to the accumulators
          for av, grad in zip(accum_vars, grads):
            av.assign_add(grad)

        # Apply the accumulated gradients once per epoch
        optimizer.apply_gradients(zip(accum_vars, tvs))
        print('Epoch {} finished'.format(epoch))

    # call the above function
    train(epochs = 3)

    Complete code can be found in this Github Gist.

  • 2020-12-05 06:11

    The method you posted seems to fail if I don't pass the feed_dict again in sess.run(train_step). I don't know why the feed_dict is required there, but it looks like it runs all the accumulation ops again with the last example repeated. This is what I had to do in my case:

    self.session.run(zero_ops)
    for i in range(0, mini_batch):
        self.session.run(accum_ops, feed_dict={self.ph_X: imgs_feed[np.newaxis, i, :, :, :],
                                               self.ph_Y: flow_labels[np.newaxis, i, :, :, :],
                                               self.keep_prob: self.dropout})

    self.session.run(norm_accums, feed_dict={self.ph_X: imgs_feed[np.newaxis, i, :, :, :],
                                             self.ph_Y: flow_labels[np.newaxis, i, :, :, :],
                                             self.keep_prob: self.dropout})
    self.session.run(train_op, feed_dict={self.ph_X: imgs_feed[np.newaxis, i, :, :, :],
                                          self.ph_Y: flow_labels[np.newaxis, i, :, :, :],
                                          self.keep_prob: self.dropout})
    

    And to normalize the gradient, I understand it is just a matter of dividing the accumulated gradient by the batch size, so I only add a new op:

    norm_accums = [accum_op/float(batchsize) for accum_op in accum_ops]
    

    Did anyone have this same issue with feed_dict?

    *UPDATE: As I suspected, this is wrong: it runs the whole graph again with the last example of the batch. This little snippet tests that:

    import numpy as np
    import tensorflow as tf

    ph = tf.placeholder(dtype=tf.float32, shape=[])
    var_accum = tf.get_variable("acum", shape=[],
                                initializer=tf.zeros_initializer())
    acum = tf.assign_add(var_accum, ph)
    divide = acum / 5.0
    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        sess.run(init)
        for i in range(5):
            sess.run(acum, feed_dict={ph: 2.0})

        c = sess.run([divide], feed_dict={ph: 2.0})
        # 10/5 = 2
        print(c)
        # but it gives 2.4, that is 12/5, so it sums one more time
    

    I figured out how to solve this. TensorFlow has conditional operations, so I put the accumulation in one branch and the last accumulation plus normalization and update in another branch. My code is a mess, but as a quick check of what I'm saying, here is a little usage example:

    import numpy as np
    import tensorflow as tf
    
    ph = tf.placeholder(dtype=tf.float32, shape=[])
    #placeholder for conditional branching in the graph
    condph = tf.placeholder(dtype=tf.bool, shape=[])
    
    var_accum = tf.get_variable("acum", shape=[], initializer=tf.zeros_initializer())
    
    accum_op = tf.assign_add(var_accum, ph)
    
    #function when condition of condph is True
    def truefn():
       return accum_op
    #function when condition of condph is False
    def falsefn():
       div = accum_op/5.0
       return div
    
    #return the conditional operation
    cond = tf.cond(condph, truefn, falsefn)
    
    init = tf.global_variables_initializer()
    
    with tf.Session() as sess:
       sess.run(init)
       for i in range(4):
           #run only accumulation
           sess.run(cond, feed_dict={ph: 2.0, condph: True})
       #run accumulation and division
       c = sess.run(cond, feed_dict={ph: 2.0, condph: False})
    
    print(c)
    #now gives 2
    

    *IMPORTANT NOTE: Forget everything above; it didn't work. The optimizers raise an error.

  • 2020-12-05 06:11

    You can use PyTorch instead of TensorFlow, as it allows the user to accumulate gradients during training.
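
    For illustration, a minimal PyTorch sketch (every name here is hypothetical, not from the question): gradients accumulate in each parameter's .grad buffer as long as you don't call zero_grad() between backward passes.

    import torch

    # Hypothetical toy setup; any model, loss, optimizer, and data work here.
    model = torch.nn.Linear(10, 1)
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loader = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(128)]

    accum_steps = 64  # apply one update every 64 mini-batches
    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        loss = criterion(model(x), y) / accum_steps  # scale so the sum is an average
        loss.backward()        # gradients accumulate in .grad across calls
        if (i + 1) % accum_steps == 0:
            optimizer.step()   # apply the accumulated (averaged) gradients
            optimizer.zero_grad()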

  • 2020-12-05 06:15

    I found a solution here: https://github.com/tensorflow/tensorflow/issues/3994#event-766328647

    opt = tf.train.AdamOptimizer()
    tvs = tf.trainable_variables()
    # One non-trainable accumulator variable per trainable variable
    accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in tvs]
    # Ops that reset the accumulators to zero
    zero_ops = [tv.assign(tf.zeros_like(tv)) for tv in accum_vars]
    gvs = opt.compute_gradients(rmse, tvs)
    # Ops that add the current batch's gradient to each accumulator
    accum_ops = [accum_vars[i].assign_add(gv[0]) for i, gv in enumerate(gvs)]
    # Apply the accumulated gradients to the variables
    train_step = opt.apply_gradients([(accum_vars[i], gv[1]) for i, gv in enumerate(gvs)])
    

    In the training loop:

    while True:
        sess.run(zero_ops)
        for i in range(n_minibatches):
            sess.run(accum_ops, feed_dict={X: Xs[i], y: ys[i]})
        sess.run(train_step)
    

    But this code doesn't seem very clean and pretty; does anyone know how to optimize it?
