Is my implementation of stochastic gradient descent correct?

梦如初夏 2020-12-28 22:20

I am trying to develop stochastic gradient descent, but I don't know if it is 100% correct.

  • The cost generated by my stochastic gradient descent algorithm is
3 Answers
  • 2020-12-28 22:50

    The learning rate is typically between 0 and 1. If you set the learning rate very high, the updates overshoot the minimum, so the iterates follow the descent direction only loosely. So take a small learning rate even though it takes more time; the result will be more convincing.
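
    A minimal sketch (added for illustration, not the asker's code) of this overshooting effect, using the one-dimensional objective f(x) = x^2 whose gradient is 2x:

```python
def gradient_descent(alpha, steps=50, x0=5.0):
    """Run plain gradient descent on f(x) = x^2 and return the final x."""
    x = x0
    for _ in range(steps):
        x -= alpha * 2 * x  # gradient of x^2 is 2x
    return x

x_small = gradient_descent(alpha=0.1)  # each step multiplies x by 0.8: converges to 0
x_large = gradient_descent(alpha=1.1)  # each step multiplies x by -1.2: diverges
```

    With alpha = 0.1 the iterate shrinks toward the minimum at 0; with alpha = 1.1 the update factor has magnitude greater than 1 and the iterate blows up.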

  • 2020-12-28 22:56

    There is a reason for the small value of the learning rate. Briefly, when the learning rate decreases at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins-Siegmund theorem.

    Robbins, Herbert; Siegmund, David O. (1971). "A convergence theorem for non-negative almost supermartingales and some applications". In Rustagi, Jagdish S. (ed.), Optimizing Methods in Statistics. Academic Press.
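
    A sketch (added for illustration) of such a decreasing schedule, alpha_t = alpha0 / (1 + t), which satisfies the usual Robbins-Monro conditions (the alpha_t sum to infinity while their squares sum to a finite value). It estimates the mean of a sample by minimising the convex objective f(theta) = E[(theta - x)^2 / 2] with stochastic gradients (theta - x):

```python
import random

def sgd_mean(samples, alpha0=1.0, epochs=200, seed=0):
    """SGD with decaying learning rate on f(theta) = E[(theta - x)^2 / 2]."""
    rng = random.Random(seed)
    theta = 0.0
    t = 0
    for _ in range(epochs):
        for x in rng.sample(samples, len(samples)):  # shuffled pass over the data
            alpha = alpha0 / (1 + t)      # decreasing step size
            theta -= alpha * (theta - x)  # stochastic gradient step
            t += 1
    return theta

theta_hat = sgd_mean([1.0, 2.0, 3.0, 4.0])  # approaches the sample mean 2.5
```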

  • 2020-12-28 23:01

    This is pretty much ok. If you are worried about choosing the appropriate learning rate alpha, you should think about applying a line search method.

    Line search is a method which chooses an optimal learning rate for gradient descent at every iteration, which is better than using a fixed learning rate throughout the whole optimization process. The optimal value for the learning rate alpha is the one which locally (from the current theta, in the direction of the negative gradient) minimizes the cost function.

    At each iteration of the gradient descent, start from the learning rate alpha = 0 and gradually increase alpha by a fixed step, for example deltaAlpha = 0.01. Recalculate the parameters theta and evaluate the cost function. Since the cost function is convex along the search direction, as alpha increases (that is, as you move further in the direction of the negative gradient) the cost function will first decrease and then (at some point) start increasing. At that moment, stop the line search and take the last alpha before the cost function started increasing. Now update the parameters theta with that alpha. If the cost function never starts increasing, stop at alpha = 1.
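
    The loop above can be sketched as follows (illustrative names; `cost` and `gradient` stand for whatever convex cost function and gradient you are using):

```python
def line_search_step(theta, cost, gradient, delta_alpha=0.01, alpha_max=1.0):
    """Increase alpha from 0 in steps of delta_alpha and return the theta
    obtained with the last alpha before the cost started increasing."""
    g = gradient(theta)
    best_theta, best_cost = theta, cost(theta)
    alpha = delta_alpha
    while alpha <= alpha_max:
        candidate = theta - alpha * g
        c = cost(candidate)
        if c > best_cost:  # cost started increasing: stop the line search
            break
        best_theta, best_cost = candidate, c
        alpha += delta_alpha
    return best_theta

# One step on f(theta) = (theta - 3)^2 from theta = 0; the locally optimal
# step lands at the minimum theta = 3.
theta_new = line_search_step(0.0, lambda t: (t - 3) ** 2, lambda t: 2 * (t - 3))
```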

    Note: For big regularization factors (lambda = 100, lambda = 1000) it is possible that deltaAlpha is too big and that gradient descent diverges. If that is the case, decrease deltaAlpha by a factor of 10 (deltaAlpha = 0.001, then deltaAlpha = 0.0001) until you reach a deltaAlpha for which gradient descent converges.

    Also, you should think about using a terminating condition other than the number of iterations, e.g. stopping when the difference between the cost function values in two subsequent iterations becomes small enough (less than some epsilon).
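
    A sketch of that stopping criterion (illustrative, not the asker's code), wrapped around plain gradient descent with a fixed alpha:

```python
def minimize_until_converged(theta, cost, gradient, alpha=0.1,
                             epsilon=1e-8, max_iters=10_000):
    """Iterate until the cost improvement drops below epsilon;
    return the final theta and the number of iterations used."""
    prev = cost(theta)
    for i in range(max_iters):
        theta = theta - alpha * gradient(theta)
        curr = cost(theta)
        if abs(prev - curr) < epsilon:  # improvement is negligible: stop
            return theta, i + 1
        prev = curr
    return theta, max_iters

# On f(theta) = (theta - 2)^2 this stops after a few dozen iterations,
# long before the max_iters safety cap.
theta_star, iters = minimize_until_converged(
    0.0, lambda t: (t - 2) ** 2, lambda t: 2 * (t - 2))
```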
