What are alternatives to Gradient Descent?

失恋的感觉 2021-01-31 22:19

Gradient descent has the problem of local minima: we may need to restart it from exponentially many starting points to find the global minimum.

Can anybody tell me about any alternatives to gradient descent?
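
For concreteness, a minimal sketch of plain gradient descent on a 1-D toy function with two minima; the starting point alone decides whether it ends up in the local or the global minimum:

```python
import numpy as np

def grad(x):
    # derivative of f(x) = x^4 - 3x^2 + x, which has a global minimum
    # near x ≈ -1.30 and a local minimum near x ≈ 1.13
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=1000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(gradient_descent(x0=-2.0))  # ≈ -1.30, the global minimum
print(gradient_descent(x0=2.0))   # ≈  1.13, stuck in the local minimum
```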

5 Answers
  • 2021-01-31 22:59

    It has been demonstrated that getting stuck in a local minimum is very unlikely in a high-dimensional space, because that would require the loss to curve upwards in every single dimension at a point where all derivatives are zero; in high dimensions such critical points are far more likely to be saddle points (source: Andrew Ng, Coursera Deep Learning Specialization). That also explains why gradient descent works so well in practice.

  • 2021-01-31 23:01

    See my master's thesis for a very similar list:

    Optimization algorithms for neural networks

    • Gradient based
      • Flavours of gradient descent (only first order gradient):
        • Stochastic gradient descent
        • Mini-Batch gradient descent
        • Learning Rate Scheduling:
          • Momentum (sketched, along with Adam, after this list)
          • RProp and the mini-batch version RMSProp
          • AdaGrad
          • Adadelta (paper)
          • Exponential Decay Learning Rate
          • Performance Scheduling
          • Newbob Scheduling
        • Quickprop
        • Nesterov Accelerated Gradient (NAG)
      • Higher order gradients
        • Newton's method: Typically not possible
        • Quasi-Newton method
          • BFGS
          • L-BFGS
      • Unsure how it works
        • Adam (Adaptive Moment Estimation)
          • AdaMax
        • Conjugate gradient
    • Alternatives
      • Genetic algorithms
      • Simulated Annealing
      • Twiddle
      • Markov random fields (graphcut/mincut)
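
    As a rough illustration (a sketch of mine, not taken from the thesis), the momentum and Adam updates from the list look roughly like this in NumPy; the gradient function and hyperparameter values are placeholders:

    ```python
    import numpy as np

    def sgd_momentum(grad_fn, w, lr=0.01, beta=0.9, steps=500):
        v = np.zeros_like(w)
        for _ in range(steps):
            v = beta * v + grad_fn(w)   # accumulate a velocity of past gradients
            w = w - lr * v
        return w

    def adam(grad_fn, w, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
        m = np.zeros_like(w)            # first-moment (mean) estimate
        v = np.zeros_like(w)            # second-moment (uncentered variance) estimate
        for t in range(1, steps + 1):
            g = grad_fn(w)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g**2
            m_hat = m / (1 - beta1**t)  # bias correction
            v_hat = v / (1 - beta2**t)
            w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w

    # Toy usage: minimize ||w||^2, whose gradient is 2w; both drive w towards 0.
    w0 = np.array([1.0, -2.0, 3.0])
    print(sgd_momentum(lambda w: 2 * w, w0))
    print(adam(lambda w: 2 * w, w0, lr=0.01, steps=2000))
    ```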

    You might also want to have a look at my article about optimization basics and at Alec Radford's nice GIFs of the optimizers in action.

    Other interesting resources are:

    • An overview of gradient descent optimization algorithms

    Trade-Offs

    I think each of the posted optimization algorithms has scenarios in which it has an advantage. The general trade-offs are:

    • How much of an improvement do you get in one step?
    • How fast can you calculate one step?
    • How much data can the algorithm deal with?
    • Is it guaranteed to find a local minimum?
    • What requirements does the optimization algorithm have for your function? (e.g. to be once, twice or three times differentiable)
  • 2021-01-31 23:10

    Extreme Learning Machines: essentially, they are a neural network where the weights connecting the inputs to the hidden nodes are assigned randomly and never updated. The weights between the hidden nodes and the outputs are learned in a single step by solving a linear least-squares problem (e.g. with the Moore-Penrose pseudo-inverse).
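
    A minimal sketch of that recipe, assuming a single tanh hidden layer and a least-squares solve via the pseudo-inverse (the data and sizes are made up):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy regression data: y = sin(x) plus a little noise.
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=200)

    n_hidden = 50
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never trained
    b = rng.normal(size=n_hidden)                # random hidden biases

    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta = np.linalg.pinv(H) @ y                 # output weights from one linear solve

    y_hat = H @ beta
    print("train MSE:", np.mean((y - y_hat) ** 2))
    ```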

  • 2021-01-31 23:14

    This is more a problem with the function being minimized than with the method used. If finding the true global minimum is important, use a method such as simulated annealing. It will be able to find the global minimum, but may take a very long time to do so.
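
    A rough sketch of that idea (the proposal width and cooling schedule are arbitrary choices, not recommendations): worse moves are accepted with a probability that shrinks as the temperature drops, which lets the search climb out of local minima at the cost of many function evaluations.

    ```python
    import math
    import random

    def f(x):
        return x**4 - 3 * x**2 + x   # global minimum near -1.30, local minimum near 1.13

    def simulated_annealing(x0, temp=2.0, cooling=0.999, steps=20000):
        x, fx = x0, f(x0)
        best_x, best_f = x, fx
        for _ in range(steps):
            candidate = x + random.gauss(0, 0.5)              # random neighbour
            fc = f(candidate)
            if fc < fx or random.random() < math.exp((fx - fc) / temp):
                x, fx = candidate, fc                         # accept, possibly uphill
            if fx < best_f:
                best_x, best_f = x, fx
            temp *= cooling                                   # cool down
        return best_x

    print(simulated_annealing(x0=2.0))  # usually lands near the global minimum at -1.30
    ```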

    In the case of neural nets, local minima are not necessarily that much of a problem. Some of the local minima are due to the fact that you can get a functionally identical model by permuting the hidden-layer units, or by negating the input and output weights of a unit, etc. Also, if a local minimum is only slightly non-optimal, then the difference in performance will be minimal and so it won't really matter. Lastly, and this is an important point, the key problem in fitting a neural network is over-fitting, so aggressively searching for the global minimum of the cost function is likely to result in overfitting and a model that performs poorly.

    Adding a regularisation term, e.g. weight decay, can help to smooth out the cost function, which can reduce the problem of local minima a little, and is something I would recommend anyway as a means of avoiding overfitting.
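
    For illustration, a single gradient-descent step with an L2 weight-decay term added to the cost (the decay coefficient is a placeholder):

    ```python
    import numpy as np

    def gd_step_with_weight_decay(w, grad_loss, lr=0.01, weight_decay=1e-4):
        # gradient of  loss(w) + (weight_decay / 2) * ||w||^2  is  grad_loss(w) + weight_decay * w
        return w - lr * (grad_loss(w) + weight_decay * w)

    w = np.array([1.0, -2.0])
    w = gd_step_with_weight_decay(w, grad_loss=lambda w: 2 * w)  # e.g. loss = ||w||^2
    print(w)
    ```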

    The best method, however, of avoiding local minima in neural networks is to use a Gaussian process model (or a radial basis function neural network), which has fewer problems with local minima.

  • 2021-01-31 23:16

    Local minima are a property of the solution space, not of the optimization method, and they are a problem with neural networks in general. Convex methods, such as SVMs, have gained in popularity largely because of this.
