Gradient descent has a problem with local minima. We would need to run gradient descent exponentially many times to find the global minimum.
Can anybody tell me about any alternatives to gradient descent?
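For concreteness, here is a minimal sketch of what I mean (a toy 1D function and step size picked just for illustration), where plain gradient descent converges to a local minimum rather than the global one:

```python
import numpy as np

# Non-convex example: f(x) = x^4 - 3x^2 + x has a local minimum near x ~ 1.1
# and a lower (global) minimum near x ~ -1.3.
def f(x):
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

x = 2.0          # start on the "wrong" side of the hump
lr = 0.01        # step size
for _ in range(1000):
    x -= lr * grad_f(x)

print(x, f(x))   # converges to the local minimum near 1.1, not the global one
```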
It has been argued that getting stuck in a local minimum is very unlikely in a high-dimensional space, because for a point with zero gradient to be a local minimum the cost function would have to curve upwards in every single dimension, which is unlikely when there are many dimensions; most zero-gradient points are saddle points instead. (Source: Andrew Ng, Coursera Deep Learning Specialization.) That also explains why gradient descent works so well in practice.
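A rough way to see that intuition in code (my own toy simulation, under the simplifying assumption that the Hessian at a zero-gradient point behaves like a random symmetric matrix): the fraction of such matrices that are positive definite, i.e. that would correspond to a local minimum rather than a saddle point, collapses quickly as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_of_minima(dim, trials=2000):
    """Fraction of random symmetric 'Hessians' whose eigenvalues are all positive."""
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((dim, dim))
        hessian = (a + a.T) / 2          # symmetric matrix standing in for the Hessian
        if np.all(np.linalg.eigvalsh(hessian) > 0):
            count += 1
    return count / trials

for dim in (1, 2, 5, 10):
    print(dim, fraction_of_minima(dim))   # drops rapidly towards zero as dim grows
```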
See my master's thesis for a very similar list:
You might also want to have a look at my article about optimization basics and at Alec Radford's nice GIFs (1 and 2).
Other interesting resources are:
I think all of the posted optimization algorithms have some scenarios where they have advantages. The general trade-offs are:
Extreme Learning Machines: essentially, they are a neural network where the weights connecting the inputs to the hidden nodes are assigned randomly and never updated. The weights between the hidden nodes and the outputs are learned in a single step by solving a linear least-squares problem, e.g. using the matrix pseudoinverse.
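A minimal sketch of the idea (a toy NumPy implementation with an arbitrary hidden-layer size and a tanh activation, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = sin(x) + noise
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

n_hidden = 50

# Input-to-hidden weights are random and never updated.
W_in = rng.standard_normal((X.shape[1], n_hidden))
b_in = rng.standard_normal(n_hidden)
H = np.tanh(X @ W_in + b_in)            # hidden-layer activations

# Hidden-to-output weights in one step via the pseudoinverse (least squares).
W_out = np.linalg.pinv(H) @ y

# Prediction on new inputs
X_test = np.linspace(-3, 3, 10).reshape(-1, 1)
y_pred = np.tanh(X_test @ W_in + b_in) @ W_out
print(y_pred)
```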
This is more a problem with the function being minimized than with the method used. If finding the true global minimum is important, then use a method such as simulated annealing. It will be able to find the global minimum, but may take a very long time to do so.
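As a rough sketch of what simulated annealing looks like (toy objective, proposal scale, and cooling schedule chosen arbitrarily, not tuned):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Non-convex: several local minima, global minimum near x ~ -0.5
    return x**2 + 10 * np.sin(3 * x)

x = 4.0                      # current state
best_x, best_f = x, f(x)
temperature = 5.0

for step in range(20000):
    candidate = x + rng.normal(scale=0.5)      # random proposal
    delta = f(candidate) - f(x)
    # Always accept improvements; accept worse moves with probability exp(-delta/T)
    if delta < 0 or rng.random() < np.exp(-delta / temperature):
        x = candidate
        if f(x) < best_f:
            best_x, best_f = x, f(x)
    temperature *= 0.9995                      # slow cooling

print(best_x, best_f)   # typically ends up near the global minimum around x ~ -0.5
```

Because worse moves are sometimes accepted while the temperature is still high, the search can climb back out of local minima, at the cost of many more function evaluations than gradient descent.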
In the case of neural nets, local minima are not necessarily that much of a problem. Some of the local minima are due to the fact that you can get a functionally identical model by permuting the hidden layer units, or by negating a hidden unit's input and output weights, etc. Also, if a local minimum is only slightly non-optimal, then the difference in performance will be minimal and so it won't really matter. Lastly, and this is an important point, the key problem in fitting a neural network is over-fitting, so aggressively searching for the global minimum of the cost function is likely to result in overfitting and a model that performs poorly.
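The symmetry argument is easy to verify directly; here is a small sketch (a toy one-hidden-layer tanh network with made-up weights) showing that reordering the hidden units together with their outgoing weights, or flipping all the relevant signs, leaves the network function unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer network: y = tanh(x @ W1 + b1) @ w2
W1 = rng.standard_normal((3, 4))
b1 = rng.standard_normal(4)
w2 = rng.standard_normal(4)

def forward(x, W1, b1, w2):
    return np.tanh(x @ W1 + b1) @ w2

# Permute the hidden units and their outgoing weights in the same way.
perm = np.array([2, 0, 3, 1])
W1_p, b1_p, w2_p = W1[:, perm], b1[perm], w2[perm]

x = rng.standard_normal((5, 3))
print(np.allclose(forward(x, W1, b1, w2), forward(x, W1_p, b1_p, w2_p)))  # True

# Negating incoming weights, biases, and outgoing weights also leaves the
# function unchanged, because tanh is an odd function.
print(np.allclose(forward(x, W1, b1, w2), forward(x, -W1, -b1, -w2)))     # True
```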
Adding a regularisation term, e.g. weight decay, can help to smooth out the cost function, which can reduce the problem of local minima a little, and is something I would recommend anyway as a means of avoiding overfitting.
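Concretely, weight decay just adds a penalty proportional to the squared weights to the cost, which shows up as an extra shrinkage term in each gradient step. A sketch (with an arbitrary decay coefficient, not a full training loop):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_loss, lr=0.01, weight_decay=1e-4):
    """One gradient step on loss(w) + weight_decay * ||w||^2 / 2."""
    return w - lr * (grad_loss + weight_decay * w)

# Example: gradient of a squared-error loss for a linear model
w = np.zeros(3)
X = np.array([[1.0, 2.0, 0.5], [0.3, 1.0, 2.0]])
y = np.array([1.0, -1.0])
grad = X.T @ (X @ w - y) / len(y)
w = sgd_step_with_weight_decay(w, grad)
print(w)
```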
However, the best way of avoiding local minima in neural networks is to use a Gaussian Process model (or a Radial Basis Function neural network), both of which have fewer problems with local minima.
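For reference, here is a minimal Gaussian Process regression sketch (with fixed, hand-picked kernel hyperparameters; in practice these are usually tuned by maximising the marginal likelihood). The point is that, given the hyperparameters, the fit is a single linear solve, so there is no iterative weight optimisation that could get stuck in a local minimum:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(A, B, length_scale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

# Toy data
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)

noise = 0.1**2
K = rbf_kernel(X, X) + noise * np.eye(len(X))
K_star = rbf_kernel(X_test, X)

# Posterior mean: a single linear solve, no iterative optimisation of weights.
alpha = np.linalg.solve(K, y)
mean = K_star @ alpha
print(mean)
```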
Local minima are a property of the solution space, not of the optimization method, and they are a problem with neural networks in general. Convex methods, such as SVMs, have gained in popularity largely because of this.