I'm training a XOR neural network via back-propagation using stochastic gradient descent. The weights of the neural network are initialized to random values between -0.5 an
Yes, neural networks can get stuck in local minima, depending on the error surface. However, this abstract suggests that there are no local minima in the error surface of the XOR problem. I cannot get to the full text, though, so I cannot verify what the authors did to prove this or how it applies to your problem.
Other factors might also lead to this problem. For example, if you descend very fast into a steep valley using only first-order gradient descent, you might overshoot onto the opposite slope and bounce back and forth indefinitely. You could also try reporting the average change over all weights at each iteration, to test whether you really have a "stuck" network or rather one that has just run into a limit cycle.
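As a rough illustration of that check, here is a minimal sketch in Python. The function names, the window-based heuristic, and the thresholds `change_tol` and `error_tol` are all hypothetical choices for illustration, not part of any library or of the original poster's code.

```python
def mean_abs_weight_change(old_weights, new_weights):
    """Average absolute change per weight between two iterations."""
    if len(old_weights) != len(new_weights) or not old_weights:
        raise ValueError("weight lists must be non-empty and of equal length")
    total = sum(abs(n - o) for o, n in zip(old_weights, new_weights))
    return total / len(old_weights)


def classify_plateau(errors, changes, change_tol=1e-6, error_tol=1e-4):
    """Rough heuristic over a window of recent iterations.

    errors  -- training error at each iteration in the window
    changes -- mean absolute weight change for each update in the window
    The thresholds are assumptions and need tuning per problem.
    """
    if errors[0] - errors[-1] > error_tol:
        return "still learning"   # error dropped over the window
    if max(changes) < change_tol:
        return "stuck"            # error flat and weights barely move
    return "oscillating"          # error flat but weights keep moving
```

If the error is flat while the weights still change substantially, a limit cycle is the more likely explanation than a true minimum or plateau.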
You should first try fiddling with your parameters (learning rate, momentum if you implemented it, etc.). If you can make the problem go away by changing parameters, your algorithm is probably OK.
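To see how much the learning rate alone can matter, here is a toy sketch, assuming a one-dimensional error function f(w) = w^2 rather than the actual XOR network. The function `descend` and its parameters are hypothetical names used only for this example.

```python
def descend(w, learning_rate, momentum=0.0, steps=50):
    """Gradient descent with optional momentum on the toy error f(w) = w**2.

    Returns the list of weights visited, starting with the initial one.
    """
    velocity = 0.0
    trajectory = [w]
    for _ in range(steps):
        gradient = 2.0 * w                       # derivative of w**2
        velocity = momentum * velocity - learning_rate * gradient
        w = w + velocity
        trajectory.append(w)
    return trajectory


# A step size that is too large bounces between the two slopes forever.
bouncing = descend(1.0, learning_rate=1.0, steps=4)

# A smaller step size on the same problem converges towards the minimum at 0.
converging = descend(1.0, learning_rate=0.1, steps=50)
```

With `learning_rate=1.0` each update maps `w` to `-w`, so the error never changes even though the update rule is implemented correctly. Reducing the learning rate makes the same code converge, which is the kind of evidence that points to parameters rather than a bug.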
I encountered the same issue and found that using the activation function 1.7159*tanh(2/3*x) described in LeCun's "Efficient Backprop" paper helps. This is presumably because that function does not saturate around the target values {-1, 1}, whereas regular tanh does.
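For reference, a minimal sketch of that activation and its derivative in plain Python. The constants come from the formula quoted above; the function names are my own.

```python
import math

A = 1.7159        # output scale from the formula above
S = 2.0 / 3.0     # input scale from the formula above


def scaled_tanh(x):
    """Scaled tanh activation: 1.7159 * tanh(2/3 * x)."""
    return A * math.tanh(S * x)


def scaled_tanh_derivative(x):
    """Derivative of scaled_tanh with respect to x, for use in backprop."""
    t = math.tanh(S * x)
    return A * S * (1.0 - t * t)
```

With these constants the function reaches the targets at finite inputs (`scaled_tanh(1.0)` is approximately 1), where the slope is still clearly non-zero. Plain tanh only approaches ±1 as its input grows without bound, where its slope vanishes.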
Poor gradient descent with excessively large steps, as described by LiKao, is one possible problem. Another is that there are very flat regions of the XOR error landscape, which means that it takes a very long time to converge; in fact, the gradient may be so weak that the descent algorithm doesn't pull you in the right direction.
These two papers look at 2-1-1 and 2-2-1 XOR landscapes. One uses a "cross entropy" error function, which I'm not familiar with. In the first, they declare there are no local minima, but in the second they say there are local minima at infinity, basically when weights run off to very large values. So for the second case, their results suggest that if you don't start off "near enough" to the true minima, you may get trapped at the infinite points. They also say that other analyses of 2-2-1 XOR networks that show no local minima are not contradicted by their results, because of the particular definitions used.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.4770
http://www.ncbi.nlm.nih.gov/pubmed/12662806