Multi-layer neural network back-propagation formula (using stochastic gradient descent)

Using the notations from Backpropagation calculus | Deep learning, chapter 4, I have this back-propagation code for a 4-layer (i.e. 2 hidden layers) neural network:

def sigmoid_prime(z):
    # z is expected to be an activation, i.e. z = σ(x), because σ'(x) = σ(x) (1 - σ(x))
    return z * (1-z)

def train(self, input_vector, target_vector):
    a = np.array(input_vector, ndmin=2).T
    y = np.array(target_vector, ndmin=2).T

    # forward
    A = [a]  
    for k in range(3):
        a = sigmoid(np.dot(self.weights[k], a))  # zero bias here just for simplicity
        A.append(a)

    # Now A has 4 elements: the input vector + the 3 output vectors

    # back-propagation
    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k+1])
        delta = np.dot(self.weights[k].T, tmp)  # (1)  <---- HERE
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)
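
The snippet above calls `sigmoid`, which is not shown, and passes activations (the entries of `A`) to `sigmoid_prime` rather than pre-activations. Here is a minimal sketch of that convention, assuming `sigmoid` is the standard logistic function; it compares `sigmoid_prime(sigmoid(x))` against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    # Assumed definition: the standard logistic function (not shown in the code above).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(a):
    # Takes the activation a = sigmoid(x), not x itself.
    return a * (1.0 - a)

x = np.linspace(-3.0, 3.0, 7)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
analytic = sigmoid_prime(sigmoid(x))
max_err = np.max(np.abs(numeric - analytic))
print(max_err)  # should be tiny if the convention is used consistently
```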

It works, but the accuracy at the end (for my use case: MNIST digit recognition) is only mediocre. Convergence is much better when line (1) is replaced by:
```
delta = np.dot(self.weights[k].T, delta)  # (2)
```
The code from Machine Learning with Python: Training and Testing the Neural Network with MNIST data set also uses:
```
delta = np.dot(self.weights[k].T, delta)
```
instead of:
```
delta = np.dot(self.weights[k].T, tmp)
```
(In the notation of that article, this is:
```
output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)
```
)

These two observations agree: code (2) seems to be better than code (1).

However, the math seems to show the opposite (see the video; note also that my loss function is multiplied by 1/2, whereas it is not in the video):

Question: which one is correct: the implementation (1) or (2)?

In LaTeX:

$$\frac{\partial{C}}{\partial{w^{L-1}}} = \frac{\partial{z^{L-1}}}{\partial{w^{L-1}}} \frac{\partial{a^{L-1}}}{\partial{z^{L-1}}} \frac{\partial{C}}{\partial{a^{L-1}}}=a^{L-2} \sigma'(z^{L-1}) \times w^L \sigma'(z^L)(a^L-y) $$
$$\frac{\partial{C}}{\partial{w^L}} = \frac{\partial{z^L}}{\partial{w^L}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=a^{L-1} \sigma'(z^L)(a^L-y)$$
$$\frac{\partial{C}}{\partial{a^{L-1}}} = \frac{\partial{z^L}}{\partial{a^{L-1}}} \frac{\partial{a^L}}{\partial{z^L}} \frac{\partial{C}}{\partial{a^L}}=w^L \sigma'(z^L)(a^L-y)$$

I spent two days analyzing this problem and filled a few notebook pages with partial-derivative computations, and I can confirm:

the math written in LaTeX in the question is correct

code (1) is the correct one, and it agrees with the math:

delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(self.weights[k].T, tmp)
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)

code (2) is wrong:

delta = a - y
for k in [2, 1, 0]:
    tmp = delta * sigmoid_prime(A[k+1])
    delta = np.dot(self.weights[k].T, delta)  # WRONG HERE
    self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T)

and there is a slight mistake in Machine Learning with Python: Training and Testing the Neural Network with MNIST data set:

output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)

should be

output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors * out_vector * (1.0 - out_vector))
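
One way to check these claims without pages of hand computation is a numerical gradient check. The sketch below is not the original class: it re-implements the same bias-free 3-weight-matrix network as standalone functions, with hypothetical small layer sizes and random data, and assumes the cost C = 1/2 ||a^L - y||^2 and a standard logistic `sigmoid`. It compares the gradients produced by both variants against central finite differences:

```python
import numpy as np

def sigmoid(x):
    # Assumed definition: the standard logistic function.
    return 1.0 / (1.0 + np.exp(-x))

def forward(weights, a):
    A = [a]
    for W in weights:
        a = sigmoid(W @ a)  # zero bias, as in the question
        A.append(a)
    return A

def cost(weights, x, y):
    # C = 1/2 * ||a_L - y||^2 (the 1/2 factor mentioned in the question)
    return 0.5 * np.sum((forward(weights, x)[-1] - y) ** 2)

def backprop(weights, x, y, variant):
    A = forward(weights, x)
    delta = A[-1] - y
    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        tmp = delta * A[k + 1] * (1.0 - A[k + 1])
        # variant 1 propagates tmp (includes sigma'); variant 2 propagates delta
        delta = weights[k].T @ (tmp if variant == 1 else delta)
        grads[k] = tmp @ A[k].T
    return grads

def numeric_grad(weights, x, y, k, eps=1e-6):
    # Central finite difference on every entry of weights[k].
    g = np.zeros_like(weights[k])
    for idx in np.ndindex(*weights[k].shape):
        old = weights[k][idx]
        weights[k][idx] = old + eps
        c_plus = cost(weights, x, y)
        weights[k][idx] = old - eps
        c_minus = cost(weights, x, y)
        weights[k][idx] = old
        g[idx] = (c_plus - c_minus) / (2 * eps)
    return g

rng = np.random.default_rng(0)
sizes = [5, 4, 3, 2]  # hypothetical layer sizes, chosen only to keep the check fast
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
x = rng.normal(size=(5, 1))
y = rng.random(size=(2, 1))

errs = {}
for variant in (1, 2):
    grads = backprop(weights, x, y, variant)
    errs[variant] = [np.max(np.abs(grads[k] - numeric_grad(weights, x, y, k)))
                     for k in range(3)]
    print(variant, errs[variant])
```

In this setup, variant 1 should agree with the finite differences at every layer, while variant 2 should agree only for the last weight matrix (where `delta` has not yet been propagated) and deviate for the earlier ones.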

Now the difficult part, which took me days to realize:

Code (2) apparently converges far better than code (1), which is why I was misled into thinking that code (2) was correct and code (1) was wrong.
But in fact that is just a coincidence, because the learning_rate was set too low. Here is the reason: with code (2), delta grows much faster than with code (1) (printing np.linalg.norm(delta) at each step helps to see this).
Thus the incorrect code (2) merely compensated for the too-low learning rate with a larger delta, and in some cases this led to apparently faster convergence.
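
A rough way to see where the extra magnitude comes from: the factor `sigmoid_prime(A[k+1])` that code (2) omits when propagating `delta` is never larger than 1/4, because a (1 - a) ≤ 1/4 for any a in [0, 1]. The sketch below uses hypothetical random activations and a random error signal (not values from a trained network) to illustrate that bound. How much `delta` grows after multiplying by `weights[k].T` still depends on the weights themselves.

```python
import numpy as np

def sigmoid_prime(a):
    # Derivative of the sigmoid, expressed in terms of the activation a.
    return a * (1.0 - a)

rng = np.random.default_rng(0)
a = rng.random(size=(10, 1))       # hypothetical activations in [0, 1)
delta = rng.normal(size=(10, 1))   # hypothetical incoming error signal

tmp = delta * sigmoid_prime(a)     # what code (1) sends through weights[k].T
ratio = np.linalg.norm(tmp) / np.linalg.norm(delta)
print(ratio)  # at most 0.25, since a * (1 - a) <= 1/4
```

Code (2) sends `delta` itself through `weights[k].T`, so at each layer it skips a shrink factor of at least 4, which is consistent with the larger `delta` norms observed above.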

Now solved!

Source: https://stackoverflow.com/questions/53287032/multi-layer-neural-network-back-propagation-formula-using-stochastic-gradient-d

Tags

python

machine-learning

neural-network

backpropagation

gradient-descent