Multi-layer neural network back-propagation formula (using stochastic gradient descent)

北城余情, submitted on 2019-12-06 21:01:28

I spent two days analyzing this problem and filled a few pages of a notebook with partial-derivative computations, and I can confirm:

  • the maths written in LaTeX in the question are correct
  • the code (1) is the correct one, and it agrees with the math computations:

    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k+1])
        delta = np.dot(self.weights[k].T, tmp)
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T) 
    
  • code (2) is wrong:

    delta = a - y
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k+1])
        delta = np.dot(self.weights[k].T, delta)  # WRONG HERE
        self.weights[k] -= self.learning_rate * np.dot(tmp, A[k].T) 
    

    and there is a slight mistake in Machine Learning with Python: Training and Testing the Neural Network with MNIST data set:

    output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors)
    

    should be

    output_errors = np.dot(self.weights_matrices[layer_index-1].T, output_errors * out_vector * (1.0 - out_vector))
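
A finite-difference gradient check is one way to confirm which variant matches the math. The sketch below is not the code from the question: it assumes a bias-free network with sigmoid activations, a loss of 0.5 * ||a - y||^2 (so that delta starts as a - y), and a sigmoid_prime that takes the activation rather than the pre-activation. The layer sizes and the helper names (forward, loss, backprop_grads, numerical_grads) are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(a):
    # Assumption: `a` is already the activation sigmoid(z), so sigma'(z) = a * (1 - a).
    return a * (1.0 - a)

def forward(weights, x):
    # A[0] is the input, A[k+1] is the activation after weights[k].
    A = [x]
    for W in weights:
        A.append(sigmoid(W @ A[-1]))
    return A

def loss(weights, x, y):
    a = forward(weights, x)[-1]
    return 0.5 * float(np.sum((a - y) ** 2))

def backprop_grads(weights, x, y, propagate_tmp=True):
    # propagate_tmp=True mirrors code (1); False mirrors code (2).
    A = forward(weights, x)
    delta = A[-1] - y
    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        tmp = delta * sigmoid_prime(A[k + 1])
        grads[k] = tmp @ A[k].T
        delta = weights[k].T @ (tmp if propagate_tmp else delta)
    return grads

def numerical_grads(weights, x, y, eps=1e-6):
    # Central differences on every single weight (slow, for checking only).
    grads = []
    for W in weights:
        G = np.zeros_like(W)
        for idx in np.ndindex(W.shape):
            old = W[idx]
            W[idx] = old + eps
            up = loss(weights, x, y)
            W[idx] = old - eps
            down = loss(weights, x, y)
            W[idx] = old
            G[idx] = (up - down) / (2 * eps)
        grads.append(G)
    return grads

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # hypothetical layer sizes
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
x = rng.normal(size=(4, 1))
y = rng.normal(size=(2, 1))

num = numerical_grads(weights, x, y)
err1 = max(np.abs(g - n).max() for g, n in zip(backprop_grads(weights, x, y, True), num))
err2 = max(np.abs(g - n).max() for g, n in zip(backprop_grads(weights, x, y, False), num))
print("code (1) max abs error vs numerical gradient:", err1)
print("code (2) max abs error vs numerical gradient:", err2)
```

Under these assumptions, the gradients of the last layer are identical in both variants, because delta has not been propagated yet. The difference only appears in the earlier layers, which is where the check should show code (2) deviating from the numerical gradient.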
    

Now the difficult part that took me days to realize:

  • Code (2) apparently converges far better than code (1), which is why I was misled into thinking code (2) was correct and code (1) was wrong.

  • But that was just a coincidence, because the learning_rate was set too low. The reason: with code (2), the norm of delta grows much faster than with code (1) (printing np.linalg.norm(delta) makes this visible).

  • Thus the incorrect code (2) simply compensated for the too-low learning rate with a larger delta, and in some cases this led to an apparently faster convergence.
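
The size difference comes from the sigmoid derivative, which never exceeds 0.25: code (1) scales delta by that factor at every layer before multiplying by the transposed weights, while code (2) skips the scaling. The sketch below prints np.linalg.norm(delta) for both variants on a toy network. It is an illustration under assumptions that are not in the original code: no biases, random weights, made-up layer sizes, a sigmoid_prime that takes the activation, and a hypothetical helper delta_norms.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(a):
    # Assumption: `a` is the activation sigmoid(z), so the result is at most 0.25.
    return a * (1.0 - a)

rng = np.random.default_rng(1)
sizes = [4, 5, 3, 2]  # hypothetical layer sizes
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
x = rng.normal(size=(4, 1))
y = rng.normal(size=(2, 1))

# Forward pass: A[0] is the input, A[k+1] the activation after weights[k].
A = [x]
for W in weights:
    A.append(sigmoid(W @ A[-1]))

def delta_norms(propagate_tmp):
    # propagate_tmp=True mirrors code (1); False mirrors code (2).
    delta = A[-1] - y
    norms, shrink = [], []
    for k in [2, 1, 0]:
        tmp = delta * sigmoid_prime(A[k + 1])
        # Ratio |tmp| / |delta|: how much the sigmoid derivative shrinks the signal.
        shrink.append(float(np.linalg.norm(tmp) / np.linalg.norm(delta)))
        delta = weights[k].T @ (tmp if propagate_tmp else delta)
        norms.append(float(np.linalg.norm(delta)))
    return norms, shrink

norms1, shrink1 = delta_norms(True)
norms2, shrink2 = delta_norms(False)
for k, n1, n2 in zip([2, 1, 0], norms1, norms2):
    print(f"k={k}: code (1) |delta| = {n1:.4g}   code (2) |delta| = {n2:.4g}")
print("per-layer shrink factors |tmp|/|delta|:", shrink1)
```

The shrink factor is at most 0.25 at every layer, so after a few layers code (2) tends to carry a much larger delta, which acts like a larger effective learning rate on the earlier layers. The exact norms depend on the weights, so the printed values are only indicative.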

Now solved!
