Neural Network Optimizers
Gradient descent has three main variants: BGD, SGD, and MBGD. The difference between them is how much data we use to compute the gradient of the objective function.

1. BGD (Batch gradient descent)

BGD uses the entire training set to perform each update:

```python
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
```

Its main properties are:

(1) Convergence: batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

(2) Drawback: as we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that do not fit in memory.
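To make the pseudocode above concrete, here is a minimal runnable sketch of BGD on least-squares linear regression. The synthetic dataset, this particular `evaluate_gradient`, and the hyperparameters (`learning_rate`, `nb_epochs`) are illustrative assumptions, not from the original post.

```python
import numpy as np

# Illustrative setup: batch gradient descent on least-squares linear
# regression. The synthetic data and hyperparameters below are
# assumptions chosen to make the loop runnable end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

def evaluate_gradient(X, y, params):
    # Gradient of the mean squared error, computed over the WHOLE
    # dataset -- this full pass is what makes BGD a "batch" method.
    residual = X @ params - y
    return (2.0 / len(y)) * (X.T @ residual)

params = np.zeros(3)
learning_rate = 0.1
nb_epochs = 200

for i in range(nb_epochs):
    # Exactly one parameter update per full pass over the data.
    params_grad = evaluate_gradient(X, y, params)
    params = params - learning_rate * params_grad

print(params)  # converges toward true_w = [2.0, -1.0, 0.5]
```

Because `evaluate_gradient` touches all 100 samples for every single update, scaling this loop to millions of examples makes each step expensive, which is exactly the cost that motivates the SGD and MBGD variants.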