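Algorithm overview

Gradient descent is an iterative method for finding a minimum of a function (its mirror image, gradient ascent, finds a maximum). It takes the partial derivative of the objective function with respect to every variable to form the gradient. The gradient points in the direction in which the function value increases fastest, so to find a minimum we step in the opposite direction.

There are three common variants: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, which sits between the other two. Batch gradient descent uses every sample at each step, so each step follows the direction of steepest descent, but it becomes slow when there are many samples. Stochastic gradient descent is much cheaper per step but less precise. Its randomness also means it may jump out of a local optimum and reach the global one, whereas batch gradient descent moves step by step toward a local optimum.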
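
The overview names mini-batch gradient descent, but the rest of the post only implements the batch and stochastic variants. As a rough sketch of how the three relate, the code below assumes a one-feature linear model y ≈ a*x + b with a mean-squared-error loss. The names `gradient` and `minibatch_gd`, the learning rate, and the toy data are illustrative assumptions, not taken from the original post. It uses only the standard library so that it runs without NumPy.

```python
import random

def gradient(a, b, batch):
    """Gradient of the mean squared error of y ≈ a*x + b over `batch`."""
    n = len(batch)
    grad_a = sum(2 * (a * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (a * x + b - y) for x, y in batch) / n
    return grad_a, grad_b

def minibatch_gd(data, batch_size, eta=0.05, n_epochs=2000, seed=0):
    """batch_size == len(data) is batch GD; batch_size == 1 is SGD."""
    rng = random.Random(seed)
    data = list(data)
    a, b = 0.0, 0.0
    for _ in range(n_epochs):
        rng.shuffle(data)  # visit the samples in a new random order each epoch
        for start in range(0, len(data), batch_size):
            grad_a, grad_b = gradient(a, b, data[start:start + batch_size])
            a -= eta * grad_a
            b -= eta * grad_b
    return a, b

# Noise-free points on the line y = 3x + 4 (illustrative values)
data = [(i / 10, 3 * (i / 10) + 4) for i in range(20)]
for size in (len(data), 5, 1):
    a, b = minibatch_gd(data, batch_size=size)
    print(f"batch_size={size}: a={a:.4f}, b={b:.4f}")
```

Because the toy data has no noise, all three settings should recover the same line. With noisy data and a fixed step size, the smaller batch sizes would keep jittering around the optimum, which is the precision-for-time trade-off described above.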

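Simulating gradient descent

Finding the minimum of a single-variable quadratic function with gradient descent

Use gradient descent to find the minimum of y = (x - 2.5)^2 - 1.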

import numpy as np
import matplotlib.pyplot as plt

plot_x = np.linspace(-1., 6., 141)  # sample points for plotting the curve
plot_y = (plot_x-2.5) ** 2 - 1.
#plt.plot(plot_x, plot_y)
#plt.show()

epsilon = 1e-8  # convergence tolerance
eta = 0.1  # learning rate

def J(theta):  # the function to minimize
    return (theta - 2.5) ** 2 - 1.
def dJ(theta):  # derivative of J
    return 2*(theta-2.5)
theta = 0.0  # initial value of theta
theta_history = [theta]  # record every theta visited so the path can be plotted
while True:
    gradient = dJ(theta)  # compute the gradient
    last_theta = theta  # remember the previous theta
    theta = theta - eta * gradient  # step against the gradient to get the new theta
    theta_history.append(theta)
    if(abs(J(last_theta)-J(theta)) < epsilon):  # stop once J barely changes; treat this as the extremum
        break

print(theta)  # theta at the minimum
print(J(theta))  # the minimum value
plt.plot(plot_x, plot_y)
plt.plot(np.array(theta_history), J(np.array(theta_history)), color = 'r', marker = '+' )
plt.show()
# If eta is too large the updates diverge and the stopping condition is never met,
# so it is safer to cap the number of iterations as well
def gradient_descent(initial_theta, eta, n_iters = 1e4, epsilon=1e-8):

    theta = initial_theta
    i_iter = 0
    theta_history = [initial_theta]  # path of theta values for this run

    while i_iter < n_iters:
        gradient = dJ(theta)
        last_theta = theta
        theta = theta - eta * gradient
        theta_history.append(theta)

        if(abs(J(theta) - J(last_theta)) < epsilon):
            break

        i_iter += 1

    return theta, theta_history

2.499891109642585
-0.99999998814289
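
For this particular J, each update multiplies the distance to the minimum, theta - 2.5, by (1 - 2 * eta), so the iteration only converges when 0 < eta < 1. The sketch below tries a few step sizes under an iteration cap to make that visible. The helper name `run`, the cap of 50 steps, and the chosen eta values are assumptions made for illustration.

```python
def J(theta):
    return (theta - 2.5) ** 2 - 1.0

def dJ(theta):
    return 2.0 * (theta - 2.5)

def run(eta, theta=0.0, n_iters=50, epsilon=1e-8):
    """Gradient descent with an iteration cap; returns (theta, steps taken)."""
    for step in range(1, n_iters + 1):
        last_theta = theta
        theta = theta - eta * dJ(theta)
        if abs(J(theta) - J(last_theta)) < epsilon:
            return theta, step
    return theta, n_iters  # cap reached without meeting the tolerance

for eta in (0.01, 0.1, 0.8, 1.1):
    theta, steps = run(eta)
    print(f"eta={eta}: theta={theta:.4f}, J={J(theta):.4f}, steps={steps}")
```

A very small eta is still far from the minimum when the cap is hit, a moderate eta converges well within the cap, and eta = 1.1 moves further from 2.5 on every step.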


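Solving a simple linear regression model with batch gradient descent

Here we use batch gradient descent to fit a simple (one-feature) linear regression model, which comes down to finding the coefficients that minimize the loss function. With a plain sum-of-squares loss, the size of the gradient grows with the number of samples, so we divide the loss by the sample count m and then take the partial derivative with respect to each coefficient. Writing $X_b$ for the feature matrix with a leading column of ones and $X_b^{(i)}$ for its i-th row, the loss and its partial derivatives are:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - X_b^{(i)}\theta\right)^2$$

$$\frac{\partial J}{\partial \theta_j} = \frac{2}{m}\sum_{i=1}^{m}\left(X_b^{(i)}\theta - y^{(i)}\right)X_{b,j}^{(i)}, \qquad X_{b,0}^{(i)} = 1$$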

import numpy as np
import matplotlib.pyplot as plt
# Build a noisy linear data set
np.random.seed(666)
x = 2 * np.random.random(size = 100) 
y = x * 3. + 4 + np.random.normal(size = 100)  # add noise
X = x.reshape(-1, 1)  # 100 samples, one feature each
plt.scatter(x, y)
plt.show()

def J(theta, X_b, y):  # loss function
    try:
        return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)  # mean squared error
    except:
        return float('inf')

def dJ(theta, X_b, y):  # partial derivatives of J
    res = np.empty(len(theta))  # one slot per partial derivative
    res[0] = np.sum(X_b.dot(theta) - y)  # the first entry is the intercept; compute it first
    for i in range(1, len(theta)):
        res[i] = (X_b.dot(theta)-y).dot(X_b[:,i])  # partial derivative for theta[i]; the dot product already sums over all samples
    return res * 2 / len(X_b)  # finally scale every entry by 2/m
def gradient_descent(X_b, y, initial_theta, eta, n_iters = 1e4, epsilon=1e-8):
    theta = initial_theta 
    cur_iter = 0

    while cur_iter < n_iters:
        gradient = dJ(theta, X_b, y)  # compute the gradient
        last_theta = theta
        theta = theta - eta * gradient
        if(abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
            break

        cur_iter += 1

    return theta


X_b = np.hstack([np.ones((len(x), 1)), x.reshape(-1,1)])  # stack horizontally: a column of ones next to x
initial_theta = np.zeros(X_b.shape[1])  # start with theta all zeros
eta = 0.01  # learning rate (step size)

theta = gradient_descent(X_b, y, initial_theta, eta)
theta  # the result is close to the true values 4.0 and 3.0


array([4.02145786, 3.00706277])
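
One way to sanity-check a result like this is to compare it with the closed-form least-squares solution, which exists for simple linear regression. The sketch below does that in plain Python, assuming the same model and stopping rule as above. The function names `least_squares` and `batch_gd` are illustrative, and the data comes from Python's `random` module, so the numbers will not match the NumPy run in the post.

```python
import random

def least_squares(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

def batch_gd(xs, ys, eta=0.01, n_iters=10000, epsilon=1e-8):
    """Batch gradient descent on the mean squared error; returns (intercept, slope)."""
    m = len(xs)
    intercept, slope = 0.0, 0.0
    last_loss = float("inf")
    for _ in range(n_iters):
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        loss = sum(r * r for r in residuals) / m
        if abs(last_loss - loss) < epsilon:  # loss has stopped changing
            break
        last_loss = loss
        intercept -= eta * 2 * sum(residuals) / m
        slope -= eta * 2 * sum(r * x for r, x in zip(residuals, xs)) / m
    return intercept, slope

rng = random.Random(666)  # arbitrary seed; the data differs from the NumPy example
xs = [2 * rng.random() for _ in range(100)]
ys = [3.0 * x + 4.0 + rng.gauss(0, 1) for x in xs]

exact = least_squares(xs, ys)
approx = batch_gd(xs, ys)
print("closed form     :", exact)
print("gradient descent:", approx)
```

The two answers should agree to within a small margin set by the stopping tolerance epsilon. A large gap would suggest a bug in the gradient or a poorly chosen learning rate.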

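Vectorizing batch gradient descent and standardizing the data

The for loop in the gradient can be vectorized: transpose the residual $X_b\theta - y$ into a row vector and multiply it by $X_b$ (whose first column is the intercept column, set to all ones). Transposing the result back into a column vector gives:

$$\nabla J(\theta) = \frac{2}{m}\,X_b^{T}\left(X_b\theta - y\right)$$

Without standardization, some features have very large values and others very small ones, so a single step size eta can be effectively too large for some parameters and too small for others. The features should therefore be standardized first.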
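
To make the standardization step concrete, here is a small sketch of what it does to each feature column: subtract the training mean and divide by the training standard deviation, then reuse those same statistics on the test data. The helper names `fit_scaler` and `transform` and the feature values are made up for illustration, and the population standard deviation is used here.

```python
import statistics

def fit_scaler(column):
    """Return (mean, std) of one feature column, computed on training data only."""
    return statistics.mean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Rescale a column to zero mean and unit variance using the given statistics."""
    return [(value - mean) / std for value in column]

# Two features on very different scales (made-up values)
rooms_train = [4.0, 5.0, 6.0, 7.0, 8.0]
tax_train = [200.0, 300.0, 400.0, 500.0, 600.0]
rooms_test = [6.0, 9.0]

rooms_mean, rooms_std = fit_scaler(rooms_train)
tax_mean, tax_std = fit_scaler(tax_train)

scaled_rooms = transform(rooms_train, rooms_mean, rooms_std)
scaled_tax = transform(tax_train, tax_mean, tax_std)
# The test set is transformed with the training statistics, never its own
scaled_rooms_test = transform(rooms_test, rooms_mean, rooms_std)

print(scaled_rooms)
print(scaled_tax)
print(scaled_rooms_test)
```

After scaling, both training columns cover the same range, so one value of eta suits every coefficient.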

# Vectorized gradient descent (a method of the custom LinearRegression class)
def fit_gd(self, X_train, y_train, eta=0.01, n_iters=1e4):
    """Train the Linear Regression model on X_train, y_train using gradient descent."""
    assert X_train.shape[0] == y_train.shape[0], \
        "the size of X_train must be equal to the size of y_train"

    def J(theta, X_b, y):
        try:
            return np.sum((y - X_b.dot(theta)) ** 2) / len(y)
        except:
            return float('inf')

    def dJ(theta, X_b, y):
        return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)  # transposing X_b vectorizes the sum over samples

    def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):

        theta = initial_theta
        cur_iter = 0

        while cur_iter < n_iters:
            gradient = dJ(theta, X_b, y)
            last_theta = theta
            theta = theta - eta * gradient
            if (abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
                break

            cur_iter += 1

        return theta

    X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
    initial_theta = np.zeros(X_b.shape[1])
    self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters)

    self.intercept_ = self._theta[0]
    self.coef_ = self._theta[1:]

    return self

import numpy as np
from sklearn import datasets
boston = datasets.load_boston()  # note: load_boston was removed in scikit-learn 1.2
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]
from playML.model_selection import train_test_split  # playML is a local package, not part of scikit-learn
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
# Standardize the features
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)
lin_reg3 = LinearRegression()  # the custom class that defines fit_gd above, not sklearn's LinearRegression
%time lin_reg3.fit_gd(X_train_standard, y_train)
X_test_standard = standardScaler.transform(X_test)
lin_reg3.score(X_test_standard, y_test)

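Stochastic gradient descent

Pick one sample at random, compute the gradient for that sample alone, and use it in place of the gradient of the whole loss function. Batch gradient descent follows the direction of steepest descent at every step and, with a suitable step size, is certain to reach an extremum. Stochastic gradient descent is unpredictable by comparison: an individual step may even move in the wrong direction. In practice it still ends up near an extremum, so it trades some precision for time. Its step size is usually reduced as the number of iterations grows. This gradual decay borrows an idea from search algorithms, namely simulated annealing.

The gradient for a single randomly chosen sample i is:

$$2\,\left(X_b^{(i)}\right)^{T}\left(X_b^{(i)}\theta - y^{(i)}\right)$$

The improved step size, where t is the number of iterations so far and $t_0$, $t_1$ are constants:

$$\eta = \frac{t_0}{t + t_1}$$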
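
To get a feel for the decaying step size, the short sketch below evaluates eta = t0 / (t + t1) at a few iteration counts, using the values t0 = 5 and t1 = 50 that appear in the implementation that follows. The specific iteration counts printed are arbitrary choices for illustration.

```python
def learning_rate(t, t0=5, t1=50):
    """Decaying step size eta(t) = t0 / (t + t1)."""
    return t0 / (t + t1)

for t in (0, 50, 450, 4950, 99950):
    print(f"t={t}: eta={learning_rate(t):.5f}")
```

The constant t1 keeps the first few steps from shrinking too abruptly: with t1 = 50 the step size only halves after 50 iterations, whereas a schedule such as 1 / (t + 1) would halve after the very first step.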

Implementing our own stochastic gradient descent

#%%time
import numpy as np
import matplotlib.pyplot as plt
m = 100000
x = np.random.normal(size=m)
X = x.reshape(-1, 1)  # reshape into a column matrix
y = 4. * x + 3. + np.random.normal(0, 3, size=m)  # the last term adds noise

def dJ_sgd(theta, X_b_i, y_i):  # gradient vector for a single sample
    return 2 * X_b_i.T.dot(X_b_i.dot(theta) - y_i)  # the single-sample gradient formula

def sgd(X_b, y, init_theta, n_iters):  # plain stochastic gradient descent
    t0, t1 = 5, 50  # empirical values; 5 and 50 usually work well
    def learning_rate(t):  # step size for the current iteration
        return t0 / (t+t1)

    theta = init_theta
    for cur_iter in range(n_iters):
        rand_i = np.random.randint(len(X_b))  # random index of the sample to use
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])  # compute the gradient
        theta = theta - learning_rate(cur_iter) * gradient  # update the parameters
    return theta

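# A second approach to SGD, similar in spirit to scikit-learn's, although scikit-learn is far more optimized
def fit_sgd(self, X_train, y_train, n_iters=50, t0=5, t1=50):  # here n_iters is the number of passes over all samples
        """Train the Linear Regression model on X_train, y_train using stochastic gradient descent."""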
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert n_iters >= 1

        def dJ_sgd(theta, X_b_i, y_i):
            return X_b_i * (X_b_i.dot(theta) - y_i) * 2.

        def sgd(X_b, y, initial_theta, n_iters=5, t0=5, t1=50):

            def learning_rate(t):
                return t0 / (t + t1)

            theta = initial_theta
            m = len(X_b)
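            for i_iter in range(n_iters):  # make n_iters passes over all samples
                indexes = np.random.permutation(m)  # random permutation of the sample indices
                X_b_new = X_b[indexes,:]  # shuffle the samples
                y_new = y[indexes]
                for i in range(m):  # every sample is used once per pass, in random order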
                    gradient = dJ_sgd(theta, X_b_new[i], y_new[i])
                    theta = theta - learning_rate(i_iter * m + i) * gradient

            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.random.randn(X_b.shape[1])
        self._theta = sgd(X_b, y_train, initial_theta, n_iters, t0, t1)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

X_b = np.hstack([np.ones((len(X), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
theta = sgd(X_b, y, initial_theta, n_iters=m)
theta

array([2.96053678, 4.03651892])

SGD in scikit-learn

# Load the Boston housing data (note: load_boston was removed in scikit-learn 1.2)
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 666)
# Standardize the features
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)

from sklearn.linear_model import SGDRegressor  # SGDRegressor lives in linear_model, so it only fits linear models
#sgd_reg = SGDRegressor()  # create the estimator
#%time sgd_reg.fit(X_train_standard, y_train)  # fit on the training data
#sgd_reg.score(X_test_standard, y_test)  # compute R^2

sgd_reg = SGDRegressor(n_iter=50)  # 50 passes over the training data; newer scikit-learn versions use max_iter instead of n_iter
%time sgd_reg.fit(X_train_standard, y_train)
sgd_reg.score(X_test_standard, y_test)

Wall time: 4 ms
0.7991560557007135

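A simple way to debug the gradient computation

When the function being minimized is complicated, we may derive a wrong gradient formula without noticing and blame the hyperparameters instead. A simple check is to estimate the gradient numerically on a small data set first. That gives a rough idea of the range of each gradient component and an intuitive test of the derived formula.
The idea is simple: for each θ, take two points a small distance to either side and use the slope of the line through them as the partial derivative at that point. This brute-force approach is slow, but it is reasonably accurate.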
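
Before running a full descent with the numerical gradient, it can be quicker to compare the two gradients at a single point. The sketch below does that in plain Python for a tiny mean-squared-error problem and also shows how a deliberately wrong formula (one that lost the factor 2) stands out. The names `loss`, `grad_math`, `grad_numeric`, and `max_diff`, along with the data values, are assumptions made for illustration.

```python
def predict(theta, row):
    return sum(t * v for t, v in zip(theta, row))

def loss(theta, rows, ys):
    """Mean squared error; each row already includes the leading 1."""
    return sum((y - predict(theta, row)) ** 2 for row, y in zip(rows, ys)) / len(rows)

def grad_math(theta, rows, ys):
    """Analytic gradient (2/m) * X^T (X theta - y)."""
    m = len(rows)
    residuals = [predict(theta, row) - y for row, y in zip(rows, ys)]
    return [2 * sum(r * row[j] for r, row in zip(residuals, rows)) / m
            for j in range(len(theta))]

def grad_numeric(theta, rows, ys, epsilon=0.01):
    """Central-difference estimate of each partial derivative."""
    grad = []
    for j in range(len(theta)):
        plus = list(theta)
        plus[j] += epsilon
        minus = list(theta)
        minus[j] -= epsilon
        grad.append((loss(plus, rows, ys) - loss(minus, rows, ys)) / (2 * epsilon))
    return grad

def max_diff(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

rows = [[1.0, 0.5, 2.0], [1.0, 1.5, -1.0], [1.0, -0.5, 0.3], [1.0, 2.0, 1.0]]
ys = [3.0, 1.0, 0.5, 4.0]
theta = [0.2, -0.4, 0.7]

analytic = grad_math(theta, rows, ys)
numeric = grad_numeric(theta, rows, ys)
buggy = [g / 2 for g in analytic]  # simulate a derivation that dropped the factor 2

print("analytic vs numeric:", max_diff(analytic, numeric))
print("buggy vs numeric   :", max_diff(buggy, numeric))
```

For a quadratic loss like this one the central difference is exact up to rounding error, so the correct formula matches almost perfectly while the buggy one is clearly off. For other loss functions the match would only be approximate and would depend on epsilon.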

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
# Build a linear model to recover
X = np.random.random(size=(1000, 10))
true_theta = np.arange(1, 12, dtype=float)  # the true theta
X_b = np.hstack([np.ones((len(X), 1)), X])
y = X_b.dot(true_theta) + np.random.normal(size=1000)  # add noise

def J(theta, X_b, y):  # loss function
    try:
        return np.sum((y - X_b.dot(theta))**2) / len(X_b)
    except:
        return float('inf')

def dJ_math(theta, X_b, y):  # gradient from the derived formula
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)

def dJ_debug(theta, X_b, y, epsilon=0.01):  # numerical gradient for debugging
    res = np.empty(len(theta)) 
    for i in range(len(theta)):  # for each theta[i], use the slope between two nearby points as its partial derivative
        theta_1 = theta.copy()
        theta_1[i] += epsilon 
        theta_2 = theta.copy()
        theta_2[i] -= epsilon
        res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon)  # slope between the two points
    return res

def gradient_descent(dJ, X_b, y, initial_theta, eta, n_iters = 1e4, epsilon=1e-8):  # run gradient descent; dJ selects which gradient function to use

    theta = initial_theta
    cur_iter = 0

    while cur_iter < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        if(abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
            break

        cur_iter += 1

    return theta

X_b = np.hstack([np.ones((len(X), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01

%time theta = gradient_descent(dJ_debug, X_b, y, initial_theta, eta)
theta

%time theta = gradient_descent(dJ_math, X_b, y, initial_theta, eta)
theta

Wall time: 6.14 s
Wall time: 862 ms

array([ 1.1251597 ,  2.05312521,  2.91522497,  4.11895968,  5.05002117,
        5.90494046,  6.97383745,  8.00088367,  8.86213468,  9.98608331,
       10.90529198])

Source: CSDN

Author: 键盘里的青春

Link: https://blog.csdn.net/qq_34374664/article/details/80317426
