The Least Squares Formula for Simple Linear Regression

Submitted by 寵の児 on 2020-02-22 18:58:17

Simple Linear Regression

Suppose we have $n$ points $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$ that we want to fit with a linear relation $y = \beta_0 + \beta_1 x$. The values of $\beta_0$ and $\beta_1$ are determined by minimizing
$$\text{RSS} = \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2,$$
where RSS stands for the residual sum of squares. This is the method of least squares.
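To make the objective concrete, here is a minimal sketch that evaluates RSS for a given candidate pair $(\beta_0, \beta_1)$; the toy data and coefficient values are made up for illustration:

import numpy as np

def rss(beta_0: float, beta_1: float, x: np.ndarray, y: np.ndarray) -> float:
    # Residual sum of squares for the candidate line y = beta_0 + beta_1 * x
    residuals = y - (beta_0 + beta_1 * x)
    return float(np.sum(residuals ** 2))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.9, 5.1, 7.0])
print(rss(1.0, 2.0, x, y))  # RSS of the candidate line y = 1 + 2x on the toy data

Least squares picks, among all candidate lines, the one whose RSS is smallest.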

Determining the Regression Coefficients

The Derivative Method

How to find the values of $\beta_0$ and $\beta_1$ is covered extensively in textbooks. The most common approach is to take the partial derivatives of RSS with respect to $\beta_0$ and $\beta_1$, and choose the values of $\beta_0$ and $\beta_1$ that make both partial derivatives zero. Concretely:
$$\begin{cases} \dfrac{\partial \text{RSS}}{\partial \beta_0} = 2 n \beta_0 + 2 \sum x_i \, \beta_1 - 2 \sum y_i \\[2mm] \dfrac{\partial \text{RSS}}{\partial \beta_1} = 2 \sum x_i^2 \, \beta_1 + 2 \sum x_i \, \beta_0 - 2 \sum x_i y_i \end{cases}$$
Here the summation symbol $\sum$ runs from $i = 1$ to $i = n$. Solving this system of two linear equations gives:
$$\begin{cases} \beta_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \\[2mm] \beta_0 = \bar{y} - \beta_1 \bar{x} \end{cases}$$
Again every $\sum$ runs from $i = 1$ to $i = n$, with $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$. In addition, we need to check the second-order conditions (the Hessian of RSS is positive definite) to confirm that this critical point is indeed a minimum.
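As a quick sanity check, sympy can solve the two first-order conditions symbolically. The sketch below, using three generic data points, is my own addition and not part of the original derivation:

import sympy as sp

# Three generic data points; the derivation in the text holds for any n.
xs = sp.symbols('x1 x2 x3')
ys = sp.symbols('y1 y2 y3')
b0, b1 = sp.symbols('beta0 beta1')

RSS = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
# Solve d(RSS)/d(beta0) = 0 and d(RSS)/d(beta1) = 0 simultaneously.
sol = sp.solve([sp.diff(RSS, b0), sp.diff(RSS, b1)], [b0, b1], dict=True)[0]

x_bar = sum(xs) / 3
y_bar = sum(ys) / 3
claimed_b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
             / sum((x - x_bar) ** 2 for x in xs)
print(sp.simplify(sol[b1] - claimed_b1))                    # expect 0
print(sp.simplify(sol[b0] - (y_bar - claimed_b1 * x_bar)))  # expect 0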

Completing the Square

Expanding the expression for RSS, we have
$$\text{RSS}(\beta_0, \beta_1) = n \beta_0^2 + \sum x_i^2 \, \beta_1^2 + \sum y_i^2 + 2 \sum x_i \, \beta_0 \beta_1 - 2 \sum y_i \, \beta_0 - 2 \sum x_i y_i \, \beta_1$$
Since $\text{RSS}(\beta_0, \beta_1)$ is simply a quadratic function of $\beta_0$ and $\beta_1$, we can find the values of $\beta_0$ and $\beta_1$ that minimize it by completing the square.

First, divide by $n$ to simplify:
$$\text{RSS}(\beta_0, \beta_1)/n = \beta_0^2 + \frac{\sum x_i^2}{n}\,\beta_1^2 + \frac{\sum y_i^2}{n} + 2\frac{\sum x_i}{n}\,\beta_0\beta_1 - 2\frac{\sum y_i}{n}\,\beta_0 - 2\frac{\sum x_i y_i}{n}\,\beta_1$$

We first complete the square to absorb the three terms involving $\beta_0^2$, $\beta_0$, and $\beta_0\beta_1$. The computation goes as follows:
$$\text{RSS}/n = \left(\beta_0 + \frac{\sum x_i}{n}\,\beta_1 - \frac{\sum y_i}{n}\right)^2 - \left(\frac{\sum x_i}{n}\right)^2\beta_1^2 + \frac{\sum x_i^2}{n}\,\beta_1^2 + \frac{\sum y_i^2}{n} - \left(\frac{\sum y_i}{n}\right)^2 + 2\frac{\sum x_i}{n}\frac{\sum y_i}{n}\,\beta_1 - 2\frac{\sum x_i y_i}{n}\,\beta_1$$

Next we absorb the $\beta_1^2$ and $\beta_1$ terms. We have:

$$\begin{aligned}
\text{RSS}/n &= \left(\beta_0 + \frac{\sum x_i}{n}\,\beta_1 - \frac{\sum y_i}{n}\right)^2 \\
&\quad + \left(\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2\right)\left[\beta_1^2 + \frac{2\frac{\sum x_i}{n}\frac{\sum y_i}{n} - 2\frac{\sum x_i y_i}{n}}{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2}\,\beta_1 + \left(\frac{\frac{\sum x_i}{n}\frac{\sum y_i}{n} - \frac{\sum x_i y_i}{n}}{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2}\right)^2\right] \\
&\quad + \frac{\sum y_i^2}{n} - \left(\frac{\sum y_i}{n}\right)^2 - \frac{\left(\frac{\sum x_i}{n}\frac{\sum y_i}{n} - \frac{\sum x_i y_i}{n}\right)^2}{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2} \\
&= \left(\beta_0 + \frac{\sum x_i}{n}\,\beta_1 - \frac{\sum y_i}{n}\right)^2 + \left(\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2\right)\left[\beta_1 + \frac{\frac{\sum x_i}{n}\frac{\sum y_i}{n} - \frac{\sum x_i y_i}{n}}{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2}\right]^2 \\
&\quad + \frac{\sum y_i^2}{n} - \left(\frac{\sum y_i}{n}\right)^2 - \frac{\left(\frac{\sum x_i}{n}\frac{\sum y_i}{n} - \frac{\sum x_i y_i}{n}\right)^2}{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2}
\end{aligned}$$

After completing the square twice, we can see that minimizing $\text{RSS}/n$ requires
$$\begin{cases} \beta_1 = \dfrac{\frac{\sum x_i y_i}{n} - \frac{\sum x_i}{n}\frac{\sum y_i}{n}}{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2} \\[2mm] \beta_0 = \bar{y} - \beta_1 \bar{x} \end{cases}$$

After simplification, $\displaystyle \frac{\frac{\sum x_i y_i}{n} - \frac{\sum x_i}{n}\frac{\sum y_i}{n}}{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$. Hence completing the square and differentiating yield the same values of $\beta_0$ and $\beta_1$.
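The simplification rests on two standard identities, obtained by expanding the products and using $\sum x_i = n\bar{x}$ and $\sum y_i = n\bar{y}$:
$$\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\,\bar{x}\bar{y}, \qquad \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\,\bar{x}^2.$$
Multiplying the numerator and denominator of the fraction above by $n$ and applying these identities yields the right-hand side.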

The computation above shows that completing the square is considerably more tedious, and it does not generalize easily to multivariate regression.

One advantage of completing the square is that it avoids differentiation altogether. Another direct by-product is the minimal value of RSS itself: when $\beta_0$ and $\beta_1$ take their optimal values, we have
$$\text{RSS}/n = \frac{\sum y_i^2}{n} - \left(\frac{\sum y_i}{n}\right)^2 - \frac{\left(\frac{\sum x_i}{n}\frac{\sum y_i}{n} - \frac{\sum x_i y_i}{n}\right)^2}{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2},$$
that is,

$$\text{RSS} = \sum y_i^2 - n\left(\frac{\sum y_i}{n}\right)^2 - n\,\frac{\left(\frac{\sum x_i}{n}\frac{\sum y_i}{n} - \frac{\sum x_i y_i}{n}\right)^2}{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2}$$
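As a side note (my own observation, not in the original text), this minimal RSS has a compact form in terms of the centered sums of squares. Writing $S_{xx} = \sum (x_i - \bar{x})^2$, $S_{yy} = \sum (y_i - \bar{y})^2$, and $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$, the identities above reduce the formula to
$$\text{RSS}_{\min} = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = (1 - r^2)\, S_{yy},$$
where $r$ is the sample correlation coefficient. Since $S_{yy}$ is the total sum of squares (TSS) and $r^2$ equals the $R^2$ of a simple linear regression, this is the identity $\text{RSS} = (1 - R^2)\,\text{TSS}$ used in the sklearn-based check below.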

A Simple Numerical Check

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression


class least_square_singleVariable:

    def find_beta_1_and_beta_0(self, train_x: np.ndarray, train_y: np.ndarray) -> "tuple[float, float]":
        """
        Given the independent variable train_x and the dependent variable train_y,
        return (beta_1, beta_0) from the closed-form least-squares solution.
        """
        x_bar = np.mean(train_x)
        y_bar = np.mean(train_y)
        # beta_1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
        beta_1 = np.dot(train_x - x_bar, train_y - y_bar) / np.sum((train_x - x_bar) ** 2)
        beta_0 = y_bar - beta_1 * x_bar
        return beta_1, beta_0

    def find_optimal_RSS(self, train_x: np.ndarray, train_y: np.ndarray) -> float:
        """
        Compute the minimal residual sum of squares (RSS) directly from the
        formula derived in the text above, without fitting a model.
        """
        n = len(train_x)
        x_bar = np.mean(train_x)
        y_bar = np.mean(train_y)
        sum_xi_square = np.sum(train_x ** 2)
        sum_yi_square = np.sum(train_y ** 2)
        sum_xiyi = np.dot(train_x, train_y)
        # RSS = sum(y_i^2) - n*y_bar^2 - n*(x_bar*y_bar - mean(x_i*y_i))^2 / (mean(x_i^2) - x_bar^2)
        res = sum_yi_square - n * y_bar ** 2 - \
                n * (x_bar * y_bar - sum_xiyi / n) ** 2 / (sum_xi_square / n - x_bar ** 2)
        return res

    def get_optimal_RSS_sklearn(self, train_x: np.ndarray, train_y: np.ndarray) -> float:
        """
        Compute the residual sum of squares using sklearn's LinearRegression.
        The result should match the value returned by find_optimal_RSS.
        """
        n = len(train_x)
        model = LinearRegression()
        model.fit(train_x.reshape(n, 1), train_y)
        R_square = model.score(train_x.reshape(n, 1), train_y)
        y_bar = np.mean(train_y)
        TSS = np.sum((train_y - y_bar) ** 2)  # total sum of squares
        RSS = (1 - R_square) * TSS            # RSS = (1 - R^2) * TSS
        return RSS

Let us verify that the RSS computed from the formula above equals the RSS given by LinearRegression() in the sklearn package.

a = least_square_singleVariable()
num_points = 100
train_x = np.linspace(0, 10, num_points)
# Noisy samples from the line y = 2x + 3
train_y = 2 * train_x + 3 + np.random.normal(0, 1, num_points)
beta_1, beta_0 = a.find_beta_1_and_beta_0(train_x, train_y)
print(a.find_optimal_RSS(train_x, train_y))
print(a.get_optimal_RSS_sklearn(train_x, train_y))

The output is:

89.00851265874053
89.00851265873582

The RSS value computed from our derived formula matches the RSS obtained in sklearn via R-squared; the two numbers differ only in the last few digits, as expected from floating-point rounding.

The fit can be plotted as follows:

plt.figure(figsize=(8, 6), dpi=100)
# Attach labels directly to each artist so the legend cannot mis-assign them
plt.scatter(train_x, train_y, label='training data')
plt.plot(train_x, beta_1 * train_x + beta_0, color='red', linewidth=4, label='linear model')
plt.xlabel("x value", fontsize=20)
plt.ylabel("y value", fontsize=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=15)
plt.show()

[Figure: scatter plot of the training data with the fitted regression line in red]
