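1. Regression vs. classification: regression predicts a continuous numeric variable, whereas classification predicts a categorical variable.
2. Regression analysis: fit an equation that models the relationship between two or more variables.
3. Simple linear regression: y = b1*x + b0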
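1) The parameters b1 and b0 follow from the least-squares formulas b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and b0 = ȳ - b1*x̄, where (xi, yi) are the sample points and x̄, ȳ are the sample means. A NumPy implementation of the simple linear regression equation: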
    # y = b1*x + b0
    import numpy as np

    def fitSLR(x, y):
        n = len(x)
        x_mean = np.mean(x)  # compute the means once, outside the loop
        y_mean = np.mean(y)
        fenzi = 0  # numerator: sum of (xi - x_mean) * (yi - y_mean)
        fenmu = 0  # denominator: sum of (xi - x_mean) ** 2
        for i in range(0, n):
            fenzi = fenzi + (x[i] - x_mean) * (y[i] - y_mean)
            fenmu = fenmu + (x[i] - x_mean) ** 2
        print(fenzi)
        print(fenmu)
        b1 = fenzi / float(fenmu)
        b0 = y_mean - b1 * x_mean
        print("b0:", b0, "b1:", b1)
        return b0, b1

    def predict(x, b0, b1):
        return b0 + b1 * x

    x = [1, 3, 2, 1, 3]
    y = [14, 24, 18, 17, 27]
    b0, b1 = fitSLR(x, y)
    x_test = 6
    y_test = predict(x_test, b0, b1)
    print("y_test", y_test)
Result: b0: 10.0 b1: 5.0
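As a cross-check on the loop-based fit, the same closed-form estimate can be written with vectorized NumPy operations and compared against `np.polyfit`. This is only a sketch: the helper name `fit_slr_vectorized` is introduced here for illustration and is not part of the original code.

```python
import numpy as np

def fit_slr_vectorized(x, y):
    """Least-squares slope and intercept for y = b1*x + b0 (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                        # deviations of x from its mean
    dy = y - y.mean()                        # deviations of y from its mean
    b1 = np.sum(dx * dy) / np.sum(dx ** 2)   # slope
    b0 = y.mean() - b1 * x.mean()            # intercept
    return b0, b1

x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]

b0, b1 = fit_slr_vectorized(x, y)

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

print("closed form:", b0, b1)
print("np.polyfit: ", intercept, slope)
```

Both routes should agree with the b0 and b1 reported above, up to floating-point rounding.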
2) Call the ols function from the statsmodels statistical modeling package
    import statsmodels.api as sm
    import pandas as pd
    import numpy as np

    x = [1, 3, 2, 1, 3]
    y = [14, 24, 18, 17, 27]
    data = np.vstack((x, y))
    dat = pd.DataFrame(data.T, columns=['x', 'y'])
    fit = sm.formula.ols('y ~ x', data=dat).fit()
    print(fit.params)
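Result: the fitted parameters should match the manual calculation, with an Intercept of about 10.0 and an x coefficient of about 5.0.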
3) The LinearRegression class in sklearn's linear_model submodule
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression

    x = np.array([1, 3, 2, 1, 3])
    y = np.array([14, 24, 18, 17, 27])
    model = LinearRegression(fit_intercept=True)
    model.fit(x[:, np.newaxis], y)

    xfit = np.linspace(0, 10, 1000)
    yfit = model.predict(xfit[:, np.newaxis])
    plt.scatter(x, y)
    plt.plot(xfit, yfit)
    plt.show()

    print("Model slope: ", model.coef_[0])
    print("Model intercept: ", model.intercept_)
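Note that this method expects array inputs: x must be a 2-D array of shape (n_samples, n_features), which is why x[:, np.newaxis] is used, and y is a 1-D array. The script prints: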
Model slope: 4.999999999999998
Model intercept: 10.000000000000004
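The shape requirement can be checked directly. The sketch below reuses the same five points and shows two equivalent ways to turn the 1-D x into a single-feature column, plus what happens when the 1-D array is passed unchanged. The variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1, 3, 2, 1, 3])
y = np.array([14, 24, 18, 17, 27])

# Two equivalent ways to build the (n_samples, 1) feature matrix
X_newaxis = x[:, np.newaxis]
X_reshape = x.reshape(-1, 1)
print(x.shape, X_newaxis.shape, X_reshape.shape)

model = LinearRegression().fit(X_reshape, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Passing the 1-D array directly is rejected by scikit-learn's input validation
rejected = False
try:
    LinearRegression().fit(x, y)
except ValueError as err:
    rejected = True
    print("1-D input rejected:", str(err).splitlines()[0])
```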
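4. Multiple linear regression
Python offers two ways to build a multiple linear regression model: the linear_model submodule of sklearn, already used above for simple linear regression, and the ols function from the statsmodels statistical modeling package.
1) statsmodels (the ols function)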
    # model_selection supports cross-validation; it splits the data into training and test sets
    from sklearn import model_selection
    import statsmodels.api as sm
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    Profit = pd.read_excel(r'Predict to Profit.xlsx')
    Profit.head()
The State variable in the dataset is categorical rather than continuous, so it must be converted into dummy variables.
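    # State is a discrete variable, so it is encoded as dummy variables via C(State)
    train, test = model_selection.train_test_split(Profit, test_size=0.2, random_state=1234)
    # Marketing_Spend replaces the duplicated Administration term in the formula
    model = sm.formula.ols('Profit~RD_Spend+Administration+Marketing_Spend+C(State)', data=train).fit()
    # regression coefficients
    model.params
    # overall model summary
    model.summary()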
Only two regression coefficients appear for State in the result. State has three levels, and the remaining one, California, is used as the reference group.
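The reference-group behavior can be illustrated without the Excel file. The sketch below uses a small made-up State column (hypothetical data, not the Profit dataset) and pandas `get_dummies`: three levels yield three indicator columns, and dropping the first level in sorted order (California) leaves two. This mirrors the default treatment coding behind `C(State)`, where an all-zero row stands for the reference level.

```python
import pandas as pd

# Hypothetical toy column, used only to illustrate the encoding
states = pd.Series(["New York", "California", "Florida", "California"], name="State")

# One indicator column per level
full = pd.get_dummies(states, prefix="State", dtype=int)

# Drop the first level (in sorted order) so it acts as the reference group
reduced = pd.get_dummies(states, prefix="State", drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

Rows for California are all zeros in `reduced`, so the coefficients of the remaining two columns measure the difference from California.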
Model prediction results:
2) sklearn's linear_model submodule.
Import the modules and generate the dummy variables
    from sklearn import preprocessing  # imported in the original, but not used below
    from sklearn import model_selection
    from sklearn.linear_model import LinearRegression
    import pandas as pd
    import numpy as np

    Profit = pd.read_excel(r'Predict to Profit.xlsx')
    # convert State into dummy variables
    dummy_Profit = pd.get_dummies(Profit['State'], prefix='State')
    Profit_d = Profit.join(dummy_Profit).drop('State', axis=1)
    # reorder the columns so that the target, Profit, comes last
    columns = ['RD_Spend', 'Administration', 'Marketing_Spend',
               'State_California', 'State_Florida', 'State_New York', 'Profit']
    Profit_d = Profit_d[columns]
First five rows of the transformed dataset:
Model training and prediction:
    train, test = model_selection.train_test_split(Profit_d, test_size=0.2, random_state=1234)
    model = LinearRegression(fit_intercept=True)
    # every column except the last is a feature; the last column is Profit
    model.fit(train.iloc[:, :-1], train.iloc[:, -1])
    print(model.intercept_)
    print(model.coef_)

    test_X = test.drop(labels='Profit', axis=1)
    pred = model.predict(test_X)
    print(pd.DataFrame({'prediction': pred, 'real': test.Profit}))
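Prediction results:
Comparing the two approaches: when a linear regression model is built with the ols function in statsmodels, a discrete variable in the dataset is turned into dummy variables by wrapping it as a categorical term, C(variable), in the formula. With linear_model, the discrete variable has to be encoded before fitting, here with the get_dummies() function. Note that get_dummies() belongs to pandas, not to sklearn's preprocessing module.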
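To make the comparison concrete, the following sketch fits both kinds of model on a small synthetic dataset and checks that they produce the same fitted values. Everything about the data is an assumption for illustration: the column names `spend` and `profit`, the state offsets, and the noise level are made up and unrelated to the Profit dataset. `drop_first=True` is used on the sklearn side so that both models share the same reference level.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "spend": rng.uniform(0, 100, n),
    "State": rng.choice(["California", "Florida", "New York"], n),
})
offsets = toy["State"].map({"California": 0.0, "Florida": 5.0, "New York": -3.0})
toy["profit"] = 20 + 0.8 * toy["spend"] + offsets + rng.normal(0, 1, n)

# Approach 1: statsmodels formula, categorical term handled by C(State)
ols_fit = smf.ols("profit ~ spend + C(State)", data=toy).fit()

# Approach 2: encode State with pandas first, then fit sklearn's LinearRegression
X = pd.get_dummies(toy[["spend", "State"]], columns=["State"],
                   drop_first=True, dtype=float)
sk_fit = LinearRegression().fit(X, toy["profit"])

print(ols_fit.params)
print(sk_fit.intercept_, dict(zip(X.columns, sk_fit.coef_)))

# Both models describe the same least-squares fit
same = np.allclose(ols_fit.fittedvalues, sk_fit.predict(X))
print("same fitted values:", same)
```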
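3) With the first approach (the ols function), the reference level of the dummy variable is chosen automatically. To choose the reference level yourself, first generate the dummy variables with the pandas get_dummies() function, then drop the dummy column that corresponds to the desired reference level.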
    # Use New York as the reference level of State
    # dtype=int keeps the dummy columns numeric, so the formula treats them as plain regressors
    dummies = pd.get_dummies(Profit.State, prefix='State', dtype=int)
    Profit_New = pd.concat([Profit, dummies], axis=1)
    # drop the original State column and the dummy for the reference level
    Profit_New.drop(labels=['State', 'State_New York'], axis=1, inplace=True)
    train, test = model_selection.train_test_split(Profit_New, test_size=0.2, random_state=1234)
    model = sm.formula.ols('Profit~RD_Spend+Administration+Marketing_Spend+State_California+State_Florida', data=train).fit()
    model.params
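The partial regression coefficients with New York as the reference group are as follows:
The resulting regression equation is: Profit = 58068.048193 + 0.803487*RD_Spend - 0.057792*Administration + 0.013779*Marketing_Spend + 513.468310*State_California + 1440.862734*State_Florida. Holding the other variables constant, each additional dollar of RD_Spend increases Profit by 0.803487 dollars. Relative to the New York baseline, selling the product in Florida increases expected profit by 1440.862734 dollars.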
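The interpretation of the coefficients can be checked by plugging values into the equation. The sketch below hard-codes the coefficients quoted above. The function name `predict_profit` and the input amounts are hypothetical, chosen only to show how the dummy terms switch on and off.

```python
# Coefficients copied from the regression equation above (New York is the reference level)
COEF = {
    "Intercept": 58068.048193,
    "RD_Spend": 0.803487,
    "Administration": -0.057792,
    "Marketing_Spend": 0.013779,
    "State_California": 513.468310,
    "State_Florida": 1440.862734,
}

def predict_profit(rd_spend, administration, marketing_spend, state):
    """Evaluate the fitted equation for one observation (illustrative helper)."""
    if state not in ("California", "Florida", "New York"):
        raise ValueError("unknown state: " + state)
    profit = (COEF["Intercept"]
              + COEF["RD_Spend"] * rd_spend
              + COEF["Administration"] * administration
              + COEF["Marketing_Spend"] * marketing_spend)
    # New York has no dummy column, so it contributes nothing beyond the intercept
    return profit + COEF.get("State_" + state, 0.0)

# Hypothetical spending figures, identical across the three states
inputs = dict(rd_spend=100000.0, administration=120000.0, marketing_spend=300000.0)
ny = predict_profit(state="New York", **inputs)
fl = predict_profit(state="Florida", **inputs)
ca = predict_profit(state="California", **inputs)

print("Florida - New York:   ", fl - ny)
print("California - New York:", ca - ny)
```

With identical spending, the gaps between states equal the dummy coefficients, which is what "relative to the New York baseline" means.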
Generate predictions:
    test_X = test.drop('Profit', axis=1)
    pred = model.predict(test_X)
    print(pd.DataFrame({"prediction": pred, "real": test.Profit}))
Compare with the actual test values:
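Beyond reading the prediction and real columns side by side, the comparison can be summarized with a few standard error metrics. This is a minimal sketch: the helper `regression_metrics` and the four demo values are made up for illustration. In the workflow above, `test.Profit` and `pred` would take their place.

```python
import numpy as np

def regression_metrics(real, pred):
    """Return RMSE, MAE and R^2 for paired real and predicted values (illustrative helper)."""
    real = np.asarray(real, dtype=float)
    pred = np.asarray(pred, dtype=float)
    resid = real - pred
    ss_res = np.sum(resid ** 2)                  # residual sum of squares
    ss_tot = np.sum((real - real.mean()) ** 2)   # total sum of squares
    return {
        "rmse": float(np.sqrt(np.mean(resid ** 2))),
        "mae": float(np.mean(np.abs(resid))),
        "r2": float(1.0 - ss_res / ss_tot),
    }

# Hypothetical values, not taken from the Profit dataset
real_demo = [100.0, 150.0, 200.0, 250.0]
pred_demo = [110.0, 140.0, 210.0, 240.0]

metrics = regression_metrics(real_demo, pred_demo)
print(metrics)
```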