R Review Notes (Periodic Summary) – Regression
1. Data Description
- User_ID: user ID
- Gender: gender, M for male, F for female
- Age: age group, divided into six brackets: 0-17, 18-25, 26-35, 36-45, 46-55, 55+
- Occupation: occupation, already encoded as numeric labels, 21 categories in total
- Stay_In_Current_City_Years: years of residence in the current city, five categories: 0, 1, 2, 3, 4+
- Marital_Status: marital status, 0 for unmarried, 1 for married
- 件数: number of items bought in this purchase
- 消费总额: total amount the user spent in this purchase, in US dollars
First, read in the data. file.choose() opens a file-selection dialog and returns the chosen file's path, which is extremely handy!
#install.packages("Rserve")
library("Rserve")
Rserve()
Starting Rserve...
"C:\Users\LENOVO\DOCUME1\R\WIN-LI1\3.3\Rserve\libs\x64\Rserve.exe"
# connect R to Tableau
file.choose()
[1] "F:\新建文件夹 (6)\黑色星期五\book233用户信息.csv"
# read the file from the path returned above
user=read.csv("F:\\新建文件夹 (6)\\黑色星期五\\book233用户信息.csv")
# check the data structure
str(user)
'data.frame': 1047 obs. of 8 variables:
$ User_ID : int 1000001 1000003 1000005 1000006 1000015 1000019 1000020 1000022 1000024 1000033 ...
$ Gender : Factor w/ 3 levels "","F","M": 2 3 3 2 3 3 3 3 2 3 ...
$ Age : Factor w/ 8 levels "","0-17","18-25",..: 2 4 4 7 4 2 4 3 4 6 ...
$ Occupation : int 10 15 20 9 7 10 14 15 7 3 ...
$ Stay_In_Current_City_Years: Factor w/ 6 levels "","0","1","2",..: 4 5 3 3 3 5 2 6 5 3 ...
$ Marital_Status : int 0 0 1 0 0 0 0 0 1 1 ...
$ 件数 : int 34 29 106 46 116 144 12 155 76 215 ...
$ 消费总额 : int 333481 341635 821001 379450 1047124 1457938 185747 1279678 720850 1940043 ...
2. Data Preprocessing
(1) Drop the first column, User_ID. When fitting total spend from users' personal attributes, the user ID obviously cannot serve as a predictor (a unique identifier says nothing about the distribution of the outcome).
(2) Drop the number of items purchased. Because detailed purchase information is missing (which product categories were bought, how many items of each category, etc.), this variable is of little use as a predictor for describing the distribution.
(3) Handle missing values. A check shows there are very few missing values, so simply deleting those rows works well.
> #install.packages("mice")
> library("mice")
> md.pattern(user) # check for missing values
Gender Age Stay_In_Current_City_Years 件数 消费总额 User_ID Occupation Marital_Status
1045 1 1 1 1 1 1 1 1 0
1 1 1 1 1 1 0 0 0 3
1 1 1 1 0 0 0 0 0 5
0 0 0 1 1 2 2 2 8
> users=na.omit(user) # drop rows with missing values
> md.pattern(users)
User_ID Gender Age Occupation Stay_In_Current_City_Years Marital_Status 件数 消费总额
[1,] 1 1 1 1 1 1 1 1 0
[2,] 0 0 0 0 0 0 0 0 0
> # no missing values remain
> users_1=users[,-7]
> users_12=users_1[,-1]
> # drop column 7 (件数) and column 1 (User_ID)
> str(users_12)
'data.frame': 1045 obs. of 6 variables:
$ Gender : Factor w/ 3 levels "","F","M": 2 3 3 2 3 3 3 3 2 3 ...
$ Age : Factor w/ 8 levels "","0-17","18-25",..: 2 4 4 7 4 2 4 3 4 6 ...
$ Occupation : int 10 15 20 9 7 10 14 15 7 3 ...
$ Stay_In_Current_City_Years: Factor w/ 6 levels "","0","1","2",..: 4 5 3 3 3 5 2 6 5 3 ...
$ Marital_Status : int 0 0 1 0 0 0 0 0 1 1 ...
$ 消费总额 : int 333481 341635 821001 379450 1047124 1457938 185747 1279678 720850 1940043 ...
(4) Convert variable types. The categorical variables for occupation and marital status were read in as integers, so convert both to factors.
(5) Inspect and handle outliers. From the data description, only total spend (消费总额) can contain outliers. They are high-spending records; given the problem background we decide not to treat them, since they most likely reflect the spending behaviour of a small group of consumers, which is consistent with common sense.
(6) Build the training and test sets.
> users_12$Occupation= as.factor(users_12$Occupation)
> users_12$Marital_Status= as.factor(users_12$Marital_Status)
> str(users_12)
'data.frame': 1045 obs. of 6 variables:
$ Gender : Factor w/ 3 levels "","F","M": 2 3 3 2 3 3 3 3 2 3 ...
$ Age : Factor w/ 8 levels "","0-17","18-25",..: 2 4 4 7 4 2 4 3 4 6 ...
$ Occupation : Factor w/ 21 levels "0","1","2","3",..: 11 16 21 10 8 11 15 16 8 4 ...
$ Stay_In_Current_City_Years: Factor w/ 6 levels "","0","1","2",..: 4 5 3 3 3 5 2 6 5 3 ...
$ Marital_Status : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 2 2 ...
$ 消费总额 : int 333481 341635 821001 379450 1047124 1457938 185747 1279678 720850 1940043 ...
> boxplot(users_12$消费总额, col="yellow") # boxplot to check for outliers
> boxplot.stats(users_12$消费总额)
$stats
[1] 45551 281780 730131 1672669 3737504
$n
[1] 1045
$conf
[1] 662149.4 798112.6
$out
[1] 4355777 6573609 5212846 6310604 4997527 4647555 3917492 4681205 4255176 4054112
[11] 5499812 3770941 6511302 3786677 4003012 5628295 4728932 6387899 4178546 3888766
[21] 5805353 4503530 5136424 5103795 3977702 4055317 8699232 4358776 3797112 6817493
[31] 5549841 5166938 4433272 4135916 4032859 7577505 4303859 6126540 4453785 5673106
[41] 3955182 6476786 4028509 4528519 6186498 5961987 4384924 4664260 5153189 4622308
[51] 6044178 4152683 4094730 3847749 4836540 10536783 4256751 5733683 6565878 4006176
[61] 5129726 5150348 4642305 4689382 4174884 4458155 3824963 4098692 4246978 5075337
[71] 5985405 4354802
> # build the training and test sets
> ind = sample(2,nrow(users_12),replace = TRUE,prob=c(0.7,0.3))
> train=users_12[ind==1,]
> test=users_12[ind==2,]
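Note that no seed is set before sample(), so the split above will differ from run to run. A reproducible variant would look like this (a sketch, not part of the original transcript):
# Reproducible 70/30 split: fix the seed before drawing the random group labels
set.seed(12)
ind = sample(2, nrow(users_12), replace = TRUE, prob = c(0.7, 0.3))
train = users_12[ind == 1, ]
test = users_12[ind == 2, ]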
3. Predicting Total Spend with Multiple Linear Regression
# multiple linear regression
> set.seed((12))
> users_lm=lm(消费总额~Gender+Age+Occupation+Stay_In_Current_City_Years,data=train)
> users_lm
Call:
lm(formula = 消费总额 ~ Gender + Age + Occupation + Stay_In_Current_City_Years,
data = train)
Coefficients:
(Intercept) GenderM Age18-25
1112974 330122 -204918
Age26-35 Age36-45 Age46-50
3245 -195851 -274017
Age51-55 Age55+ Occupation1
-573862 -857404 152753
Occupation2 Occupation3 Occupation4
133586 277339 190566
Occupation5 Occupation6 Occupation7
69889 792490 145483
Occupation8 Occupation9 Occupation10
-604240 -469454 -203505
Occupation11 Occupation12 Occupation13
207628 -313140 -330398
Occupation14 Occupation15 Occupation16
341738 36168 922765
Occupation17 Occupation18 Occupation19
-83717 280807 78211
Occupation20 Stay_In_Current_City_Years1 Stay_In_Current_City_Years2
652190 -208613 -33459
Stay_In_Current_City_Years3 Stay_In_Current_City_Years4+
-185424 -208820
> summary(users_lm)
Call:
lm(formula = 消费总额 ~ Gender + Age + Occupation + Stay_In_Current_City_Years,
data = train)
Residuals:
Min 1Q Median 3Q Max
-1933294 -833549 -358080 413600 8366773
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1112974 534216 2.083 0.03758 *
GenderM 330122 112565 2.933 0.00347 **
Age18-25 -204918 499728 -0.410 0.68189
Age26-35 3245 501646 0.006 0.99484
Age36-45 -195851 510165 -0.384 0.70117
Age46-50 -274017 540614 -0.507 0.61241
Age51-55 -573863 526379 -1.090 0.27600
Age55+ -857404 557787 -1.537 0.12471
Occupation1 152753 210666 0.725 0.46864
Occupation2 133586 247114 0.541 0.58897
Occupation3 277339 298621 0.929 0.35335
Occupation4 190566 196994 0.967 0.33370
Occupation5 69890 457117 0.153 0.87853
Occupation6 792490 352716 2.247 0.02496 *
Occupation7 145483 203931 0.713 0.47584
Occupation8 -604240 662511 -0.912 0.36206
Occupation9 -469454 664999 -0.706 0.48046
Occupation10 -203505 513061 -0.397 0.69175
Occupation11 207628 364321 0.570 0.56893
Occupation12 -313140 230735 -1.357 0.17517
Occupation13 -330398 541278 -0.610 0.54179
Occupation14 341738 262857 1.300 0.19400
Occupation15 36168 316049 0.114 0.90892
Occupation16 922765 297426 3.103 0.00200 **
Occupation17 -83717 248834 -0.336 0.73664
Occupation18 280807 594842 0.472 0.63703
Occupation19 78211 436344 0.179 0.85780
Occupation20 652190 235661 2.767 0.00580 **
Stay_In_Current_City_Years1 -208613 149594 -1.395 0.16360
Stay_In_Current_City_Years2 -33459 171450 -0.195 0.84533
Stay_In_Current_City_Years3 -185424 174960 -1.060 0.28960
Stay_In_Current_City_Years4+ -208820 178018 -1.173 0.24119
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1288000 on 697 degrees of freedom
Multiple R-squared: 0.08683, Adjusted R-squared: 0.04622
F-statistic: 2.138 on 31 and 697 DF, p-value: 0.0003726
> lm_predictions= predict(users_lm, test)
> #install.packages("gmodels")
> library("gmodels")
> # compute the relative error
> w_lm=mean((lm_predictions-test$消费总额)^2)/ mean((mean(test$消费总额)- test$消费总额)^2)
> w_lm
[1] 1.007764
> w__lm=mean((lm_predictions-test$消费总额)^2)
> w__lm
[1] 2.127004e+12
> plot(lm_predictions,test$消费总额)
>
The t-test results are passable: a few coefficients (GenderM, Occupation6, Occupation16, Occupation20) are significant, and the overall F-test p-value (0.0003726) is also highly significant. The adjusted R-squared of 0.04622, however, is very poor: the predictors explain only about 4.6% of the variation in the response.
The model's relative error is 1.007764 and its mean squared error on the test set is 2.127004e+12.
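Since these two metrics are reused for every model below, they can be wrapped in a small helper (a sketch; the name eval_metrics is mine, not from the original post). The relative error is the test MSE divided by the MSE of a constant mean predictor, so a value at or above 1 means the model does no better than always predicting the test-set mean.
# Helper computing the two metrics used throughout this post:
# test MSE, and relative error = test MSE / MSE of a mean-only predictor
# (roughly 1 minus the out-of-sample R-squared).
eval_metrics = function(pred, actual) {
  mse = mean((pred - actual)^2)
  rel = mse / mean((mean(actual) - actual)^2)
  c(MSE = mse, relative_error = rel)
}
eval_metrics(lm_predictions, test$消费总额)  # reproduces the two numbers above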
4. Predicting Total Spend with Random Forests
randomForest(x, y=NULL, ntree=500, importance=FALSE, localImp=FALSE, nPerm=1, mtry, proximity)
x: the predictor variables (a data frame or matrix), used when the formula interface is not used
y: the response vector
ntree: the number of trees to grow
nPerm: how many times the OOB data are permuted per tree when assessing variable importance
mtry: the number of variables randomly sampled as split candidates at each node
proximity=TRUE: compute the proximity matrix among the rows
importance=TRUE: assess the importance of the predictor variables; a short sketch of these arguments follows.
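As a quick illustration of the x/y interface and of the mtry and proximity arguments, here is a minimal sketch on the built-in iris data (not part of the original analysis):
# randomForest via the x/y interface, with importance and proximity enabled
library(randomForest)
set.seed(1)
rf_demo = randomForest(x = iris[, 1:4],   # predictors as a data frame
                       y = iris$Species,  # response vector
                       ntree = 200,       # number of trees
                       mtry = 2,          # variables tried at each split
                       importance = TRUE, # assess variable importance
                       proximity = TRUE)  # compute the proximity matrix
rf_demo$importance      # importance measures for each predictor
dim(rf_demo$proximity)  # 150 x 150 proximity matrix among the rows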
> # random forest
> set.seed(123)
> library("randomForest")
> users_tree=randomForest(消费总额~.,data=train,importance=TRUE,ntree=100)
> print(users_tree)
Call:
randomForest(formula = 消费总额 ~ ., data = train, importance = TRUE, ntree = 100)
Type of random forest: regression
Number of trees: 100
No. of variables tried at each split: 1
Mean of squared residuals: 1.720726e+12
% Var explained: 0.96
> importance((users_tree))
%IncMSE IncNodePurity
Gender 4.6765780 1.445564e+13
Age 4.0652253 4.626723e+13
Occupation 2.7830964 9.723711e+13
Stay_In_Current_City_Years 2.3300941 3.103846e+13
Marital_Status 0.7702007 1.082766e+13
> tree_predictions= predict(users_tree, test)
> t_lm=mean((tree_predictions-test$消费总额)^2)/ mean((mean(test$消费总额)- test$消费总额)^2)
> t_lm
[1] 1.001825
> t__lm=mean((tree_predictions-test$消费总额)^2)
> t__lm
[1] 2.114469e+12
> # change the number of trees in the forest
> users_tree_2=randomForest(消费总额~.,data=train,importance=TRUE,ntree=1000)
> print(users_tree_2)
Call:
randomForest(formula = 消费总额 ~ ., data = train, importance = TRUE, ntree = 1000)
Type of random forest: regression
Number of trees: 1000
No. of variables tried at each split: 1
Mean of squared residuals: 1.720764e+12
% Var explained: 0.95
> importance((users_tree_2))
%IncMSE IncNodePurity
Gender 9.670613 1.709002e+13
Age 6.188277 4.585156e+13
Occupation 7.528485 1.080333e+14
Stay_In_Current_City_Years 3.679895 3.365439e+13
Marital_Status 1.603983 9.651066e+12
> tree_2_predictions= predict(users_tree_2, test)
> t2_lm=mean((tree_2_predictions-test$消费总额)^2)/ mean((mean(test$消费总额)- test$消费总额)^2)
> t2_lm
[1] 1.001088
> t2__lm=mean((tree_2_predictions-test$消费总额)^2)
> t2__lm
[1] 2.112913e+12
>
>
> # change the number of trees in the forest
> users_tree_3=randomForest(消费总额~.,data=train,importance=TRUE,ntree=5)
> print(users_tree_3)
Call:
randomForest(formula = 消费总额 ~ ., data = train, importance = TRUE, ntree = 5)
Type of random forest: regression
Number of trees: 5
No. of variables tried at each split: 1
Mean of squared residuals: 1.999001e+12
% Var explained: -15.06
> importance((users_tree_3))
%IncMSE IncNodePurity
Gender 2.1048620 2.034920e+13
Age 2.0914720 7.205510e+13
Occupation -0.1595652 1.474722e+14
Stay_In_Current_City_Years 0.3312179 4.161017e+13
Marital_Status 0.1408100 2.436103e+13
> tree_3_predictions= predict(users_tree_3, test)
> t3_lm=mean((tree_3_predictions-test$消费总额)^2)/ mean((mean(test$消费总额)- test$消费总额)^2)
> t3_lm
[1] 1.0148
> t3__lm=mean((tree_3_predictions-test$消费总额)^2)
> t3__lm
[1] 2.141854e+12
Comparison of results
Number of trees | % Var explained | Relative error on the test set
---|---|---
100 | 0.96 | 1.001825
1000 | 0.95 | 1.001088
5 | -15.06 | 1.0148
An important figure here is % Var explained, a goodness-of-fit measure that plays much the same role as R-squared did in the regression analysis above.
In general, more trees give better and more stable predictions but slower computation. Varying the number of trees, we find that by goodness of fit the best model is not the 1000-tree one but the 100-tree one.
On the test set, however, the error behaves as expected: going from 5 trees to 100 brings a sizable improvement in relative error, while going from 100 to 1000 trees does not bring a comparable gain. A small sketch for sweeping over tree counts follows.
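If one wanted to automate this comparison, a simple loop over candidate tree counts could be used (a sketch under the same train/test split as above; results will vary with the seed):
# Sweep over several tree counts and report the test-set relative error
library(randomForest)
for (nt in c(5, 100, 500, 1000)) {
  set.seed(123)
  fit = randomForest(消费总额 ~ ., data = train, ntree = nt)
  pred = predict(fit, test)
  rel = mean((pred - test$消费总额)^2) /
        mean((mean(test$消费总额) - test$消费总额)^2)
  cat("ntree =", nt, " relative error =", round(rel, 4), "\n")
}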
The varImpPlot() function draws a plot of the predictors ranked by importance:
varImpPlot(users_tree)
Thoughts
Clearly the random forest fits slightly better than the multiple linear regression, but the fit is still far too poor. This is probably because all the predictors are categorical: using them to fit a numeric response is a stretch, since categorical variables have only a limited number of combinations while a numeric variable varies far more.
For data of this kind it would be more natural to bin the response variable and fit it as a classification problem; a rough sketch of this idea follows.
(One could also try adding interaction terms to the multiple regression model to allow for more variation.)
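As an illustration of the binning idea (a sketch only; the three-level cut points are my own choice, not from the original post):
# Bin total spend into three ordered classes and fit a classification forest
# (cut points chosen arbitrarily for illustration)
library(randomForest)
breaks = c(-Inf, 5e5, 2e6, Inf)
train$spend_level = cut(train$消费总额, breaks, labels = c("low", "medium", "high"))
test$spend_level = cut(test$消费总额, breaks, labels = c("low", "medium", "high"))
set.seed(123)
rf_cls = randomForest(spend_level ~ . - 消费总额, data = train, ntree = 500)
mean(predict(rf_cls, test) == test$spend_level)  # classification accuracy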
An attempt at adding quadratic terms
The attempt turned up another confusing point.
After adding the "squared" terms, the adjusted R-squared rose from 4.6% to 8.8%, which matched expectations; but after adding further interaction terms it dropped instead of rising, and an unfamiliar phrase appeared: "rank-deficient fit". (Note that in an R formula a term like Gender^2 for a factor does not actually create a squared term, so the gain in the first model comes from the Gender*Age and Occupation*Stay_In_Current_City_Years interactions it already contains.)
Possible cause: strongly correlated predictors, i.e. a design matrix that is not of full rank, can produce this result. Here the three-way interaction Occupation*Stay_In_Current_City_Years*Gender alone creates hundreds of dummy columns, and many of the corresponding factor combinations have no observations among the roughly 729 training rows, so lm() cannot estimate those coefficients and predict() warns about the rank-deficient fit; a small diagnostic sketch is given after the output below. Another suggestion is that the relationship may be adequately linear, in which case plain linear regression is enough.
> # add squared terms
> set.seed((12))
> users_jx_lm=lm(消费总额~Gender+Age+Occupation+Stay_In_Current_City_Years+Gender**2+Age**2+Occupation**2+Stay_In_Current_City_Years**2+Gender*Age+Occupation*Stay_In_Current_City_Years,data=train)
>
> summary(users_jx_lm)
Call:
lm(formula = 消费总额 ~ Gender + Age + Occupation + Stay_In_Current_City_Years +
Gender^2 + Age^2 + Occupation^2 + Stay_In_Current_City_Years^2 +
Gender * Age + Occupation * Stay_In_Current_City_Years, data = train)
# (the summary() output is long, so only the key part is kept)
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1259000 on 620 degrees of freedom
Multiple R-squared: 0.2238, Adjusted R-squared: 0.08856
F-statistic: 1.655 on 108 and 620 DF, p-value: 0.000131
# add interaction terms as well as squared terms
> users_jx_lm=lm(消费总额~Gender+Age+Occupation+Stay_In_Current_City_Years+Gender**2+Age**2+Occupation**2+Stay_In_Current_City_Years**2+Gender*Age+Occupation*Stay_In_Current_City_Years*Gender+Age*Occupation+Stay_In_Current_City_Years*Age,data=train)
> summary(users_jx_lm)
Call:
lm(formula = 消费总额 ~ Gender + Age + Occupation + Stay_In_Current_City_Years +
Gender^2 + Age^2 + Occupation^2 + Stay_In_Current_City_Years^2 +
Gender * Age + Occupation * Stay_In_Current_City_Years *
Gender + Age * Occupation + Stay_In_Current_City_Years *
Age, data = train)
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1296000 on 472 degrees of freedom
Multiple R-squared: 0.3745, Adjusted R-squared: 0.03519
F-statistic: 1.104 on 256 and 472 DF, p-value: 0.1805
> lm_jx_predictions= predict(users_jx_lm, test)
Warning message:
In predict.lm(users_jx_lm, test) : prediction from a rank-deficient fit may be misleading
> # compute the relative error
> w_jx_lm=mean((lm_jx_predictions-test$消费总额)^2)/ mean((mean(test$消费总额)- test$消费总额)^2)
> w_jx_lm
[1] 1.491868
> w__jx_lm=mean((lm_jx_predictions-test$消费总额)^2)
> w__jx_lm
[1] 3.148763e+12
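To see the rank deficiency directly, one can count the coefficients that lm() was unable to estimate (a small diagnostic sketch, assuming the users_jx_lm object fitted above):
# Coefficients that lm() set to NA are the non-estimable (aliased) ones;
# a large count confirms the design matrix is far from full column rank.
sum(is.na(coef(users_jx_lm)))  # number of aliased coefficients
length(coef(users_jx_lm))      # total number of coefficient columns
# alias(users_jx_lm) would list the exact linear dependencies, but its
# output is very long for a model with this many interaction dummies.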
Source: CSDN
Author: Felis catus
Link: https://blog.csdn.net/weixin_44696674/article/details/88072492