R Review Notes (Periodic Summary) – Regression
1. Data Description
- User_ID: user ID
- Gender: gender, M for male, F for female
- Age: age group, divided into six brackets: 0-17, 18-25, 26-35, 36-45, 46-55, 55+
- Occupation: occupation, already encoded as numeric labels, 21 categories in total
- Stay_In_Current_City_Years: years of residence in the current city, five categories: 0, 1, 2, 3, 4+
- Marital_Status: marital status, 0 for unmarried, 1 for married
- 件数: number of items bought in this purchase
- 消费总额: total amount the user spent in this purchase, in US dollars
First, read in the data. file.choose() opens a file-selection dialog and returns the chosen file's path, which is extremely handy!
#install.packages("Rserve")
library("Rserve")
Rserve()
Starting Rserve...
"C:\Users\LENOVO\DOCUME1\R\WIN-LI1\3.3\Rserve\libs\x64\Rserve.exe"
# connect R to Tableau
file.choose()
[1] "F:\新建文件夹 (6)\黑色星期五\book233用户信息.csv"
# read the file from the path returned above
user=read.csv("F:\\新建文件夹 (6)\\黑色星期五\\book233用户信息.csv")
# check the data structure
str(user)
'data.frame': 1047 obs. of 8 variables:
$ User_ID : int 1000001 1000003 1000005 1000006 1000015 1000019 1000020 1000022 1000024 1000033 ...
$ Gender : Factor w/ 3 levels "","F","M": 2 3 3 2 3 3 3 3 2 3 ...
$ Age : Factor w/ 8 levels "","0-17","18-25",..: 2 4 4 7 4 2 4 3 4 6 ...
$ Occupation : int 10 15 20 9 7 10 14 15 7 3 ...
$ Stay_In_Current_City_Years: Factor w/ 6 levels "","0","1","2",..: 4 5 3 3 3 5 2 6 5 3 ...
$ Marital_Status : int 0 0 1 0 0 0 0 0 1 1 ...
$ 件数 : int 34 29 106 46 116 144 12 155 76 215 ...
$ 消费总额 : int 333481 341635 821001 379450 1047124 1457938 185747 1279678 720850 1940043 ...
2. Data Preprocessing
(1) Drop the first column, User_ID. When fitting total spend from users' personal attributes, the user ID obviously cannot serve as a predictor (a unique identifier says nothing about the distribution of the outcome).
(2) Drop the number of items purchased. Because detailed purchase information is missing (which product categories were bought, how many items of each category, etc.), this variable is of little use as a predictor for describing the distribution.
(3) Handle missing values. A check shows there are very few missing values, so simply deleting those rows works well.
> #install.packages("mice")
> library("mice")
> md.pattern(user) # check for missing values
Gender Age Stay_In_Current_City_Years 件数 消费总额 User_ID Occupation Marital_Status
1045 1 1 1 1 1 1 1 1 0
1 1 1 1 1 1 0 0 0 3
1 1 1 1 0 0 0 0 0 5
0 0 0 1 1 2 2 2 8
> users=na.omit(user) # drop rows with missing values
> md.pattern(users)
User_ID Gender Age Occupation Stay_In_Current_City_Years Marital_Status 件数 消费总额
[1,] 1 1 1 1 1 1 1 1 0
[2,] 0 0 0 0 0 0 0 0 0
> # no missing values remain
> users_1=users[,-7]
> users_12=users_1[,-1]
> # drop column 7 (件数) and column 1 (User_ID)
> str(users_12)
'data.frame': 1045 obs. of 6 variables:
$ Gender : Factor w/ 3 levels "","F","M": 2 3 3 2 3 3 3 3 2 3 ...
$ Age : Factor w/ 8 levels "","0-17","18-25",..: 2 4 4 7 4 2 4 3 4 6 ...
$ Occupation : int 10 15 20 9 7 10 14 15 7 3 ...
$ Stay_In_Current_City_Years: Factor w/ 6 levels "","0","1","2",..: 4 5 3 3 3 5 2 6 5 3 ...
$ Marital_Status : int 0 0 1 0 0 0 0 0 1 1 ...
$ 消费总额 : int 333481 341635 821001 379450 1047124 1457938 185747 1279678 720850 1940043 ...
(4) Convert variable types. The categorical variables for occupation and marital status were read in as integers, so convert both to factors.
(5) Inspect and handle outliers. From the data description, only total spend (消费总额) can contain outliers. They are high-spending records; given the problem background we decide not to treat them, since they most likely reflect the spending behaviour of a small group of consumers, which is consistent with common sense.
(6) Build the training and test sets.
> users_12$Occupation= as.factor(users_12$Occupation)
> users_12$Marital_Status= as.factor(users_12$Marital_Status)
> str(users_12)
'data.frame': 1045 obs. of 6 variables:
$ Gender : Factor w/ 3 levels "","F","M": 2 3 3 2 3 3 3 3 2 3 ...
$ Age : Factor w/ 8 levels "","0-17","18-25",..: 2 4 4 7 4 2 4 3 4 6 ...
$ Occupation : Factor w/ 21 levels "0","1","2","3",..: 11 16 21 10 8 11 15 16 8 4 ...
$ Stay_In_Current_City_Years: Factor w/ 6 levels "","0","1","2",..: 4 5 3 3 3 5 2 6 5 3 ...
$ Marital_Status : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 2 2 ...
$ 消费总额 : int 333481 341635 821001 379450 1047124 1457938 185747 1279678 720850 1940043 ...
> boxplot(users_12$消费总额, col="yellow") # boxplot to check for outliers
> boxplot.stats(users_12$消费总额)
$stats
[1] 45551 281780 730131 1672669 3737504
$n
[1] 1045
$conf
[1] 662149.4 798112.6
$out
[1] 4355777 6573609 5212846 6310604 4997527 4647555 3917492 4681205 4255176 4054112
[11] 5499812 3770941 6511302 3786677 4003012 5628295 4728932 6387899 4178546 3888766
[21] 5805353 4503530 5136424 5103795 3977702 4055317 8699232 4358776 3797112 6817493
[31] 5549841 5166938 4433272 4135916 4032859 7577505 4303859 6126540 4453785 5673106
[41] 3955182 6476786 4028509 4528519 6186498 5961987 4384924 4664260 5153189 4622308
[51] 6044178 4152683 4094730 3847749 4836540 10536783 4256751 5733683 6565878 4006176
[61] 5129726 5150348 4642305 4689382 4174884 4458155 3824963 4098692 4246978 5075337
[71] 5985405 4354802
> # build the training and test sets
> ind = sample(2,nrow(users_12),replace = TRUE,prob=c(0.7,0.3))
> train=users_12[ind==1,]
> test=users_12[ind==2,]
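Note that no seed is set before sample(), so the split above will differ from run to run. A reproducible variant would look like this (a sketch, not part of the original transcript):
# Reproducible 70/30 split: fix the seed before drawing the random group labels
set.seed(12)
ind = sample(2, nrow(users_12), replace = TRUE, prob = c(0.7, 0.3))
train = users_12[ind == 1, ]
test = users_12[ind == 2, ]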
3. Predicting Total Spend with Multiple Linear Regression
# multiple linear regression
> set.seed((12))
> users_lm=lm(消费总额~Gender+Age+Occupation+Stay_In_Current_City_Years,data=train)
> users_lm
Call:
lm(formula = 消费总额 ~ Gender + Age + Occupation + Stay_In_Current_City_Years,
data = train)
Coefficients:
(Intercept) GenderM Age18-25
1112974 330122 -204918
Age26-35 Age36-45 Age46-50
3245 -195851 -274017
Age51-55 Age55+ Occupation1
-573862 -857404 152753
Occupation2 Occupation3 Occupation4
133586 277339 190566
Occupation5 Occupation6 Occupation7
69889 792490 145483
Occupation8 Occupation9 Occupation10
-604240 -469454 -203505
Occupation11 Occupation12 Occupation13
207628 -313140 -330398
Occupation14 Occupation15 Occupation16
341738 36168 922765
Occupation17 Occupation18 Occupation19
-83717 280807 78211
Occupation20 Stay_In_Current_City_Years1 Stay_In_Current_City_Years2
652190 -208613 -33459
Stay_In_Current_City_Years3 Stay_In_Current_City_Years4+
-185424 -208820
> summary(users_lm)
Call:
lm(formula = 消费总额 ~ Gender + Age + Occupation + Stay_In_Current_City_Years,
data = train)
Residuals:
Min 1Q Median 3Q Max
-1933294 -833549 -358080 413600 8366773
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1112974 534216 2.083 0.03758 *
GenderM 330122 112565 2.933 0.00347 **
Age18-25 -204918 499728 -0.410 0.68189
Age26-35 3245 501646 0.006 0.99484
Age36-45 -195851 510165 -0.384 0.70117
Age46-50 -274017 540614 -0.507 0.61241
Age51-55 -573863 526379 -1.090 0.27600
Age55+ -857404 557787 -1.537 0.12471
Occupation1 152753 210666 0.725 0.46864
Occupation2 133586 247114 0.541 0.58897
Occupation3 277339 298621 0.929 0.35335
Occupation4 190566 196994 0.967 0.33370
Occupation5 69890 457117 0.153 0.87853
Occupation6 792490 352716 2.247 0.02496 *
Occupation7 145483 203931 0.713 0.47584
Occupation8 -604240 662511 -0.912 0.36206
Occupation9 -469454 664999 -0.706 0.48046
Occupation10 -203505 513061 -0.397 0.69175
Occupation11 207628 364321 0.570 0.56893
Occupation12 -313140 230735 -1.357 0.17517
Occupation13 -330398 541278 -0.610 0.54179
Occupation14 341738 262857 1.300 0.19400
Occupation15 36168 316049 0.114 0.90892
Occupation16 922765 297426 3.103 0.00200 **
Occupation17 -83717 248834 -0.336 0.73664
Occupation18 280807 594842 0.472 0.63703
Occupation19 78211 436344 0.179 0.85780
Occupation20 652190 235661 2.767 0.00580 **
Stay_In_Current_City_Years1 -208613 149594 -1.395 0.16360
Stay_In_Current_City_Years2 -33459 171450 -0.195 0.84533
Stay_In_Current_City_Years3 -185424 174960 -1.060 0.28960
Stay_In_Current_City_Years4+ -208820 178018 -1.173 0.24119
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1288000 on 697 degrees of freedom
Multiple R-squared: 0.08683, Adjusted R-squared: 0.04622
F-statistic: 2.138 on 31 and 697 DF, p-value: 0.0003726
> lm_predictions= predict(users_lm, test)
> #install.packages("gmodels")
> library("gmodels")
> # compute the relative error
> w_lm=mean((lm_predictions-test$消费总额)^2)/ mean((mean(test$消费总额)- test$消费总额)^2)
> w_lm
[1] 1.007764
> w__lm=mean((lm_predictions-test$消费总额)^2)
> w__lm
[1] 2.127004e+12
> plot(lm_predictions,test$消费总额)
>
The t-test results are passable: a few coefficients (GenderM, Occupation6, Occupation16, Occupation20) are significant, and the overall F-test p-value (0.0003726) is also highly significant. The adjusted R-squared of 0.04622, however, is very poor: the predictors explain only about 4.6% of the variation in the response.
The model's relative error is 1.007764 and its mean squared error on the test set is 2.127004e+12.
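Since these two metrics are reused for every model below, they can be wrapped in a small helper (a sketch; the name eval_metrics is mine, not from the original post). The relative error is the test MSE divided by the MSE of a constant mean predictor, so a value at or above 1 means the model does no better than always predicting the test-set mean.
# Helper computing the two metrics used throughout this post:
# test MSE, and relative error = test MSE / MSE of a mean-only predictor
# (roughly 1 minus the out-of-sample R-squared).
eval_metrics = function(pred, actual) {
  mse = mean((pred - actual)^2)
  rel = mse / mean((mean(actual) - actual)^2)
  c(MSE = mse, relative_error = rel)
}
eval_metrics(lm_predictions, test$消费总额)  # reproduces the two numbers above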
4. Predicting Total Spend with Random Forests
randomForest(x, y=NULL, ntree=500, importance=FALSE, localImp=FALSE, nPerm=1, mtry, proximity)
x: the predictor variables (a data frame or matrix), used when the formula interface is not used
y: the response vector
ntree: the number of trees to grow
nPerm: how many times the OOB data are permuted per tree when assessing variable importance
mtry: the number of variables randomly sampled as split candidates at each node
proximity=TRUE: compute the proximity matrix among the rows
importance=TRUE: assess the importance of the predictor variables; a short sketch of these arguments follows.
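As a quick illustration of the x/y interface and of the mtry and proximity arguments, here is a minimal sketch on the built-in iris data (not part of the original analysis):
# randomForest via the x/y interface, with importance and proximity enabled
library(randomForest)
set.seed(1)
rf_demo = randomForest(x = iris[, 1:4],   # predictors as a data frame
                       y = iris$Species,  # response vector
                       ntree = 200,       # number of trees
                       mtry = 2,          # variables tried at each split
                       importance = TRUE, # assess variable importance
                       proximity = TRUE)  # compute the proximity matrix
rf_demo$importance      # importance measures for each predictor
dim(rf_demo$proximity)  # 150 x 150 proximity matrix among the rows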
> # random forest
> set.seed(123)
> library("randomForest")
> users_tree=randomForest(消费总额~.,data=train,importance=TRUE,ntree=100)
> print(users_tree)
Call:
randomForest(formula = 消费总额 ~ ., data = train, importance = TRUE, ntree = 100)
Type of random forest: regression
Number of trees: 100
No. of variables tried at each split: 1
Mean of squared residuals: 1.720726e+12
% Var explained: 0.96
> importance((users_tree))
%IncMSE IncNodePurity
Gender 4.6765780 1.445564e+13
Age 4.0652253 4.626723e+13
Occupation 2.7830964 9.723711e+13
Stay_In_Current_City_Years 2.3300941 3.103846e+13
Marital_Status 0.7702007 1.082766e+13
> tree_predictions= predict(users_tree, test)
> t_lm=mean((tree_predictions-test$消费总额)^2)/ mean((mean(test$消费总额)- test$消费总额)^2)
> t_lm
[1] 1.001825
> t__lm=mean((tree_predictions-test$消费总额)^2)
> t__lm
[1] 2.114469e+12
> # change the number of trees in the forest
> users_tree_2=randomForest(消费总额~.,data=train,importance=TRUE,ntree=1000)
> print(users_tree_2)
Call:
randomForest(formula = 消费总额 ~ ., data = train, importance = TRUE, ntree = 1000)
Type of random forest: regression
Number of trees: 1000
No. of variables tried at each split: 1
Mean of squared residuals: 1.720764e+12
% Var explained: 0.95
> importance((users_tree_2))
%IncMSE IncNodePurity
Gender 9.670613 1.709002e+13
Age 6.188277 4.585156e+13
Occupation 7.528485 1.080333e+14
Stay_In_Current_City_Years 3.679895 3.365439e+13
Marital_Status 1.603983 9.651066e+12
> tree_2_predictions= predict(users_tree_2, test)
> t2_lm=mean((tree_2_predictions-test$消费总额)^2)/ mean((mean(test$消费总额)- test$消费总额)^2)
> t2_lm
[1] 1.001088
> t2__lm=mean((tree_2_predictions-test$消费总额)^2)
> t2__lm
[1] 2.112913e+12
>
>
> # change the number of trees in the forest
> users_tree_3=randomForest(消费总额~.,data=train,importance=TRUE,ntree=5)
> print(users_tree_3)
Call:
randomForest(formula = 消费总额 ~ ., data = train, importance = TRUE, ntree = 5)
Type of random forest: regression
Number of trees: 5
No. of variables tried at each split: 1
Mean of squared residuals: 1.999001e+12
% Var explained: -15.06
> importance((users_tree_3))
%IncMSE IncNodePurity
Gender 2.1048620 2.034920e+13
Age 2.0914720 7.205510e+13
Occupation -0.1595652 1.474722e+14
Stay_In_Current_City_Years 0.3312179 4.161017e+13
Marital_Status 0.1408100 2.436103e+13
> tree_3_predictions= predict(users_tree_3, test)
> t3_lm=mean((tree_3_predictions-test$消费总额)^2)/ mean((mean(test$消费总额)- test$消费总额)^2)
> t3_lm
[1] 1.0148
> t3__lm=mean((tree_3_predictions-test$消费总额)^2)
> t3__lm
[1] 2.141854e+12
Comparison of results
Number of trees | % Var explained | Relative error on the test set
---|---|---
100 | 0.96 | 1.001825
1000 | 0.95 | 1.001088
5 | -15.06 | 1.0148
An important figure here is % Var explained, a goodness-of-fit measure that plays much the same role as R-squared did in the regression analysis above.
In general, more trees give better and more stable predictions but slower computation. Varying the number of trees, we find that by goodness of fit the best model is not the 1000-tree one but the 100-tree one.
On the test set, however, the error behaves as expected: going from 5 trees to 100 brings a sizable improvement in relative error, while going from 100 to 1000 trees does not bring a comparable gain. A small sketch for sweeping over tree counts follows.
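If one wanted to automate this comparison, a simple loop over candidate tree counts could be used (a sketch under the same train/test split as above; results will vary with the seed):
# Sweep over several tree counts and report the test-set relative error
library(randomForest)
for (nt in c(5, 100, 500, 1000)) {
  set.seed(123)
  fit = randomForest(消费总额 ~ ., data = train, ntree = nt)
  pred = predict(fit, test)
  rel = mean((pred - test$消费总额)^2) /
        mean((mean(test$消费总额) - test$消费总额)^2)
  cat("ntree =", nt, " relative error =", round(rel, 4), "\n")
}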
The varImpPlot() function draws a plot of the predictors ranked by importance:
varImpPlot(users_tree)
Thoughts
Clearly the random forest fits slightly better than the multiple linear regression, but the fit is still far too poor. This is probably because all the predictors are categorical: using them to fit a numeric response is a stretch, since categorical variables have only a limited number of combinations while a numeric variable varies far more.
For data of this kind it would be more natural to bin the response variable and fit it as a classification problem; a rough sketch of this idea follows.
(One could also try adding interaction terms to the multiple regression model to allow for more variation.)
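As an illustration of the binning idea (a sketch only; the three-level cut points are my own choice, not from the original post):
# Bin total spend into three ordered classes and fit a classification forest
# (cut points chosen arbitrarily for illustration)
library(randomForest)
breaks = c(-Inf, 5e5, 2e6, Inf)
train$spend_level = cut(train$消费总额, breaks, labels = c("low", "medium", "high"))
test$spend_level = cut(test$消费总额, breaks, labels = c("low", "medium", "high"))
set.seed(123)
rf_cls = randomForest(spend_level ~ . - 消费总额, data = train, ntree = 500)
mean(predict(rf_cls, test) == test$spend_level)  # classification accuracy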
An attempt at adding quadratic terms
The attempt turned up another confusing point.
After adding the "squared" terms, the adjusted R-squared rose from 4.6% to 8.8%, which matched expectations; but after adding further interaction terms it dropped instead of rising, and an unfamiliar phrase appeared: "rank-deficient fit". (Note that in an R formula a term like Gender^2 for a factor does not actually create a squared term, so the gain in the first model comes from the Gender*Age and Occupation*Stay_In_Current_City_Years interactions it already contains.)
Possible cause: strongly correlated predictors, i.e. a design matrix that is not of full rank, can produce this result. Here the three-way interaction Occupation*Stay_In_Current_City_Years*Gender alone creates hundreds of dummy columns, and many of the corresponding factor combinations have no observations among the roughly 729 training rows, so lm() cannot estimate those coefficients and predict() warns about the rank-deficient fit; a small diagnostic sketch is given after the output below. Another suggestion is that the relationship may be adequately linear, in which case plain linear regression is enough.
> # add squared terms
> set.seed((12))
> users_jx_lm=lm(消费总额~Gender+Age+Occupation+Stay_In_Current_City_Years+Gender**2+Age**2+Occupation**2+Stay_In_Current_City_Years**2+Gender*Age+Occupation*Stay_In_Current_City_Years,data=train)
>
> summary(users_jx_lm)
Call:
lm(formula = 消费总额 ~ Gender + Age + Occupation + Stay_In_Current_City_Years +
Gender^2 + Age^2 + Occupation^2 + Stay_In_Current_City_Years^2 +
Gender * Age + Occupation * Stay_In_Current_City_Years, data = train)
# (the summary() output is long, so only the key part is kept)
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1259000 on 620 degrees of freedom
Multiple R-squared: 0.2238, Adjusted R-squared: 0.08856
F-statistic: 1.655 on 108 and 620 DF, p-value: 0.000131
# add interaction terms as well as squared terms
> users_jx_lm=lm(消费总额~Gender+Age+Occupation+Stay_In_Current_City_Years+Gender**2+Age**2+Occupation**2+Stay_In_Current_City_Years**2+Gender*Age+Occupation*Stay_In_Current_City_Years*Gender+Age*Occupation+Stay_In_Current_City_Years*Age,data=train)
> summary(users_jx_lm)
Call:
lm(formula = 消费总额 ~ Gender + Age + Occupation + Stay_In_Current_City_Years +
Gender^2 + Age^2 + Occupation^2 + Stay_In_Current_City_Years^2 +
Gender * Age + Occupation * Stay_In_Current_City_Years *
Gender + Age * Occupation + Stay_In_Current_City_Years *
Age, data = train)
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1296000 on 472 degrees of freedom
Multiple R-squared: 0.3745, Adjusted R-squared: 0.03519
F-statistic: 1.104 on 256 and 472 DF, p-value: 0.1805
> lm_jx_predictions= predict(users_jx_lm, test)
Warning message:
In predict.lm(users_jx_lm, test) : prediction from a rank-deficient fit may be misleading
> # compute the relative error
> w_jx_lm=mean((lm_jx_predictions-test$消费总额)^2)/ mean((mean(test$消费总额)- test$消费总额)^2)
> w_jx_lm
[1] 1.491868
> w__jx_lm=mean((lm_jx_predictions-test$消费总额)^2)
> w__jx_lm
[1] 3.148763e+12
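To see the rank deficiency directly, one can count the coefficients that lm() was unable to estimate (a small diagnostic sketch, assuming the users_jx_lm object fitted above):
# Coefficients that lm() set to NA are the non-estimable (aliased) ones;
# a large count confirms the design matrix is far from full column rank.
sum(is.na(coef(users_jx_lm)))  # number of aliased coefficients
length(coef(users_jx_lm))      # total number of coefficient columns
# alias(users_jx_lm) would list the exact linear dependencies, but its
# output is very long for a model with this many interaction dummies.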
Source: CSDN
Author: Felis catus
Link: https://blog.csdn.net/weixin_44696674/article/details/88072492