Automate regression with specific dependent and independent variables

Submitted by 。_饼干妹妹 on 2019-12-13 07:06:05

Question


MVE: Let this be the data set:

data <- data.frame(year = rep(seq(1966,2015,1), 8), 
               county = c(rep('prva', 50), rep('druga', 50), rep('treća', 50), rep('četvrta', 50),
                          rep('peta', 50), rep('šesta', 50), rep('sedma', 50), rep('osma', 50)),
               crime1 = runif(400), crime2 = runif(400), crime3 = runif(400), 
               uvar1 = runif(400), uvar2 = runif(400), uvar3 = runif(400),
               var1 = runif(400), var2 = runif(400), var3 = runif(400), var4 = runif(400), var5 = runif(400))

Let's say crime1, crime2 and crime3 are the dependent variables, uvar1, uvar2 and uvar3 are the model-specific independent variables, and var1, var2, etc. are other covariates shared by all models. What I'm trying to do is automate the regressions.

Namely, I want to get the result of this code:

plm(log(crime1) ~ log(uvar1) + log(var1) + log(var2) + log(var3) + log(var4), model = 'within', effect = 'twoways', data = data)

plm(log(crime2) ~ log(uvar2) + log(var1) + log(var2) + log(var3) + log(var4), model = 'within', effect = 'twoways', data = data)

etc.; but without writing 20 lines of code for each estimated model.

By looking at similar questions, this is as far as I've got:

crime <- c('crime1', 'crime2', 'crime3')
plm.results <- lapply(data[, crime], function(y) plm(y ~ var1 + var2 + var3 + var4, 
                                                     model = 'within', effect ='twoways', data = data))

This certainly helps for my dependent variables, but I cannot figure out how to include a specific independent variable in each of these estimations. To clarify once more, I want uvar1 to be in the first regression but not in the others, uvar2 only in the second, and so on.


Answer 1:


The formula function is helpful when creating several related models. You can build each variation as a string with paste0, convert it with formula, and use lapply to iterate over the indices 1 to 3.

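library(plm)

#remember to call set.seed before sampling from distributions (i.e. before building the data)
set.seed(123)

#regDF is assumed to be the data frame called `data` in the question
regDF = data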

#a helper function to create "log(var)" from "var"
fn_appendLog = function(x) {
 paste0("log(",x,")")
}



modelList = lapply(1:3, function(i) {

  #covariates shared by every model: all columns whose name starts with "v"
  indepVars2 = Reduce(function(a, b) paste(a, b, sep = "+"),
                      lapply(colnames(regDF)[grepl("^v", colnames(regDF))], fn_appendLog))

  #> indepVars2
  #[1] "log(var1)+log(var2)+log(var3)+log(var4)+log(var5)"

  #model-specific regressor and response, e.g. uvar1 and crime1 for i = 1
  indepVars1 = fn_appendLog(paste0("uvar", i))
  depVar = fn_appendLog(paste0("crime", i))

  formulaVar = formula(paste0(depVar, " ~ ", indepVars1, "+", indepVars2))

  #> formulaVar
  #log(crime1) ~ log(uvar1) + log(var1) + log(var2) + log(var3) +  log(var4) + log(var5)

  #the fitted model is the last expression, so lapply collects it
  plm(formulaVar, model = 'within', effect = 'twoways', data = regDF)

})

Summary:

summary(modelList[[1]])

#> summary(modelList[[1]])
#Twoways effects Within Model
#
#Call:
#plm(formula = formulaVar, data = regDF, effect = "twoways", model = "within")
#
#Balanced Panel: n=50, T=8, N=400
#
#Residuals :
#   Min. 1st Qu.  Median 3rd Qu.    Max. 
# -5.730  -0.396   0.116   0.599   1.520 
#
#Coefficients :
#             Estimate Std. Error t-value Pr(>|t|)
#log(uvar1)  0.0393871  0.0490891  0.8024   0.4229
#log(var1)  -0.0369356  0.0541029 -0.6827   0.4953
#log(var2)  -0.0455269  0.0543664 -0.8374   0.4030
#log(var3)   0.0150516  0.0520347  0.2893   0.7726
#log(var4)  -0.0034534  0.0441506 -0.0782   0.9377
#log(var5)  -0.0109038  0.0527446 -0.2067   0.8363
#
#Total Sum of Squares:    302.23
#Residual Sum of Squares: 300.6
#R-Squared:      0.0053896
#Adj. R-Squared: 0.0045407
#F-statistic: 0.304357 on 6 and 337 DF, p-value: 0.93448
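
A note on the panel dimensions: the summary above reports n=50 and T=8, which suggests plm used the first two columns (year, then county) as the individual and time index, since no index was given. If county is meant to be the individual and year the time dimension, the index can be passed explicitly. The following is only a minimal sketch under that assumption; the small data frame and its ASCII county names are stand-ins, not the original data.

```r
library(plm)

set.seed(123)
# Stand-in for the question's data: 8 counties x 50 years (hypothetical county names)
counties <- paste0("county", 1:8)
regDF <- data.frame(year   = rep(1966:2015, 8),
                    county = rep(counties, each = 50),
                    crime1 = runif(400), uvar1 = runif(400),
                    var1   = runif(400), var2  = runif(400))

# Name the panel dimensions explicitly: individual first, then time
m <- plm(log(crime1) ~ log(uvar1) + log(var1) + log(var2),
         data = regDF, index = c("county", "year"),
         model = "within", effect = "twoways")

# Inspect the panel dimensions the model actually used
pdim(m)
```

The same index argument can be added to the plm call inside the lapply loop.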

Explanation:

The independent variables are of two types: the model-specific one (uvar1, uvar2 or uvar3) and the shared covariates var1...varN.

1) colnames(regDF)[grepl("^v",colnames(regDF))] returns every column name in regDF that begins with the letter 'v'. The caret (^) anchors the pattern to the start of the string, so uvar1 to uvar3 are not matched. The output at this stage is c("var1","var2",...,"var5")

2) We need the log variants of this variable vector, so we pass them through lapply to the function fn_appendLog, which returns list("log(var1)","log(var2)",...,"log(var5)")

3) Next, we need these terms joined as log(var1)+log(var2)+...+log(var5)

4) To do so, we use Reduce with a two-argument function that pastes its inputs together with sep="+". It takes the result so far, joins it with the next element of the list, and uses "+" as the separator:

   step1 = (log(var1)+log(var2))
   step2 = (log(var1)+log(var2)) + log(var3)
   step3 = (log(var1)+log(var2)+log(var3))+ log(var4) and so on

5) Reduce applies that function along the whole list and folds it into a single string, giving the final output log(var1)+log(var2)+log(var3)+log(var4)+log(var5)
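
The five steps above can also be tried in isolation, without fitting any model. The sketch below uses a plain character vector of the question's column names (so no data frame is needed) and compares the Reduce approach with paste(..., collapse = "+"), which is a shorter base R alternative and not what the answer's code uses. The names cols, covars, logged, viaReduce and viaCollapse are made up for illustration.

```r
# Column names as in the question's data frame
cols <- c("year", "county", "crime1", "crime2", "crime3",
          "uvar1", "uvar2", "uvar3",
          "var1", "var2", "var3", "var4", "var5")

# Step 1: keep the names that start with "v" (uvar* is excluded by the ^ anchor)
covars <- cols[grepl("^v", cols)]

# Step 2: wrap each name in log(); paste0 is vectorised, so lapply is optional
logged <- paste0("log(", covars, ")")

# Steps 3-5: fold the terms pairwise with "+" between them
viaReduce <- Reduce(function(a, b) paste(a, b, sep = "+"), logged)

# Alternative: collapse the whole vector in a single call
viaCollapse <- paste(logged, collapse = "+")

# Build the formula for the first model from the joined string
f <- as.formula(paste0("log(crime1) ~ log(uvar1) + ", viaCollapse))
print(f)
```

Both routes give the same string, so which one to use is mostly a matter of taste.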

This might seem intimidating at first, but as you use these functions more often and explore examples, they will become part of your repertoire in no time. The best way to learn about a function, say lapply, is to read the documentation in ?lapply end to end, run the listed examples, and tinker with the parameters to gain familiarity. Hope this sheds some light on your query.



Source: https://stackoverflow.com/questions/43209809/automate-regression-with-specific-dependent-and-independent-variables
