Automate regression with specific dependent and independent variables

问题

MVE: Let this be the data set:

data <- data.frame(year = rep(seq(1966,2015,1), 8), 
               county = c(rep('prva', 50), rep('druga', 50), rep('treća', 50), rep('četvrta', 50),
                          rep('peta', 50), rep('šesta', 50), rep('sedma', 50), rep('osma', 50)),
               crime1 = runif(400), crime2 = runif(400), crime3 = runif(400), 
               uvar1 = runif(400), uvar2 = runif(400), uvar3 = runif(400),
               var1 = runif(400), var2 = runif(400), var3 = runif(400), var4 = runif(400), var5 = runif(400))

Let's say crime1,2 and 3 are specific dependent variables. uvar1,2 and 3 are specific independent variables. var1,2 etc. are other covariates. What I'm trying to do is automate the regressions.

Namely, I want to get the result of this code:

plm(log(crime1) = log(univar1) + log(var1) + log(var2) + log(var3) + log(var4), model = 'within', effect = 'twoways', data = data)

plm(log(crime2) = log(univar2) + log(var1) + log(var2) + log(var3) + log(var4), model = 'within', effect = 'twoways', data = data)

etc.; but without writing 20 lines of code for each estimated model.

By looking at similar questions, this is as far as I'd come:

crime <- c('crime1', 'crime2', 'crime3')
plm.results <- lapply(data[, crime], function(y) plm(y ~ var1 + var2 + var3 + var4, 
                                                     model = 'within', effect ='twoways', data = data))

Which certainly helps for my dependent variables, but I cannot figure how to include specific independent variables in each of these estimations. To clarify once more, I want univar1 to be in the first regression, but not in the rest of them etc.

回答1:

formula function is helpful when creating multiple sets of models. You could incorporate variations using combination of paste0 and formula with lapply to traverse the indices 1 to 3.

#remember to set.seed when sampling from distributions

set.seed(123)

#a helper function to create "log(var)" from "var"
fn_appendLog = function(x) {
 paste0("log(",x,")")
}



modelList = lapply(1:3,function(x) {


indepVars2 = Reduce(function(x,y) paste(x,y,sep="+"),lapply(colnames(regDF)[grepl("^v",colnames(regDF))],fn_appendLog))

#> indepVars2
#[1] "log(var1)+log(var2)+log(var3)+log(var4)+log(var5)"


indepVars1 = fn_appendLog(paste0("uvar",x))

depVar = fn_appendLog(paste0("crime",x))

formulaVar = formula(paste0(depVar, " ~ ",indepVars1,"+", indepVars2))

#> formulaVar
#log(crime1) ~ log(uvar1) + log(var1) + log(var2) + log(var3) +  log(var4) + log(var5)


modelObj = plm(formulaVar, model = 'within', effect = 'twoways', data = regDF)


})

Summary:

summary(modelList[[1]])

#> summary(modelList[[1]])
#Twoways effects Within Model
#
#Call:
#plm(formula = formulaVar, data = regDF, effect = "twoways", model = "within")
#
#Balanced Panel: n=50, T=8, N=400
#
#Residuals :
#   Min. 1st Qu.  Median 3rd Qu.    Max. 
# -5.730  -0.396   0.116   0.599   1.520 
#
#Coefficients :
#             Estimate Std. Error t-value Pr(>|t|)
#log(uvar1)  0.0393871  0.0490891  0.8024   0.4229
#log(var1)  -0.0369356  0.0541029 -0.6827   0.4953
#log(var2)  -0.0455269  0.0543664 -0.8374   0.4030
#log(var3)   0.0150516  0.0520347  0.2893   0.7726
#log(var4)  -0.0034534  0.0441506 -0.0782   0.9377
#log(var5)  -0.0109038  0.0527446 -0.2067   0.8363
#
#Total Sum of Squares:    302.23
#Residual Sum of Squares: 300.6
#R-Squared:      0.0053896
#Adj. R-Squared: 0.0045407
#F-statistic: 0.304357 on 6 and 337 DF, p-value: 0.93448

Explanation:

The independent variables are of two type, first uvar1 and others var1...varN.

1) colnames(regDF)[grepl("^v",colnames(regDF))] this will give us a list of all variables in regDF which match pattern of beginning with letter 'v' with caret symbol signifying start of the string and $ as end of the string, output at this stage is c("var1","var2"...,"var5")

2) We need log variants of this variable vector hence we pass them through lapply to the function fn_appendLog, which results in the list output of list("log(var1)","log(var2)",...,"log(var5)")

3) Next, we need these variables transformed as log(var1)+log(var2)...+log(var5)

4) To do so, we use function Reduce with the function paste(x,y,sep="+"), this takes each element of the above list with adjacent element and joins together with the seperator as "+"

   step1 = (log(var1)+log(var2))
   step2 = (log(var1)+log(var2)) + log(var3)
   step3 = (log(var1)+log(var2)+log(var3))+ log(var4) and so on

5) The function Reduce applies the function to the list and aggregates the output into a single vector resulting the final output of log(var1)+log(var2)+log(var3)+log(var4)+log(var5)

This might seem intimidating at first but as you use them often and explore examples they will part of you repertoire in no time.The best way to learn about a function say lapply is to read the documentation of ?lapply end to end and execute listed examples, tinker with parameters and gain familiarity. Hope this sheds some light on your query.

来源：https://stackoverflow.com/questions/43209809/automate-regression-with-specific-dependent-and-independent-variables

标签

automation

regression

plm