Writing loop/function to generate various linear regressions on same dataframe

问题

I am writing loops or functions in R, and I still haven't really understood how to do that. Currently, I need to write a loop/function (not sure which one would be better) to create several linear regression models within the same data frame.

I have data like this:

dataset <- read.table(text = 
"ID  A_2 B_2 C_2 A_1 B_1 C_1 AGE
M1  10  6   6   8   8   9   25
M2  50  69  54  67  22  44  16
M3  5   80  44  78  5   55  18
M4  60  70  52  89  3   56  28
M5  60  5   34  90  80  56  34
M6  55  55  67  60  100 77  54", header = TRUE, stringsAsFactors = FALSE)

I am building models like this:

model1 <- lm(A_2~A_1+age, data=dataset)

model2 <- lm(B_2~B_1+age, data=dataset)

model3 <- lm(C_2~C_1+age, data=dataset)

I need to write a loop which:

takes variable _2 (the dependent variable) and variable _1 (independent variable) and covariates like age ...
creates the lm models, and stores outputs (i.e, T-value, p-value, confidence intervals etc) in a data.frame that I can then print.

Dep_va  Ind_var Convarites  Pvalue  "upper.cI" "low.cI" 

A_2 A_1 age         
B_2 B_1 age         
C_2 C_1 age         
D_2 D_1 age

回答1:

Here is a base R approach to the problem with lapply loops.

First if you want to automatically extract the variable names ending in _2 which should be all of your dependent variables you could implement the following code:

dep_vars<-grep("_2$",colnames(dataset),value = T) #This selects all variables ending in `_2` which should all be dependent variables.

reg_vars<-gsub("_2$","",dep_vars) #This removes the `_2` from the dependent variables which should give you the common stem which can be used to select both dependent and independent variables from your data frame.

Then you can use this in your lapply loop for creating your formulas:

full_results <- lapply(reg_vars, function(i) summary(lm(paste0("log(",i,"_2)~",i,"_1+AGE"),data=dataset)))

Now you will have a list of linear regression summaries where you can extract the info you want. I think this is what you want for the output but please clarify if not:

print_results<-lapply(full_results,function(i) data.frame(
                                            Dep_va = names(attributes(i[["terms"]])$dataClasses)[1], 
                                            Ind_var = names(attributes(i[["terms"]])$dataClasses)[2],
                                            Covariates = names(attributes(i[["terms"]])$dataClasses)[3], 
                                            Pvalue = i[["coefficients"]][2,4],
                                            upper.cI = i[["coefficients"]][2,1]+1.96*i[["coefficients"]][2,2],
                                            low.cI = i[["coefficients"]][2,1]-1.96*i[["coefficients"]][2,2]))

That code will give you a list of data frames and if you want to combine it into one data.frame:

final_results<-do.call("rbind",print_results)

Output Results:

Dep_va Ind_var Covariates     Pvalue upper.cI     low.cI
1    A_2     A_1        AGE 0.25753805 1.113214 -0.1877324
2    B_2     B_1        AGE 0.68452053 1.211355 -1.9292236
3    C_2     C_1        AGE 0.04827506 1.497688  0.3661343

Hope that helps! Let me know if you were looking for different output results.

回答2:

Here's a tidyverse approach:

library(tidyverse)

dataset %>% 
  gather(col, val, -ID, -AGE) %>%
  separate(col, c("name", "num")) %>%
  spread(num, val) %>%
  group_by(name) %>%
  group_map(~lm(`2` ~ `1` + AGE, data = .x) %>% broom::tidy())

# A tibble: 9 x 6
# Groups:   name [3]
  name  term        estimate std.error statistic p.value
  <chr> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 A     (Intercept)  -15.0      31.5      -0.477  0.666 
2 A     `1`            0.463     0.332     1.39   0.258 
3 A     AGE            0.851     0.731     1.16   0.329 
4 B     (Intercept)   49.1      52.5       0.935  0.419 
5 B     `1`           -0.359     0.801    -0.448  0.685 
6 B     AGE            0.391     2.47      0.159  0.884 
7 C     (Intercept)    5.42     13.9       0.390  0.723 
8 C     `1`            0.932     0.289     3.23   0.0483
9 C     AGE           -0.299     0.470    -0.635  0.570

Explanation:

Move data to long/tidy format with gather()
separate the variable category (in the example data, A, B, etc)
Create a separate column for the IV and DV with spread()
Use group_by() and group_map() to apply lm() to each variable category.

回答3:

Given your particular question, where you have variable names with common bases that are labelled with _2 and _1 to designate the dependent and independent variables respectively, you could solve your issue as follows:

var.names <- names(dataset[!names(dataset) %in% c("ID","AGE")])
names.list <- strsplit(var.names,split = "_")
list.of.models <- list()
for (i in 1:length(names.list)) {
DV <- grep(names.list[[i]][1], names(dataset))[1]
IV <- grep(names.list[[i]][1], names(dataset))[2]
  list.of.models[[i]] <- lm(dataset[,DV] ~ dataset[,IV] + AGE, data = dataset)
}

summary(list.of.models[[1]])

Call:
lm(formula = dataset[, DV] ~ dataset[, IV] + AGE, data = dataset)

Residuals:
        1         2         3         4         5         6 
  0.07496  20.42938 -31.36213  10.04093   4.47412  -3.65725 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        -15.0439    31.5482  -0.477    0.666
dataset[, IV]        0.4627     0.3319   1.394    0.258
AGE                  0.8507     0.7314   1.163    0.329

Residual standard error: 22.62 on 3 degrees of freedom
Multiple R-squared:  0.5276,    Adjusted R-squared:  0.2127 
F-statistic: 1.676 on 2 and 3 DF,  p-value: 0.3246

来源：https://stackoverflow.com/questions/56120159/writing-loop-function-to-generate-various-linear-regressions-on-same-dataframe

标签

loops

statistics