Question
I have seen a lot of similar questions, but I am missing one key piece of the loop I am trying to write. I have a dataset with ~4,000 different keys, and for each key there are ~1,000 observations. I filtered the data to isolate the observations for one key, ran a linear regression, and checked the model assumptions; all looks good. Now I want to loop over the dataset and run that linear regression for each key, then store the coefficients, p-values, R^2, etc. and review them together.
Here is a sample of my data:
Key y1 x1 x2
A 10 1 3
A 11 2 4
A 12 3 5
B 13 4 6
B 14 5 7
B 15 6 8
C 16 7 9
C 17 8 1
C 18 9 2
I have run:
datA <- data %>% filter(Key == 'A')
lm(y1 ~ x1 + x2, data = datA)
I then repeated that for keys B and C. Every question I have seen here loops over different variables for the entire dataset rather than splitting the data by rows.
But I need to do this 4,000 more times. Any assistance to write this loop would be greatly appreciated (I am terrible at writing loops).
Answer 1:
You can also use the broom package to tidy the output into a more readable form.
list_models <- lapply(split(data, data$Key), function(x) lm(y1 ~ x1 + x2, data = x))
library(broom)
as_tibble(do.call(rbind, lapply(list_models, broom::tidy)))
# A tibble: 7 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 9.00e+ 0 2.22e-15 4.05e15 1.57e-16
2 x1 1.00e+ 0 1.03e-15 9.73e14 6.54e-16
3 (Intercept) 9.00e+ 0 4.59e-15 1.96e15 3.25e-16
4 x1 1.00e+ 0 9.06e-16 1.10e15 5.77e-16
5 (Intercept) 9.00e+ 0 NaN NaN NaN
6 x1 1.00e+ 0 NaN NaN NaN
7 x2 3.02e-16 NaN NaN NaN
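Note that the rbind() call above does not record which key each row came from. Here is a minimal sketch of one way to keep it, assuming dplyr is installed alongside broom: bind_rows() with .id = "Key" copies the list names into a column, and broom::glance() adds model-level statistics such as R^2. The simulated data and the names coef_table and fit_table are illustrative assumptions, not part of the original answer.

```r
library(broom)
library(dplyr)

# Hypothetical example data: 3 keys, 20 observations each (not the asker's data)
set.seed(1)
data <- data.frame(
  Key = rep(c("A", "B", "C"), each = 20),
  x1  = rnorm(60),
  x2  = rnorm(60)
)
data$y1 <- 1 + 2 * data$x1 - data$x2 + rnorm(60)

# One model per key; split() names the list elements by key
list_models <- lapply(split(data, data$Key),
                      function(d) lm(y1 ~ x1 + x2, data = d))

# Coefficient-level results: one row per key and term
coef_table <- bind_rows(lapply(list_models, tidy), .id = "Key")

# Model-level results (r.squared, adj.r.squared, ...): one row per key
fit_table <- bind_rows(lapply(list_models, glance), .id = "Key")
```

coef_table can then be filtered by term or key, and fit_table joined to it on Key if both levels are needed in one place.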
Answer 2:
You could split the data and apply lm to each chunk.
list_models <- lapply(split(df, df$Key), function(x) lm(y1 ~ x1 + x2, data = x))
A tidyverse way would be:
library(dplyr)
library(purrr)
list_models <- df %>% group_split(Key) %>% map(~lm(y1 ~ x1 + x2, data = .x))
Either way, you get one model for each individual Key. With the split() version, the list elements are named by key:
list_models
#$A
#Call:
#lm(formula = y1 ~ x1 + x2, data = x)
#Coefficients:
#(Intercept) x1 x2
# 9 1 NA
#$B
#Call:
#lm(formula = y1 ~ x1 + x2, data = x)
#Coefficients:
#(Intercept) x1 x2
# 9 1 NA
#$C
#Call:
#lm(formula = y1 ~ x1 + x2, data = x)
#Coefficients:
#(Intercept) x1 x2
# 9.00e+00 1.00e+00 7.86e-16
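Once the models are in a named list, the values the question asks for can be pulled out with sapply(). The following is a rough base-R sketch on simulated data; the data and the names coef_mat, r2 and p_x1 are assumptions made for illustration.

```r
# Hypothetical example data (not the asker's data)
set.seed(42)
df <- data.frame(
  Key = rep(c("A", "B", "C"), each = 30),
  x1  = rnorm(90),
  x2  = rnorm(90)
)
df$y1 <- 3 + 0.5 * df$x1 + 1.5 * df$x2 + rnorm(90)

# Same approach as in the answer: one lm per key
list_models <- lapply(split(df, df$Key),
                      function(x) lm(y1 ~ x1 + x2, data = x))

# Coefficients: one row per key, one column per term
coef_mat <- t(sapply(list_models, coef))

# R-squared per key, as a named numeric vector
r2 <- sapply(list_models, function(m) summary(m)$r.squared)

# p-value of x1 per key, read from the coefficient table of summary()
p_x1 <- sapply(list_models,
               function(m) coef(summary(m))["x1", "Pr(>|t|)"])
```

Because the list is named, coef_mat, r2 and p_x1 all carry the keys as row names or element names, so they can be combined with cbind() for review.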
Answer 3:
This is much less elegant than @RonakShah's answer, but you can group by the key and use summarise to extract the values of interest for each key, which gives the following table:
library(dplyr)
df %>%
  group_by(Key) %>%
  summarise(Intercept = lm(y1 ~ x1 + x2)$coefficients[1],
            Coeff_x1 = lm(y1 ~ x1 + x2)$coefficients[2],
            Coeff_x2 = lm(y1 ~ x1 + x2)$coefficients[3],
            R2 = summary(lm(y1 ~ x1 + x2))$r.squared,
            pvalue = summary(lm(y1 ~ x1 + x2))$coefficients["x1", 4])
# A tibble: 3 x 6
Key Intercept Coeff_x1 Coeff_x2 R2 pvalue
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 9. 1.00 NA 1 8.00e-16
2 B 9. 1.00 NA 1 7.00e-16
3 C 9. 1.00 7.86e-16 1 NaN
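The summarise() call above fits the same model five times per key, which adds up across ~4,000 keys. As a sketch of an alternative, the base-R version below fits each model once and returns one row per key. The helper fit_one and the simulated data are hypothetical and not part of the original answer.

```r
# Hypothetical example data (not the asker's data)
set.seed(7)
df <- data.frame(
  Key = rep(c("A", "B", "C"), each = 25),
  x1  = rnorm(75),
  x2  = rnorm(75)
)
df$y1 <- 2 + df$x1 + 0.5 * df$x2 + rnorm(75)

# Hypothetical helper: fit once, then pull every value from that single fit
fit_one <- function(d) {
  fit <- lm(y1 ~ x1 + x2, data = d)
  s <- summary(fit)
  data.frame(
    Key       = d$Key[1],
    Intercept = unname(coef(fit)["(Intercept)"]),
    Coeff_x1  = unname(coef(fit)["x1"]),
    Coeff_x2  = unname(coef(fit)["x2"]),
    R2        = s$r.squared,
    pvalue    = s$coefficients["x1", "Pr(>|t|)"]
  )
}

# One row per key, stacked into a single data frame
results <- do.call(rbind, lapply(split(df, df$Key), fit_one))
```

The resulting columns mirror the table above, so results can be sorted or filtered on R2 or pvalue to review all keys together.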
Source: https://stackoverflow.com/questions/60962181/splitting-data-and-running-linear-regression-loop