Question
I'm trying to bootstrap a bivariate correlation grouped by multiple variables in a tidy fashion. So far I've got:
paks <- c('dplyr','tidyr','broom')
lapply(paks, require, character.only=TRUE)
set.seed(123)
df <- data.frame(
  rep(c('group1','group2','group3','group4'),25),
  rep(c('subgroup1','subgroup2','subgroup3','subgroup4'),25),
  rnorm(25),
  rnorm(25)
)
colnames(df) <- c('group','subgroup','v1','v2')
cors_boot <- df %>%
  group_by(., group,subgroup) %>%
  bootstrap(., 10) %>%
  do(tidy(cor.test(.$v1,.$v2)))
cors_boot
This will successfully run 10 replications, but it will not maintain the group_by
conditions. Any help would be appreciated.
Answer 1:
One option is to make use of nested tibbles (using nest() from tidyr) and iterate over them with functions from the purrr package. Here's an example:
df %>%
  nest(-group, -subgroup) %>%
  mutate(cors_boot = map(data, ~ bootstrap(., 10) %>% do(tidy(cor.test(.$v1,.$v2))))) %>%
  unnest(cors_boot)
#> # A tibble: 40 × 11
#>     group  subgroup replicate   estimate statistic    p.value parameter
#>    <fctr>    <fctr>     <int>      <dbl>     <dbl>      <dbl>     <int>
#> 1  group1 subgroup1         1 0.30199080 1.5192285 0.14233305        23
#> 2  group1 subgroup1         2 0.24782068 1.2267744 0.23231801        23
#> 3  group1 subgroup1         3 0.05697887 0.2737057 0.78675375        23
#> 4  group1 subgroup1         4 0.14141925 0.6851084 0.50012255        23
#> 5  group1 subgroup1         5 0.14769543 0.7161768 0.48109119        23
#> 6  group1 subgroup1         6 0.23407050 1.1546390 0.26009439        23
#> 7  group1 subgroup1         7 0.09388988 0.4522780 0.65530564        23
#> 8  group1 subgroup1         8 0.38602977 2.0068956 0.05665478        23
#> 9  group1 subgroup1         9 0.20248790 0.9916399 0.33169177        23
#> 10 group1 subgroup1        10 0.27430083 1.3679706 0.18453909        23
#> # ... with 30 more rows, and 4 more variables: conf.low <dbl>,
#> #   conf.high <dbl>, method <fctr>, alternative <fctr>
Note that the data setup is the same as in the question, except that the purrr package is also loaded:
paks <- c('dplyr','tidyr','broom','purrr')
lapply(paks, require, character.only=TRUE)
set.seed(123)
df <- data.frame(
  rep(c('group1','group2','group3','group4'),25),
  rep(c('subgroup1','subgroup2','subgroup3','subgroup4'),25),
  rnorm(25),
  rnorm(25)
)
colnames(df) <- c('group','subgroup','v1','v2')
As an aside, if nested tibbles are new to you, I've written a little about them in some blog posts.
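In case bootstrap() is not available in your broom version (as far as I know it was deprecated in later releases), the same nest-and-iterate idea can be sketched with plain row resampling. This is only a sketch under assumptions: boot_cor and its n_boot argument are hypothetical names introduced here, the nest(data = ...) form assumes tidyr 1.0 or later, and v1/v2 are drawn with rnorm(100) so the groups actually differ.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(123)
df <- data.frame(
  group    = rep(c('group1','group2','group3','group4'), 25),
  subgroup = rep(c('subgroup1','subgroup2','subgroup3','subgroup4'), 25),
  v1 = rnorm(100),
  v2 = rnorm(100)
)

# Hypothetical helper (not part of broom): resample the rows of one group's
# data with replacement n_boot times and tidy the correlation test each time.
boot_cor <- function(d, n_boot = 10) {
  bind_rows(lapply(seq_len(n_boot), function(i) {
    resampled <- d[sample(nrow(d), replace = TRUE), ]
    res <- tidy(cor.test(resampled$v1, resampled$v2))
    res$replicate <- i
    res
  }))
}

cors_boot <- df %>%
  nest(data = -c(group, subgroup)) %>%      # one row per group/subgroup pair
  mutate(cors_boot = map(data, boot_cor)) %>%
  select(-data) %>%
  unnest(cors_boot)

cors_boot
```

Because the resampling happens inside each nested data frame, the group and subgroup columns are carried along without any regrouping.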
Answer 2:
It appears that after the bootstrap function, the data is grouped by bootstrap replicate instead of by group and subgroup:
df %>%
  group_by(group,subgroup) %>%
  bootstrap(10, by_group=TRUE)
# Source: local data frame [100 x 4]
# Groups: replicate [10]
Hence, you will need to regroup after bootstrap. (Please note that v1 and v2 in your df are recycled, so the values returned from cor.test are the same for each combination of group and subgroup. I changed v1 and v2 in the example below as a sanity check.)
set.seed(123)
df <- data.frame(
  group=rep(c('group1','group2','group3','group4'), 25),
  subgroup=rep(c('subgroup1','subgroup2','subgroup3','subgroup4'), 25),
  v1=rnorm(100),
  v2=rnorm(100)
)
cors_boot <- df %>%
  group_by(group,subgroup) %>%
  bootstrap(10, by_group=TRUE) %>%
  group_by(group, subgroup) %>% #add in this line to make your code work
  do(tidy(cor.test(.$v1,.$v2)))
cors_boot
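The recycling of v1 and v2 in the question's data can be checked directly in base R. The sketch below rebuilds the question's data frame (df_q is a name introduced here for illustration, and the subgroup column is left out because it mirrors group) and compares the per-group correlations.

```r
set.seed(123)
# Same construction as in the question: 100 group labels but only 25 draws each
df_q <- data.frame(
  group = rep(c('group1','group2','group3','group4'), 25),
  v1 = rnorm(25),
  v2 = rnorm(25)
)

# data.frame() recycles the 25 draws four times to fill the 100 rows
nrow(df_q)
length(unique(df_q$v1))

# Every group ends up holding the same 25 (v1, v2) pairs,
# so the per-group correlations agree up to floating-point error
cors <- sapply(split(df_q, df_q$group), function(d) cor(d$v1, d$v2))
cors
```

With rnorm(100), as in the corrected data above, each group receives its own draws and the correlations differ.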
Source: https://stackoverflow.com/questions/42986736/bootstrapping-by-multiple-groups-in-dplyr