Bootstrapping by multiple groups in dplyr

Question

I'm trying to bootstrap a bivariate correlation grouped by multiple variables in a tidy fashion. So far I've got:

paks <- c('dplyr','tidyr','broom')
lapply(paks, require, character.only=TRUE)
set.seed(123)

df <- data.frame(
  rep(c('group1','group2','group3','group4'),25),
  rep(c('subgroup1','subgroup2','subgroup3','subgroup4'),25),
  rnorm(25),
  rnorm(25)
)
colnames(df) <- c('group','subgroup','v1','v2') 

cors_boot <- df %>%
  group_by(., group,subgroup) %>% 
  bootstrap(., 10) %>% 
  do(tidy(cor.test(.$v1,.$v2)))
cors_boot

This successfully runs 10 replications, but it does not maintain the group_by conditions. Any help would be appreciated.

Answer 1:

One option is to use nested tibbles (via nest() from tidyr) and iterate over them with functions from the purrr package. Here's an example:

df %>% 
  nest(-group, -subgroup) %>% 
  mutate(cors_boot = map(data, ~ bootstrap(., 10) %>% do(tidy(cor.test(.$v1,.$v2))))) %>% 
  unnest(cors_boot)
#> # A tibble: 40 × 11
#>     group  subgroup replicate   estimate statistic    p.value parameter
#>    <fctr>    <fctr>     <int>      <dbl>     <dbl>      <dbl>     <int>
#> 1  group1 subgroup1         1 0.30199080 1.5192285 0.14233305        23
#> 2  group1 subgroup1         2 0.24782068 1.2267744 0.23231801        23
#> 3  group1 subgroup1         3 0.05697887 0.2737057 0.78675375        23
#> 4  group1 subgroup1         4 0.14141925 0.6851084 0.50012255        23
#> 5  group1 subgroup1         5 0.14769543 0.7161768 0.48109119        23
#> 6  group1 subgroup1         6 0.23407050 1.1546390 0.26009439        23
#> 7  group1 subgroup1         7 0.09388988 0.4522780 0.65530564        23
#> 8  group1 subgroup1         8 0.38602977 2.0068956 0.05665478        23
#> 9  group1 subgroup1         9 0.20248790 0.9916399 0.33169177        23
#> 10 group1 subgroup1        10 0.27430083 1.3679706 0.18453909        23
#> # ... with 30 more rows, and 4 more variables: conf.low <dbl>,
#> #   conf.high <dbl>, method <fctr>, alternative <fctr>

Note that the data setup is the same as in the question, except that the purrr package is also loaded:

paks <- c('dplyr','tidyr','broom','purrr')
lapply(paks, require, character.only=TRUE)
set.seed(123)

df <- data.frame(
  rep(c('group1','group2','group3','group4'),25),
  rep(c('subgroup1','subgroup2','subgroup3','subgroup4'),25),
  rnorm(25),
  rnorm(25)
)
colnames(df) <- c('group','subgroup','v1','v2')

As an aside, if nested tibbles are new to you, I've written a little about them in some blog posts. E.g., here.
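
For readers on newer package versions: as far as I know, bootstrap() was deprecated in later broom releases, and tidyr 1.0+ prefers nest(data = ...) over nest(-group, -subgroup). Below is a minimal sketch of the same nested approach that resamples with dplyr's slice_sample() instead. The helper name boot_cor and its n_boot argument are made up for illustration, and the data uses rnorm(100) rather than the question's recycled rnorm(25), so this is not the original answer's exact method.

```r
library(dplyr)   # slice_sample(), bind_rows(), mutate(), select()
library(tidyr)   # nest(), unnest()
library(purrr)   # map()
library(broom)   # tidy()

set.seed(123)
df <- data.frame(
  group    = rep(c("group1", "group2", "group3", "group4"), 25),
  subgroup = rep(c("subgroup1", "subgroup2", "subgroup3", "subgroup4"), 25),
  v1 = rnorm(100),  # non-recycled values, so the groups differ
  v2 = rnorm(100)
)

# Hypothetical helper: resample the rows of one group's data n_boot times
# (with replacement) and tidy the correlation test from each resample.
boot_cor <- function(d, n_boot = 10) {
  reps <- lapply(seq_len(n_boot), function(i) {
    resampled <- slice_sample(d, prop = 1, replace = TRUE)
    tidy(cor.test(resampled$v1, resampled$v2))
  })
  # .id records which resample each row came from
  bind_rows(reps, .id = "replicate")
}

cors_boot <- df %>%
  nest(data = -c(group, subgroup)) %>%   # one row per group/subgroup
  mutate(cors = map(data, boot_cor)) %>% # bootstrap within each group
  select(-data) %>%
  unnest(cors)
cors_boot
```

Because the resampling happens inside each nested data frame, the group and subgroup columns are carried along automatically, giving one row per replicate per group/subgroup combination.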

Answer 2:

It appears that after bootstrap(), the result is grouped by bootstrap replicate instead of by group and subgroup:

df %>%
    group_by(group,subgroup) %>% 
    bootstrap(10, by_group=TRUE)
# Source: local data frame [100 x 4]
# Groups: replicate [10]

Hence, you will need to regroup after bootstrap(). (Please note that v1 and v2 in your df are recycled, so the values returned from cor.test() are the same for each combination of group and subgroup. I changed v1 and v2 in the example below as a sanity check.)

set.seed(123)
df <- data.frame(
    group=rep(c('group1','group2','group3','group4'), 25),
    subgroup=rep(c('subgroup1','subgroup2','subgroup3','subgroup4'), 25),
    v1=rnorm(100),
    v2=rnorm(100)
)

cors_boot <- df %>%
    group_by(group,subgroup) %>% 
    bootstrap(10, by_group=TRUE) %>% 
    group_by(group, subgroup) %>%  # add this line to make your code work
    do(tidy(cor.test(.$v1,.$v2)))
cors_boot
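
The recycling issue mentioned in this answer can be checked directly. The following is a small sketch in base R, assuming the question's original setup (rnorm(25) recycled across 100 rows); the subgroup column is left out because it mirrors group. Since 25 and 4 share no common factor, every group ends up with the same 25 (v1, v2) pairs, so the per-group correlations coincide.

```r
set.seed(123)
df_recycled <- data.frame(
  group = rep(c("group1", "group2", "group3", "group4"), 25),
  v1 = rnorm(25),  # 25 values recycled to fill 100 rows
  v2 = rnorm(25)
)

# Correlation of v1 and v2 within each group
cors <- sapply(split(df_recycled, df_recycled$group),
               function(d) cor(d$v1, d$v2))
cors

# All four groups give the same correlation (up to floating-point error)
isTRUE(all.equal(unname(cors), rep(cors[[1]], 4)))
```

Switching to rnorm(100), as in the example above, gives each group its own values and makes the grouped results distinguishable.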

Source: https://stackoverflow.com/questions/42986736/bootstrapping-by-multiple-groups-in-dplyr

Tags

dplyr

broom