Get `chisq.test()$p.value` for several groups using `dplyr::group_by()`

半世苍凉 submitted on 2019-12-12 07:23:01

Question


I'm trying to run a chi-square test on several groups within the dplyr framework. The problem is that group_by() %>% summarise() doesn't seem to do the trick.

Simulated data (same structure as problematic data, but random, so p.values should be high)

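library(dplyr); library(magrittr)  # magrittr supplies %$% used below
set.seed(1)
# sample()'s third positional argument is `replace`, so it is spelled out here
foo <- data.frame(partido = sample(c("PRI", "PAN"), 100, replace = TRUE),
                  genero  = sample(c("H", "M"), 100, replace = TRUE),
                  GM      = sample(c("Bajo", "Muy bajo"), 100, replace = TRUE))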

I want to compare several groups defined by GM to see whether the p-value for the crosstab of partido and genero changes, conditional on GM.

The obvious dplyr way should be:

foo %>% 
  group_by(GM) %>% 
  summarise(pvalue=chisq.test(.$partido, .$genero)$p.value)  #just the p.value, so summarise is happy

But I get the p-value for the ungrouped data, repeated twice, instead of one p-value per table:

# A tibble: 2 × 2
        GM    pvalue
    <fctr>     <dbl>
1     Bajo 0.8660521
2 Muy bajo 0.8660521

Testing each group using filter I get:

foo %>% 
  filter(GM=="Bajo") %$% 
  table(partido, genero) %>% 
  chisq.test()

Returns: X-squared = 0.015655, df = 1, p-value = 0.9004

foo %>% 
  filter(GM=="Muy bajo") %$% 
  table(partido, genero) %>% chisq.test()

Returns: X-squared = 0.50409, df = 1, p-value = 0.4777

dplyr::summarise() works with functions that take more than one argument, so this shouldn't be the problem:

data.frame(a=1:10, b=10:1, c=sample(c("Grupo 1", "Grupo 2"), 10, replace=TRUE)) %>% 
    group_by(c) %>% 
    summarise(r=cor(a, b))

works like a charm. It just doesn't seem to work with chisq.test.

I managed to get what I wanted with nested models using tidyr::nest() and purrr::map(), but I find the code cumbersome, at least for my students. In fact, I've invested many hours teaching them dplyr (they are a group that struggles with math and programming) so they could avoid vector functions as much as possible.

foo %>% 
  nest(-GM) %>% 
  mutate(tabla=map(data, ~table(.))) %>% 
  mutate(pvalue=map(tabla, ~chisq.test(.)$p.value)) %>% 
  select(GM, pvalue) %>% 
  unnest()

# A tibble: 2 × 2
        GM    pvalue
    <fctr>     <dbl>
1     Bajo 0.9004276
2 Muy bajo 0.4777095

do() does the trick too (tidy() comes from the broom package):

foo %>% 
  group_by(GM) %>% 
  do(tidy(chisq.test(.$partido, .$genero)))

Source: local data frame [2 x 5]
Groups: GM [2]

        GM statistic   p.value parameter
    <fctr>     <dbl>     <dbl>     <int>
1     Bajo 0.0156553 0.9004276         1
2 Muy bajo 0.5040878 0.4777095         1
# ... with 1 more variables: method <fctr>

as in: Fisher's and Pearson's test for independence
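
For reference, newer dplyr releases mark do() as superseded. A minimal sketch of the same per-group computation with group_modify() might look like the following. The data frame foo2 is a hypothetical stand-in for foo (larger, so every 2 x 2 table is filled), and the snippet assumes a dplyr version that provides group_modify().

```r
library(dplyr)

set.seed(1)
# Hypothetical stand-in for `foo`; names and sizes are assumptions
foo2 <- data.frame(
  partido = sample(c("PRI", "PAN"), 200, replace = TRUE),
  genero  = sample(c("H", "M"), 200, replace = TRUE),
  GM      = sample(c("Bajo", "Muy bajo"), 200, replace = TRUE)
)

# Inside group_modify(), `.x` is the data frame for the current group only
by_group <- foo2 %>%
  group_by(GM) %>%
  group_modify(~ tibble(p.value = chisq.test(.x$partido, .x$genero)$p.value))

# Manual check for one group, computed without any grouping machinery
bajo <- filter(foo2, GM == "Bajo")
manual_bajo <- chisq.test(bajo$partido, bajo$genero)$p.value

print(by_group)
```

Unlike the dot in a plain pipe, .x here is already restricted to the current group, so the .x$ accessors are safe.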

But why doesn't group_by() work with summarise(chisq.test()$p.value)?


Answer 1:


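In dplyr you can generally use unquoted variable names to access the relevant columns, whether or not the data is grouped. Inside a pipe, the dot (.) is the whole data frame passed along by %>%, not the current group, so .$partido and .$genero always return the full, ungrouped columns; that is why both rows showed the same p-value. Removing the unnecessary .$ accessors, I get: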

foo %>% 
    group_by(GM) %>% 
    summarise(pvalue= chisq.test(partido, genero)$p.value) 

# A tibble: 2 × 2
        GM    pvalue
    <fctr>     <dbl>
1     Bajo 0.9004276
2 Muy bajo 0.4777095
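
The difference between the two spellings can be seen on a tiny deterministic example. This is only an illustrative sketch: the data frame toy and its columns g and x are hypothetical, and it assumes the magrittr pipe (%>%) that dplyr re-exports.

```r
library(dplyr)

# Hypothetical toy data: 4 rows in group "a", 2 rows in group "b"
toy <- data.frame(g = c("a", "a", "a", "a", "b", "b"),
                  x = 1:6)

res <- toy %>%
  group_by(g) %>%
  summarise(n_dot  = length(.$x),  # `.` is the whole piped data frame
            n_bare = length(x))    # `x` is the current group's slice

print(res)
```

Here n_dot is 6 for both groups because .$x is the full column, while n_bare is 4 and 2, the actual group sizes. The same mechanism makes chisq.test(.$partido, .$genero) return the ungrouped p-value in every row.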


Source: https://stackoverflow.com/questions/42991993/get-chisq-testp-value-for-several-groups-using-dplyrgroup-by
