Using R, apply multiple chi-square contingency table tests to a grouped data frame and add a new column containing the p values of the tests

Question

I have a data frame similar to the example below (which is a small extract of my actual data frame).

frequencies <- data.frame(sex=c("female", "female", "male", "male", "female", "female", "male", "male", "female", "female", "male", "male", "female", "female", "male", "male"),
                      ecotype=c("Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave", "Crab", "Wave"),
                      contig_ID=c("Contig100169_2367", "Contig100169_2367", "Contig100169_2367", "Contig100169_2367", "Contig100169_2367", "Contig100169_2367", "Contig100169_2367", "Contig100169_2367", 
                                  "Contig100169_2481", "Contig100169_2481", "Contig100169_2481", "Contig100169_2481", "Contig100169_2481", "Contig100169_2481", "Contig100169_2481", "Contig100169_2481"),
                      allele=c("p", "p", "p", "p", "q", "q", "q", "q", "p", "p", "p", "p", "q", "q", "q", "q"),
                      frequency=c(157, 98, 140, 65, 29, 8, 26, 9, 182, 108, 147, 80, 46, 4, 49, 4))

I would like to do separate chi-square contingency tests for each combination of ‘contig_ID’ and ‘ecotype’, testing the association between ‘sex’ and ‘allele’. I would then like to summarise the results of these in a table that includes the p value for each combination of ‘contig_ID’ and ‘ecotype’. For instance, from the example table given, I would expect a results table of 4 p values like the example below.

results <- data.frame(ecotype=c("Crab", "Wave", "Crab", "Wave"),
                  contig_ID=c("Contig100169_2367", "Contig100169_2367", "Contig100169_2481", "Contig100169_2481"),
                  pvalue=c("pval", "pval", "pval", "pval"))

Alternatively, just adding a p value column to the original table would also work, with the p value for each combination just repeated in all the relevant rows.

I have been attempting to use functions such as lapply() and summarise() in combination with chisq.test() to achieve this, but have had no luck so far. I have also tried a method similar to the one in "R chi squared test (3x2 contingency table) for each row in a table", but couldn't make this work either.

Answer 1:

We can group by the contig_ID and ecotype columns and create a nested data frame, converting the data for each group to a matrix as follows.

library(tidyverse)

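frequencies2 <- frequencies %>%
  # One group per contig_ID/ecotype combination
  group_by(contig_ID, ecotype) %>%
  # Collapse each group's remaining columns (sex, allele, frequency) into a list-column called data
  nest() %>%
  mutate(M = map(data, function(dat){
    # Reshape to one row per allele, with a column of counts for each sex
    dat2 <- dat %>% spread(sex, frequency)
    # Drop the allele column and keep the counts as a numeric matrix
    M <- as.matrix(dat2[, -1])
    row.names(M) <- dat2$allele
    return(M)
  }))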
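
In more recent versions of tidyr, spread() is superseded by pivot_wider(). The same nesting step could be sketched as below, assuming tidyr 1.0.0 or later. The helper name to_matrix and the object name nested are made up for this illustration and are not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Example data from the question, written compactly with rep()
frequencies <- data.frame(
  sex       = rep(rep(c("female", "male"), each = 2), 4),
  ecotype   = rep(c("Crab", "Wave"), 8),
  contig_ID = rep(c("Contig100169_2367", "Contig100169_2481"), each = 8),
  allele    = rep(rep(c("p", "q"), each = 4), 2),
  frequency = c(157, 98, 140, 65, 29, 8, 26, 9, 182, 108, 147, 80, 46, 4, 49, 4)
)

# Hypothetical helper: turn one group's rows into an allele x sex matrix
to_matrix <- function(dat) {
  wide <- pivot_wider(dat, names_from = sex, values_from = frequency)
  M <- as.matrix(wide[, c("female", "male")])
  rownames(M) <- wide$allele
  M
}

# nest() without group_by(): every column not listed defines a group
nested <- frequencies %>%
  nest(data = c(sex, allele, frequency)) %>%
  mutate(M = map(data, to_matrix))

nested$M[[1]]
```

Selecting the count columns by name, rather than dropping the first column by position, keeps the matrix correct even if the column order changes.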

Looking at the first element of the M column shows that the data from each group have been converted to a 2 x 2 matrix, with alleles as rows and sexes as columns.

frequencies2$M[[1]]
#   female male
# p    157  140
# q     29   26
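
As a cross-check on a single group, the same table can be built in base R with xtabs() and passed straight to chisq.test(). This is only a sketch for one contig_ID/ecotype combination; the names one_group and tab are made up for illustration.

```r
# Example data from the question, written compactly with rep()
frequencies <- data.frame(
  sex       = rep(rep(c("female", "male"), each = 2), 4),
  ecotype   = rep(c("Crab", "Wave"), 8),
  contig_ID = rep(c("Contig100169_2367", "Contig100169_2481"), each = 8),
  allele    = rep(rep(c("p", "q"), each = 4), 2),
  frequency = c(157, 98, 140, 65, 29, 8, 26, 9, 182, 108, 147, 80, 46, 4, 49, 4)
)

# Keep only the rows for one contig_ID/ecotype combination
one_group <- subset(frequencies,
                    contig_ID == "Contig100169_2367" & ecotype == "Crab")

# Cross-tabulate the counts: alleles as rows, sexes as columns
tab <- xtabs(frequency ~ allele + sex, data = one_group)
tab

# Chi-square test of association between allele and sex for this group
test <- chisq.test(tab)
test$p.value
```

For 2 x 2 tables, chisq.test() applies Yates' continuity correction by default; passing correct = FALSE turns it off if the uncorrected statistic is wanted.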

From here, we can apply chisq.test() to each matrix and pull out the p value. frequencies3 is the final output.

frequencies3 <- frequencies2 %>%
  mutate(pvalue = map_dbl(M, ~chisq.test(.x)$p.value)) %>%
  select(-data, -M) %>%
  ungroup()
frequencies3
# # A tibble: 4 x 3
#   contig_ID         ecotype pvalue
#   <fct>             <fct>    <dbl>
# 1 Contig100169_2367 Crab     1.00 
# 2 Contig100169_2367 Wave     0.434
# 3 Contig100169_2481 Crab     0.284
# 4 Contig100169_2481 Wave     0.958
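
The question also mentions the alternative of adding the p value as a column to the original table, repeated across the rows of each group. One possible sketch, assuming only dplyr is loaded, uses a grouped mutate() and builds each 2 x 2 table with tapply(). The name with_p is made up for illustration. Note that chisq.test() may warn that the approximation may be incorrect for groups with small expected counts, such as the Wave rows of Contig100169_2481.

```r
library(dplyr)

# Example data from the question, written compactly with rep()
frequencies <- data.frame(
  sex       = rep(rep(c("female", "male"), each = 2), 4),
  ecotype   = rep(c("Crab", "Wave"), 8),
  contig_ID = rep(c("Contig100169_2367", "Contig100169_2481"), each = 8),
  allele    = rep(rep(c("p", "q"), each = 4), 2),
  frequency = c(157, 98, 140, 65, 29, 8, 26, 9, 182, 108, 147, 80, 46, 4, 49, 4)
)

with_p <- frequencies %>%
  group_by(contig_ID, ecotype) %>%
  # tapply() builds the allele x sex table of counts within each group;
  # the single p value is recycled to every row of that group
  mutate(pvalue = chisq.test(tapply(frequency, list(allele, sex), sum))$p.value) %>%
  ungroup()

with_p
```

The result keeps all 16 original rows, so it can be filtered or plotted alongside the raw counts without a separate join.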

Source: https://stackoverflow.com/questions/49659103/using-r-apply-multiple-chi-square-contingency-table-tests-to-a-grouped-data-fra

Tags

tidyverse

chi-squared