How do I get a contingency table?

问题

I am trying to create a contingency table from a particular type of data. This would be doable with loops etc... but because my final table would contain more than 10E5 cells, I am looking for a pre-existing function.

My initial data are as follow:

PLANT                  ANIMAL                          INTERACTIONS
---------------------- ------------------------------- ------------
Tragopogon_pratensis   Propylea_quatuordecimpunctata         1
Anthriscus_sylvestris  Rhagonycha_nigriventris               3
Anthriscus_sylvestris  Sarcophaga_carnaria                   2
Heracleum_sphondylium  Sarcophaga_carnaria                   1
Anthriscus_sylvestris  Sarcophaga_variegata                  4
Anthriscus_sylvestris  Sphaerophoria_interrupta_Gruppe       3
Cerastium_holosteoides Sphaerophoria_interrupta_Gruppe       1

I would like to create a table like this:

                       Propylea_quatuordecimpunctata Rhagonycha_nigriventris Sarcophaga_carnaria Sarcophaga_variegata Sphaerophoria_interrupta_Gruppe
---------------------- ----------------------------- ----------------------- ------------------- -------------------- -------------------------------
Tragopogon_pratensis   1                             0                       0                   0                    0
Anthriscus_sylvestris  0                             3                       2                   4                    3
Heracleum_sphondylium  0                             0                       1                   0                    0
Cerastium_holosteoides 0                             0                       0                   0                    1

That is, all plant species in row, all animal species in columns, and sometimes there are no interactions (while my initial data only list interactions that occur).

回答1:

In base R, use table or xtabs:

with(warpbreaks, table(wool, tension))

    tension
wool L M H
   A 9 9 9
   B 9 9 9

xtabs(~wool+tension, data=warpbreaks)

    tension
wool L M H
   A 9 9 9
   B 9 9 9

The gmodels packages has a function CrossTable that gives output similar to what users of SPSS or SAS expects:

library(gmodels)
with(warpbreaks, CrossTable(wool, tension))


   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|


Total Observations in Table:  54 


             | tension 
        wool |         L |         M |         H | Row Total | 
-------------|-----------|-----------|-----------|-----------|
           A |         9 |         9 |         9 |        27 | 
             |     0.000 |     0.000 |     0.000 |           | 
             |     0.333 |     0.333 |     0.333 |     0.500 | 
             |     0.500 |     0.500 |     0.500 |           | 
             |     0.167 |     0.167 |     0.167 |           | 
-------------|-----------|-----------|-----------|-----------|
           B |         9 |         9 |         9 |        27 | 
             |     0.000 |     0.000 |     0.000 |           | 
             |     0.333 |     0.333 |     0.333 |     0.500 | 
             |     0.500 |     0.500 |     0.500 |           | 
             |     0.167 |     0.167 |     0.167 |           | 
-------------|-----------|-----------|-----------|-----------|
Column Total |        18 |        18 |        18 |        54 | 
             |     0.333 |     0.333 |     0.333 |           | 
-------------|-----------|-----------|-----------|-----------|

回答2:

the reshape package should do the trick.

> library(reshape)

> df <- data.frame(PLANT = c("Tragopogon_pratensis","Anthriscus_sylvestris","Anthriscus_sylvestris","Heracleum_sphondylium","Anthriscus_sylvestris","Anthriscus_sylvestris","Cerastium_holosteoides"),
                   ANIMAL= c("Propylea_quatuordecimpunctata","Rhagonycha_nigriventris","Sarcophaga_carnaria","Sarcophaga_carnaria","Sarcophaga_variegata","Sphaerophoria_interrupta_Gruppe","Sphaerophoria_interrupta_Gruppe"),
                   INTERACTIONS = c(1,3,2,1,4,3,1),
                   stringsAsFactors=FALSE)

> df <- melt(df,id.vars=c("PLANT","ANIMAL"))    
> df <- cast(df,formula=PLANT~ANIMAL)
> df <- replace(df,is.na(df),0)

> df
                   PLANT Propylea_quatuordecimpunctata Rhagonycha_nigriventris
1  Anthriscus_sylvestris                             0                       3
2 Cerastium_holosteoides                             0                       0
3  Heracleum_sphondylium                             0                       0
4   Tragopogon_pratensis                             1                       0
  Sarcophaga_carnaria Sarcophaga_variegata Sphaerophoria_interrupta_Gruppe
1                   2                    4                               3
2                   0                    0                               1
3                   1                    0                               0
4                   0                    0                               0

I'm still figuring out how to fix the order issue, any suggestion?

回答3:

I'd like to point out that we can get the same results Andrie posted without using the function with:

R Base Package

# 3 options
table(warpbreaks[, 2:3])
table(warpbreaks[, c("wool", "tension")])
table(warpbreaks$wool, warpbreaks$tension, dnn = c("wool", "tension"))

    tension
wool L M H
   A 9 9 9
   B 9 9 9

Package gmodels:

library(gmodels)
# 2 options    
CrossTable(warpbreaks$wool, warpbreaks$tension)
CrossTable(warpbreaks$wool, warpbreaks$tension, dnn = c("Wool", "Tension"))


   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|


Total Observations in Table:  54 


                | warpbreaks$tension 
warpbreaks$wool |         L |         M |         H | Row Total | 
----------------|-----------|-----------|-----------|-----------|
              A |         9 |         9 |         9 |        27 | 
                |     0.000 |     0.000 |     0.000 |           | 
                |     0.333 |     0.333 |     0.333 |     0.500 | 
                |     0.500 |     0.500 |     0.500 |           | 
                |     0.167 |     0.167 |     0.167 |           | 
----------------|-----------|-----------|-----------|-----------|
              B |         9 |         9 |         9 |        27 | 
                |     0.000 |     0.000 |     0.000 |           | 
                |     0.333 |     0.333 |     0.333 |     0.500 | 
                |     0.500 |     0.500 |     0.500 |           | 
                |     0.167 |     0.167 |     0.167 |           | 
----------------|-----------|-----------|-----------|-----------|
   Column Total |        18 |        18 |        18 |        54 | 
                |     0.333 |     0.333 |     0.333 |           | 
----------------|-----------|-----------|-----------|-----------|

回答4:

xtabs in base R should work, for example:

dat <- data.frame(PLANT = c("p1", "p2", "p2", "p4", "p5", "p5", "p6"),
                  ANIMAL = c("a1", "a2", "a3", "a3", "a4", "a5", "a5"),
                  INTERACTIONS = c(1,3,2,1,4,3,1),
                  stringsAsFactors = FALSE)

(x2.table <- xtabs(dat$INTERACTIONS ~ dat$PLANT + dat$ANIMAL))

     dat$ANIMAL
dat$PLANT a1 a2 a3 a4 a5
       p1  1  0  0  0  0
       p2  0  3  2  0  0
       p4  0  0  1  0  0
       p5  0  0  0  4  3
       p6  0  0  0  0  1

chisq.test(x2.table, simulate.p.value = TRUE)

I think that should do what you're looking for fairly easily. I'm not sure how it scales up in terms of efficiency to a 10E5 contingency table, but that might be a separate issue statistically.

回答5:

With dplyr / tidyr:

df <- read.table(text='PLANT                  ANIMAL                          INTERACTIONS
                 Tragopogon_pratensis   Propylea_quatuordecimpunctata         1
                 Anthriscus_sylvestris  Rhagonycha_nigriventris               3
                 Anthriscus_sylvestris  Sarcophaga_carnaria                   2
                 Heracleum_sphondylium  Sarcophaga_carnaria                   1
                 Anthriscus_sylvestris  Sarcophaga_variegata                  4
                 Anthriscus_sylvestris  Sphaerophoria_interrupta_Gruppe       3
                 Cerastium_holosteoides Sphaerophoria_interrupta_Gruppe       1', header=TRUE)
library(dplyr)
library(tidyr)
df %>% spread(ANIMAL, INTERACTIONS, fill=0)

#                    PLANT Propylea_quatuordecimpunctata Rhagonycha_nigriventris Sarcophaga_carnaria Sarcophaga_variegata Sphaerophoria_interrupta_Gruppe
# 1  Anthriscus_sylvestris                             0                       3                   2                    4                               3
# 2 Cerastium_holosteoides                             0                       0                   0                    0                               1
# 3  Heracleum_sphondylium                             0                       0                   1                    0                               0
# 4   Tragopogon_pratensis                             1                       0                   0                    0                               0

回答6:

Simply use dcast() function of "reshape2" package:

ans = dcast( df, PLANT~ ANIMAL,value.var = "INTERACTIONS", fill = 0 )

Here "PLANT" will be on the left column, "ANIMALS" on the top row, filling of the table will happen using "INTERACTIONS" and "NULL" values will be filled using 0's.

来源：https://stackoverflow.com/questions/7442207/how-do-i-get-a-contingency-table

标签

contingency