问题
I am trying to create a contingency table from a particular type of data. This would be doable with loops etc... but because my final table would contain more than 10E5 cells, I am looking for a pre-existing function.
My initial data are as follow:
PLANT ANIMAL INTERACTIONS
---------------------- ------------------------------- ------------
Tragopogon_pratensis Propylea_quatuordecimpunctata 1
Anthriscus_sylvestris Rhagonycha_nigriventris 3
Anthriscus_sylvestris Sarcophaga_carnaria 2
Heracleum_sphondylium Sarcophaga_carnaria 1
Anthriscus_sylvestris Sarcophaga_variegata 4
Anthriscus_sylvestris Sphaerophoria_interrupta_Gruppe 3
Cerastium_holosteoides Sphaerophoria_interrupta_Gruppe 1
I would like to create a table like this:
Propylea_quatuordecimpunctata Rhagonycha_nigriventris Sarcophaga_carnaria Sarcophaga_variegata Sphaerophoria_interrupta_Gruppe
---------------------- ----------------------------- ----------------------- ------------------- -------------------- -------------------------------
Tragopogon_pratensis 1 0 0 0 0
Anthriscus_sylvestris 0 3 2 4 3
Heracleum_sphondylium 0 0 1 0 0
Cerastium_holosteoides 0 0 0 0 1
That is, all plant species in row, all animal species in columns, and sometimes there are no interactions (while my initial data only list interactions that occur).
回答1:
In base R, use table
or xtabs
:
with(warpbreaks, table(wool, tension))
tension
wool L M H
A 9 9 9
B 9 9 9
xtabs(~wool+tension, data=warpbreaks)
tension
wool L M H
A 9 9 9
B 9 9 9
The gmodels
packages has a function CrossTable
that gives output similar to what users of SPSS or SAS expects:
library(gmodels)
with(warpbreaks, CrossTable(wool, tension))
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 54
| tension
wool | L | M | H | Row Total |
-------------|-----------|-----------|-----------|-----------|
A | 9 | 9 | 9 | 27 |
| 0.000 | 0.000 | 0.000 | |
| 0.333 | 0.333 | 0.333 | 0.500 |
| 0.500 | 0.500 | 0.500 | |
| 0.167 | 0.167 | 0.167 | |
-------------|-----------|-----------|-----------|-----------|
B | 9 | 9 | 9 | 27 |
| 0.000 | 0.000 | 0.000 | |
| 0.333 | 0.333 | 0.333 | 0.500 |
| 0.500 | 0.500 | 0.500 | |
| 0.167 | 0.167 | 0.167 | |
-------------|-----------|-----------|-----------|-----------|
Column Total | 18 | 18 | 18 | 54 |
| 0.333 | 0.333 | 0.333 | |
-------------|-----------|-----------|-----------|-----------|
回答2:
the reshape
package should do the trick.
> library(reshape)
> df <- data.frame(PLANT = c("Tragopogon_pratensis","Anthriscus_sylvestris","Anthriscus_sylvestris","Heracleum_sphondylium","Anthriscus_sylvestris","Anthriscus_sylvestris","Cerastium_holosteoides"),
ANIMAL= c("Propylea_quatuordecimpunctata","Rhagonycha_nigriventris","Sarcophaga_carnaria","Sarcophaga_carnaria","Sarcophaga_variegata","Sphaerophoria_interrupta_Gruppe","Sphaerophoria_interrupta_Gruppe"),
INTERACTIONS = c(1,3,2,1,4,3,1),
stringsAsFactors=FALSE)
> df <- melt(df,id.vars=c("PLANT","ANIMAL"))
> df <- cast(df,formula=PLANT~ANIMAL)
> df <- replace(df,is.na(df),0)
> df
PLANT Propylea_quatuordecimpunctata Rhagonycha_nigriventris
1 Anthriscus_sylvestris 0 3
2 Cerastium_holosteoides 0 0
3 Heracleum_sphondylium 0 0
4 Tragopogon_pratensis 1 0
Sarcophaga_carnaria Sarcophaga_variegata Sphaerophoria_interrupta_Gruppe
1 2 4 3
2 0 0 1
3 1 0 0
4 0 0 0
I'm still figuring out how to fix the order
issue, any suggestion?
回答3:
I'd like to point out that we can get the same results Andrie posted without using the function with
:
R Base Package
# 3 options
table(warpbreaks[, 2:3])
table(warpbreaks[, c("wool", "tension")])
table(warpbreaks$wool, warpbreaks$tension, dnn = c("wool", "tension"))
tension
wool L M H
A 9 9 9
B 9 9 9
Package gmodels:
library(gmodels)
# 2 options
CrossTable(warpbreaks$wool, warpbreaks$tension)
CrossTable(warpbreaks$wool, warpbreaks$tension, dnn = c("Wool", "Tension"))
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 54
| warpbreaks$tension
warpbreaks$wool | L | M | H | Row Total |
----------------|-----------|-----------|-----------|-----------|
A | 9 | 9 | 9 | 27 |
| 0.000 | 0.000 | 0.000 | |
| 0.333 | 0.333 | 0.333 | 0.500 |
| 0.500 | 0.500 | 0.500 | |
| 0.167 | 0.167 | 0.167 | |
----------------|-----------|-----------|-----------|-----------|
B | 9 | 9 | 9 | 27 |
| 0.000 | 0.000 | 0.000 | |
| 0.333 | 0.333 | 0.333 | 0.500 |
| 0.500 | 0.500 | 0.500 | |
| 0.167 | 0.167 | 0.167 | |
----------------|-----------|-----------|-----------|-----------|
Column Total | 18 | 18 | 18 | 54 |
| 0.333 | 0.333 | 0.333 | |
----------------|-----------|-----------|-----------|-----------|
回答4:
xtabs in base R should work, for example:
dat <- data.frame(PLANT = c("p1", "p2", "p2", "p4", "p5", "p5", "p6"),
ANIMAL = c("a1", "a2", "a3", "a3", "a4", "a5", "a5"),
INTERACTIONS = c(1,3,2,1,4,3,1),
stringsAsFactors = FALSE)
(x2.table <- xtabs(dat$INTERACTIONS ~ dat$PLANT + dat$ANIMAL))
dat$ANIMAL
dat$PLANT a1 a2 a3 a4 a5
p1 1 0 0 0 0
p2 0 3 2 0 0
p4 0 0 1 0 0
p5 0 0 0 4 3
p6 0 0 0 0 1
chisq.test(x2.table, simulate.p.value = TRUE)
I think that should do what you're looking for fairly easily. I'm not sure how it scales up in terms of efficiency to a 10E5 contingency table, but that might be a separate issue statistically.
回答5:
With dplyr / tidyr
:
df <- read.table(text='PLANT ANIMAL INTERACTIONS
Tragopogon_pratensis Propylea_quatuordecimpunctata 1
Anthriscus_sylvestris Rhagonycha_nigriventris 3
Anthriscus_sylvestris Sarcophaga_carnaria 2
Heracleum_sphondylium Sarcophaga_carnaria 1
Anthriscus_sylvestris Sarcophaga_variegata 4
Anthriscus_sylvestris Sphaerophoria_interrupta_Gruppe 3
Cerastium_holosteoides Sphaerophoria_interrupta_Gruppe 1', header=TRUE)
library(dplyr)
library(tidyr)
df %>% spread(ANIMAL, INTERACTIONS, fill=0)
# PLANT Propylea_quatuordecimpunctata Rhagonycha_nigriventris Sarcophaga_carnaria Sarcophaga_variegata Sphaerophoria_interrupta_Gruppe
# 1 Anthriscus_sylvestris 0 3 2 4 3
# 2 Cerastium_holosteoides 0 0 0 0 1
# 3 Heracleum_sphondylium 0 0 1 0 0
# 4 Tragopogon_pratensis 1 0 0 0 0
回答6:
Simply use dcast()
function of "reshape2
" package:
ans = dcast( df, PLANT~ ANIMAL,value.var = "INTERACTIONS", fill = 0 )
Here "PLANT" will be on the left column, "ANIMALS" on the top row, filling of the table will happen using "INTERACTIONS" and "NULL" values will be filled using 0's.
来源:https://stackoverflow.com/questions/7442207/how-do-i-get-a-contingency-table