Can someone tell me why R is not using the whole data.frame for this chisq.test?

问题

I can't come up with a solution to a problem I've had when trying to create my own data.frame and run a quantitative analysis (such as a chisq.test) on it.

The backdrop is as follows: I've summarized data I received relating to two hospitals. Both measured the same categorical variable n number of times. In this case it's how frequently health-care associated bacteria were found during a specific observation period.

In a table, the summarized data looks as follows, where % indicates the percentage of all measurements made during the time period.

                                    n Hospital 1 (%)      n Hospital 2 (%)
Healthcare associated bacteria          829 (59.4)            578 (57.6)
Community associated bacteria           473 (33.9)            372 (37.1)
Contaminants                             94 (6.7)              53 (5.3)
Total                                  1396 (100.0)          1003 (100.0)

Now looking at the percentages, it's evident that the proportions are very similar and you may wonder why on earth I want to statistically compare the two hospitals. But I have other data, where the proportions are different and so the aims of this question is:

How to compare Hospital 1 to Hospital 2 with regards to the categories measured.

As the data is provided in a summarized fashion and in an array format, I decided to make a data.frame for each of the single variables/categories.

hosp1 <- rep(c("Yes", "No"), times=c(829,567))
hosp2 <- rep(c("Yes", "No"), times=c(578,425))
all <- cbind(hosp1, c(hosp2,rep(NA, length(hosp1)-length(hosp2))))
all <- data.frame(all)
names(all)[2]<-"hosp2"
summary(all)

So far so good, because the summary seems to look right to be able to now run a chisq.test(). But now, things get weird.

with(all, chisq.test(hosp1, hosp2, correct=F))

    Pearson's Chi-squared test

data:  hosp1 and hosp2
X-squared = 286.3087, df = 1, p-value < 2.2e-16

The results, seem to indicate that there's a significant difference. If you crosstabulate the data, you see that R is summarizing it in a very strange way:

with(all, table(hosp1, hosp2))

       No Yes
  No  174   0
  Yes 251 578

So of course if the data is summarized in that way, there'll be a statistically significant finding - because one category is summarized as having no measurments at all. Why on earth is this happening and what can I do to correct it? Finally, instead of having to make a separate data.frame for each category, is there anyway evident function to loop it? I can't come up with one.

Thanks for your help!

UPDATED BASED ON THELATEMAIL'S REQUEST FOR RAW DATA.FRAME

dput(SO_Example_v1)
structure(list(Type = structure(c(3L, 1L, 2L), .Label = c("Community", 
"Contaminant", "Healthcare"), class = "factor"), hosp1_WoundAssocType = c(464L, 
285L, 24L), hosp1_BloodAssocType = c(73L, 40L, 26L), hosp1_UrineAssocType = c(75L, 
37L, 18L), hosp1_RespAssocType = c(137L, 77L, 2L), hosp1_CathAssocType = c(80L, 
34L, 24L), hosp2_WoundAssocType = c(171L, 115L, 17L), hosp2_BloodAssocType = c(127L, 
62L, 12L), hosp2_UrineAssocType = c(50L, 29L, 6L), hosp2_RespAssocType = c(135L, 
142L, 6L), hosp2_CathAssocType = c(95L, 24L, 12L)), .Names = c("Type", 
"hosp1_WoundAssocType", "hosp1_BloodAssocType", "hosp1_UrineAssocType", 
"hosp1_RespAssocType", "hosp1_CathAssocType", "hosp2_WoundAssocType", 
"hosp2_BloodAssocType", "hosp2_UrineAssocType", "hosp2_RespAssocType", 
"hosp2_CathAssocType"), class = "data.frame", row.names = c(NA, 
-3L))

Explanation: This data.frame is actually more complicated than what is summarized in the table above, as it also contains where the specific types of bacteria where cultured (i.e. in wounds, blood cultures, catheters etc.). So the table that I'm making actually looks as follows:

                                                 All locations
                                n Hospital 1 (%)      n Hospital 2 (%)  p-val
Healthcare associated bacteria     829 (59.4)            578 (57.6)     0.39
Community associated bacteria      473 (33.9)            372 (37.1)     ...
Contaminants                       94 (6.7)              53 (5.3)       ...
Total                              1396 (100.0)          1003 (100.0)   -

Where the heading "All locations", will then subsequently be replaced by wound, blood, urine, catheter etc.

回答1:

The answer to the question about how to get the p-values to work is somewhat simple. You can get the other two p-values that you're looking for using the same syntax as @thelatemail used as follows:

#community (p = 0.1049)
chisq.test(cbind(c(473,923),c(372,631)),correct=FALSE)

#contaminants (p = 0.1443)
chisq.test(cbind(c(94,1302),c(53,950)),correct=FALSE)

You can get these answers more programmatically as follows:

out <- cbind(rowSums(SO_Example_v1[,2:6]),rowSums(SO_Example_v1[,7:11]))
chisq.test(rbind(out[1,],colSums(out[2:3,])),correct=FALSE)
chisq.test(rbind(out[2,],colSums(out[c(1,3),])),correct=FALSE)
chisq.test(rbind(out[3,],colSums(out[1:2,])),correct=FALSE)

Of course, we're getting beyond the scope of SO at this point, but perhaps a more pertinent question given the nature of the data is if there's a difference between the hospitals overall, which you can answer (from a frequentist perspective) using a chi-square test based on all three types:

chisq.test(out,correct=FALSE)

来源：https://stackoverflow.com/questions/28979696/can-someone-tell-me-why-r-is-not-using-the-whole-data-frame-for-this-chisq-test

标签

function

chi-squared