Question
I have a data frame of items, and each has multiple classifier columns that are categorical variables.
ID test1 test2 test3
1 A B A
2 B A C
3 C C C
4 A A B
5 B B B
6 B A C
I want to generate a heatmap for each combination of test columns (test1 v test2, test1 v test3, etc.) using ggplot2. The heatmap would have all factors in that test's column (in this case A,B,C) on the x-side and all factors of the other test on the y-side, and the boxes in the heatmap should be colored based on the count of ids that have that combination of classifier.
For example in the above input, if we have heatmap between test1 and test2, then the box that is in the intersection of B for test1 and A for test2 would be brightest, since there are 2 ids with that combination. I hope to use these heatmaps to analyze which tests are most congruent for the data set, but can't use a Pearson's R correlation since they are categorical variables.
I am familiar with ggplot2, which is why I prefer that package, but if it is easier in pheatmap, I am okay with learning that.
Answer 1:
It took me some time to work out how to do this, and I am still not sure it is the best way.
Data:
dat = structure(list(ID = 1:6,
test1 = c("A", "B", "C", "A", "B", "B"),
test2 = c("B", "A", "C", "A", "B", "A"),
test3 = c("A", "C", "C", "B", "B", "C")
),
.Names = c("ID", "test1", "test2", "test3"),
class = "data.frame", row.names = c(NA, -6L)
)
Libraries
library(tidyverse)
library(ggthemes)
library(gridExtra)
Create all combinations of factor levels (and of tests) taken 2 at a time
fcombs <- expand.grid(LETTERS[1:3], LETTERS[1:3], stringsAsFactors = F)
tcombs <- as.data.frame(combn(colnames(dat[,-1]), 2), stringsAsFactors = F)
Use lapply over the test combinations: full_join each pair of test columns onto the grid of level combinations, then count the rows in each group, excluding NAs
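dtl <- lapply(tcombs, function(i){
  # keep ID plus the two test columns of this pair
  select(dat, ID, all_of(i)) %>%
    # join onto the full grid so unobserved pairs appear with ID = NA
    full_join(x = fcombs, by = c(Var1 = i[1], Var2 = i[2])) %>%
    group_by(Var1, Var2) %>%
    # N = number of IDs that have this pair of levels
    mutate(N = sum(!is.na(ID)), ID = NULL) %>%
    ungroup()
}
)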
Create a list of plots
pl <- lapply(seq_along(tcombs), function(i){
gtitle = paste(tcombs[[i]], collapse = " ~ ")
dtl[[i]] %>%
ggplot(aes(x = Var1, y = Var2, fill = N)) +
geom_tile() +
theme_tufte() +
theme(axis.title = element_blank()) +
ggtitle(gtitle)
}
)
Create a list of tables (tableGrob objects)
tbl <- lapply(tcombs, function(i) tableGrob(select(dat, ID, all_of(i)),
                                            theme = ttheme_minimal()))
Put everything into the resulting list and plot
resl <- c(pl, tbl)[c(1, 4, 2, 5, 3, 6)]
grid.arrange(grobs = resl, ncol = 2, nrow = 3)
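For a single pair of tests, the counting step can also be sketched with base R's table(), which reports zero counts for unobserved combinations once both columns are factors with the same levels. This is only a sketch, not the method used above: the helper name pair_counts and the fixed level set A, B, C are assumptions made for illustration, and the plot is drawn only if ggplot2 is installed.

```r
# Example data from the question
dat <- data.frame(
  ID    = 1:6,
  test1 = c("A", "B", "C", "A", "B", "B"),
  test2 = c("B", "A", "C", "A", "B", "A"),
  test3 = c("A", "C", "C", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count IDs for every pair of levels of two columns.
# Assumes both columns share the level set given in `lvls`.
pair_counts <- function(df, a, b, lvls = c("A", "B", "C")) {
  tab <- table(factor(df[[a]], levels = lvls),
               factor(df[[b]], levels = lvls))
  out <- as.data.frame(tab, stringsAsFactors = FALSE)
  names(out) <- c("x", "y", "N")
  out
}

cnt <- pair_counts(dat, "test1", "test2")

# Draw the heatmap only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(cnt, ggplot2::aes(x = x, y = y, fill = N)) +
    ggplot2::geom_tile() +
    ggplot2::labs(x = "test1", y = "test2", fill = "count")
  print(p)
}
```

Calling pair_counts for each column of tcombs would give one count table per test pair, which could then be passed to the plotting step in place of dtl.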
Answer 2:
Your question is a bit unclear, but I think you are looking for something like this. I am not a ggplot2 person, so I will let someone else provide that code.
x <- read.table(text="ID test1 test2 test3
1 A B A
2 B A C
3 C C C
4 A A B
5 B B B
6 B A C", stringsAsFactors=FALSE, header=T)
xl <- reshape2::melt(data = x, id.vars="ID", variable.name = "Test", value.name="Grade")
xl$Test_Gr <- apply(xl[,2:3], 1, paste0, collapse="_")
xw <- reshape2::dcast(xl, ID ~ Test_Gr, fun.aggregate = length)
xwm <- as.matrix(xw[,-1])
xc <- t(xwm) %*% xwm
colnames(xc) <- colnames(xw)[-1]
rownames(xc) <- colnames(xw)[-1]
gplots::heatmap.2(xc, trace="none", col = rev(heat.colors(15)))
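The matrix product in this answer counts, for every pair of test_level columns, how many IDs have both. The same idea can be sketched in base R without reshape2; the names ind and co are illustrative, and this is an assumed equivalent of the code above rather than a drop-in replacement.

```r
x <- data.frame(
  ID    = 1:6,
  test1 = c("A", "B", "C", "A", "B", "B"),
  test2 = c("B", "A", "C", "A", "B", "A"),
  test3 = c("A", "C", "C", "B", "B", "C"),
  stringsAsFactors = FALSE
)

tests <- c("test1", "test2", "test3")

# One 0/1 indicator column per test/level pair, e.g. "test1_A"
ind <- do.call(cbind, lapply(tests, function(tn) {
  f <- factor(x[[tn]])
  m <- sapply(levels(f), function(l) as.integer(f == l))
  colnames(m) <- paste(tn, levels(f), sep = "_")
  m
}))

# crossprod(ind) is t(ind) %*% ind: entry [i, j] counts the IDs
# for which both indicator columns are 1
co <- crossprod(ind)

# Number of IDs with test1 == "B" and test2 == "A"
co["test1_B", "test2_A"]
```

The diagonal of co holds the per-level totals within each test, so the off-diagonal blocks between two different tests are the parts that correspond to the heatmaps asked for in the question.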
Source: https://stackoverflow.com/questions/51028547/heatmap-of-categorical-variable-counts