Plot the equivalent of correlation matrix for factors (categorical data)? And mixed types?

痴心易碎 提交于 2020-07-05 06:55:11

问题


Actually there are 2 questions, one is more advanced than the other.

Q1: I am looking for a method that similar to corrplot() but can deal with factors.

I originally tried to use chisq.test() then calculate the p-value and Cramer's V as correlation, but there too many columns to figure out. So could anyone tell me if there is a quick way to create a "corrplot" that each cell contains the value of Cramer's V, while the colour is rendered by p-value. Or any other kind of similar plot.

Regarding Cramer's V, let's say tbl is a 2-dimensional factor data frame.

chi2 <- chisq.test(tbl, correct=F)
Cramer_V <- sqrt(chi2$/nrow(tbl)) 

I prepared a test data frame with factors:

df <- data.frame(
group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
student = c('01', '01', '01', '02', '02', '01', '02'),
exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
) 

Q2: Then I would like to compute a correlation/association matrix on a mixed-types dataframe e.g.:

df <- data.frame(
group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
student = c('01', '01', '01', '02', '02', '01', '02'),
exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
) 
df$group <- factor(df$group, levels = c('A', 'B', 'C'), ordered = T)
df$student <- as.integer(df$student)

回答1:


Here's a tidyverse solution:

# example dataframe
df <- data.frame(
  group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
  student = c('01', '01', '01', '02', '02', '01', '02'),
  exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
  subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
) 

library(tidyverse)
library(lsr)

# function to get chi square p value and Cramers V
f = function(x,y) {
    tbl = df %>% select(x,y) %>% table()
    chisq_pval = round(chisq.test(tbl)$p.value, 4)
    cramV = round(cramersV(tbl), 4) 
    data.frame(x, y, chisq_pval, cramV) }

# create unique combinations of column names
# sorting will help getting a better plot (upper triangular)
df_comb = data.frame(t(combn(sort(names(df)), 2)), stringsAsFactors = F)

# apply function to each variable combination
df_res = map2_df(df_comb$X1, df_comb$X2, f)

# plot results
df_res %>%
  ggplot(aes(x,y,fill=chisq_pval))+
  geom_tile()+
  geom_text(aes(x,y,label=cramV))+
  scale_fill_gradient(low="red", high="yellow")+
  theme_classic()

Note that I'm using lsr package to calculate Cramers V using the cramersV function.




回答2:


The solution from @AntoniosK can be improved as suggested by @J.D. to also allow for mixed data-frames including both nominal and numerical attributes. Strength of association is calculated for nominal vs nominal with a bias corrected Cramer's V, numeric vs numeric with Spearman (default) or Pearson correlation, and nominal vs numeric with ANOVA.

require(tidyverse)
require(rcompanion)


# Calculate a pairwise association between all variables in a data-frame. In particular nominal vs nominal with Chi-square, numeric vs numeric with Pearson correlation, and nominal vs numeric with ANOVA.
# Adopted from https://stackoverflow.com/a/52557631/590437
mixed_assoc = function(df, cor_method="spearman", adjust_cramersv_bias=TRUE){
    df_comb = expand.grid(names(df), names(df),  stringsAsFactors = F) %>% set_names("X1", "X2")

    is_nominal = function(x) class(x) %in% c("factor", "character")
    # https://community.rstudio.com/t/why-is-purr-is-numeric-deprecated/3559
    # https://github.com/r-lib/rlang/issues/781
    is_numeric <- function(x) { is.integer(x) || is_double(x)}

    f = function(xName,yName) {
        x =  pull(df, xName)
        y =  pull(df, yName)

        result = if(is_nominal(x) && is_nominal(y)){
            # use bias corrected cramersV as described in https://rdrr.io/cran/rcompanion/man/cramerV.html
            cv = cramerV(as.character(x), as.character(y), bias.correct = adjust_cramersv_bias)
            data.frame(xName, yName, assoc=cv, type="cramersV")

        }else if(is_numeric(x) && is_numeric(y)){
            correlation = cor(x, y, method=cor_method, use="complete.obs")
            data.frame(xName, yName, assoc=correlation, type="correlation")

        }else if(is_numeric(x) && is_nominal(y)){
            # from https://stats.stackexchange.com/questions/119835/correlation-between-a-nominal-iv-and-a-continuous-dv-variable/124618#124618
            r_squared = summary(lm(x ~ y))$r.squared
            data.frame(xName, yName, assoc=sqrt(r_squared), type="anova")

        }else if(is_nominal(x) && is_numeric(y)){
            r_squared = summary(lm(y ~x))$r.squared
            data.frame(xName, yName, assoc=sqrt(r_squared), type="anova")

        }else {
            warning(paste("unmatched column type combination: ", class(x), class(y)))
        }

        # finally add complete obs number and ratio to table
        result %>% mutate(complete_obs_pairs=sum(!is.na(x) & !is.na(y)), complete_obs_ratio=complete_obs_pairs/length(x)) %>% rename(x=xName, y=yName)
    }

    # apply function to each variable combination
    map2_df(df_comb$X1, df_comb$X2, f)
}

Using the method, we can analyse a wide range of mixed variable data-frames easily:

mixed_assoc(iris)
              x            y      assoc        type complete_obs_pairs 
1  Sepal.Length Sepal.Length  1.0000000 correlation                150
2   Sepal.Width Sepal.Length -0.1667777 correlation                150
3  Petal.Length Sepal.Length  0.8818981 correlation                150
4   Petal.Width Sepal.Length  0.8342888 correlation                150
5       Species Sepal.Length  0.7865785       anova                150
6  Sepal.Length  Sepal.Width -0.1667777 correlation                150
7   Sepal.Width  Sepal.Width  1.0000000 correlation                150
25      Species      Species  1.0000000    cramersV                150

This can also be used along with the excellent corrr package, e.g. to draw a correlation network graph:

require(corrr)

msleep %>%
    select(- name) %>%
    mixed_assoc() %>%
    select(x, y, assoc) %>%
    spread(y, assoc) %>%
    column_to_rownames("x") %>%
    as.matrix %>%
    as_cordf %>%
    network_plot()




回答3:


Regarding Q1, you can use ?pairs.table from the vcd package, if you first convert your data frame with ?structable (from the same package). This will give you a plot matrix of mosaic plots. That isn't quite the same as what corrplot() does, but I suspect it would be a more useful visualization.

df <- data.frame(
  ... 
) 
library(vcd)
st <- structable(~group+student+exam_pass+subject, df)
st
#                 student       01                    02             
#                 subject Japanese Math Science Japanese Math Science
# group exam_pass                                                    
# A     N                        0    0       1        0    1       0
#       Y                        1    1       0        0    0       1
# B     N                        0    0       0        0    0       0
#       Y                        1    0       0        0    0       0
# C     N                        0    0       0        0    1       0
#       Y                        0    0       0        0    0       0
pairs(st)

There are a variety of other plots that are appropriate for categorical-categorical data, such as sieve plots, association plots, and pressure plots (see my question on Cross Validated here: Alternative to sieve / mosaic plots for contingency tables). You could write your own pairs-based function to put whatever you want in the upper or lower triangle panels (see my question here: Pairs matrix with qq-plots) if you don't prefer mosaic plots. Just remember that while plot matrices are very useful, they only ever display marginal projections (to understand this more fully, see my answers on CV here: Is there a difference between 'controlling for' and 'ignoring' other variables in multiple regression?, and here: Alternatives to three dimensional scatter plot).

Regarding Q2, you would need to write a custom function.




回答4:


If you want to have a genuine correlation plot for factors or mixed-type, you can also use model.matrix to one-hot encode all non-numeric variables. This is quite different than calculating Cramér's V as it will consider your factor as separate variables, as many regression models do.

You can then use your favorite correlation-plot library. I personally like ggcorrplot for its ggplot2 compatibility.

Here is an example with your dataset:

library(ggcorrplot)
model.matrix(~0+., data=df) %>% 
  cor(use="pairwise.complete.obs") %>% 
  ggcorrplot(show.diag = F, type="lower", lab=TRUE, lab_size=2)



来源:https://stackoverflow.com/questions/52554336/plot-the-equivalent-of-correlation-matrix-for-factors-categorical-data-and-mi

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!