Using ggplot2 facet grid to explore large dataset with continuous and categorical variables

Question

I have a dataset with >1000 observations belonging to either group A or group B, and ~150 categorical and continuous variables. Small version below.

set.seed(16)
mydf <- data.frame(ID = 1:50, group = sample(c("A", "B"), 50, replace = TRUE), length = rnorm(n = 50, mean = 0, sd = 1), weight = runif(50, min=0, max=1), color = sample(c("red", "orange", "yellow", "green", "blue"), 50,  replace = TRUE), size = sample(c("big", "small"), 50, replace = TRUE))

I would like to visually compare group A and group B across each of the variables. To start, I would like to make boxplot pairs showing A and B side by side for each continuous variable, and the same using bar plots for each categorical variable. I think ggplot2's facet_grid would be ideal for this, but I'm not sure how to specify the plot type according to the data type, or how to do this without specifying each variable one by one.

Interested in ggplot2 help and any alternative exploration techniques.

Answer 1:

Exploring your data is arguably the most interesting and intellectually challenging part of research, so I encourage you to read more on this topic.
Visualisation is of course important. @Parfait has suggested reshaping your data to long format, which makes plotting easier. Your mix of continuous and categorical data is a bit tricky. Beginners often try very hard to avoid reshaping their data, but there is no need to fret! On the contrary, you will find that most questions require a specific shape of your data, and in most cases there is no "one size fits all" shape.
So the real challenge is how to shape your data before plotting. There are many ways of doing this. Below is one way, which should help to "automatically" reshape the continuous columns and the categorical columns separately. See the comments in the code.

As a side note, when loading your data into R, I'd try to avoid storing categorical data as factors, and convert to factors only when you need them. How to do this depends on how you load your data. If it is from a csv, you can for example use read.csv('your.csv', stringsAsFactors = FALSE). (Since R 4.0.0 this is the default.)
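As a small illustration of that advice, here is a minimal sketch in base R. The CSV content is made up and inlined through the `text` argument so the example runs on its own. The columns are read as plain character vectors, and one of them is converted to a factor only at the point where a fixed level order is wanted.

```r
# Hypothetical CSV content, inlined so the example is self-contained
csv_text <- "group,color,length
A,red,0.4
B,blue,-1.2
A,green,0.7"

# Read strings as plain character vectors
dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(dat)

# Convert to a factor only where it is needed, e.g. to fix the level order
dat$color <- factor(dat$color, levels = c("red", "green", "blue"))
levels(dat$color)
```

Keeping the columns as character until the last moment also makes it easier to select them by type when reshaping.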

``` r
library(tidyverse)

# Gather the numeric columns (dropping ID, which is also numeric;
# I'd recommend against numeric IDs!)
data_num <- 
  mydf %>% 
  select(-ID) %>% 
  pivot_longer(cols = which(sapply(., is.numeric)), names_to = 'key', values_to =  'value')

#No need to use facet here
ggplot(data_num) +
  geom_boxplot(aes(key, value, color = group))

# Selecting the categorical columns is a bit trickier in this example,
# because your group column is also categorical.
# One way:
# first convert all categorical columns to character,
# then turn "group" into a factor,
# then gather the remaining character columns.

# I use simple count() and mutate() to create a summary data frame with the
# proportions, and geom_col(), which is equivalent to geom_bar(stat = "identity").
# There may be neater ways, but this is pretty straightforward.

data_cat <- 
  mydf %>% select(-ID) %>%
  mutate_if(.predicate = is.factor, .funs = as.character) %>%
  mutate(group = factor(group)) %>%
  pivot_longer(cols = which(sapply(., is.character)), names_to = 'key', values_to =  'value')%>%
  count(group, key, value) %>%
  group_by(group, key) %>%
  mutate(percent =  n/ sum(n)) %>%
  ungroup # I always 'ungroup' after my data manipulations, in order to avoid unexpected effects

ggplot(data_cat) +
  geom_col(aes(group, percent, fill = key)) +
  facet_grid(~ value)

```

Created on 2020-01-07 by the reprex package (v0.3.0)

Credit for how to gather conditionally goes to this answer from @H1
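With ~150 variables on different scales, a single shared axis can become hard to read. The following is a possible variation, not the answer's original code: it assumes a recent tidyverse (dplyr >= 1.0, where `across()` and `where()` are available), selects columns by type with `where()`, and gives every variable its own panel via `facet_wrap()`. The names `p_num` and `p_cat` are only illustrative.

```r
library(tidyverse)

set.seed(16)
mydf <- data.frame(
  ID = 1:50,
  group = sample(c("A", "B"), 50, replace = TRUE),
  length = rnorm(50),
  weight = runif(50),
  color = sample(c("red", "orange", "yellow", "green", "blue"), 50, replace = TRUE),
  size = sample(c("big", "small"), 50, replace = TRUE)
)

# Continuous variables: long format, one panel per variable, free y scales
data_num <- mydf %>%
  select(-ID) %>%
  pivot_longer(cols = where(is.numeric), names_to = "key", values_to = "value")

p_num <- ggplot(data_num, aes(group, value, color = group)) +
  geom_boxplot() +
  facet_wrap(~ key, scales = "free_y")

# Categorical variables: every character column except the grouping column
data_cat <- mydf %>%
  select(-ID) %>%
  mutate(across(where(is.factor), as.character)) %>%
  pivot_longer(cols = c(where(is.character), -group),
               names_to = "key", values_to = "value") %>%
  count(group, key, value) %>%
  group_by(group, key) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup()

# One panel per variable, bars stacked to 100% within each group
p_cat <- ggplot(data_cat, aes(group, percent, fill = value)) +
  geom_col() +
  facet_wrap(~ key)

p_num
p_cat
```

Because the columns are picked by type rather than by name, the same code should carry over to the full dataset without listing each variable.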

Answer 2:

For your categorical variables, a useful way to present results comparably is to show the proportion of each answer option (i.e. the % of big and small for the "size" variable, and of each color for the "color" variable). I know Stack Overflow generally asks people posting a question to first show their own attempt rather than ask for a ready-made solution, and I would suggest the same, since you learn more from your own attempt. However, I'm posting a solution here in the hope that it works as a starting point for you.

set.seed(16)
mydf <- data.frame(ID = 1:50, group = sample(c("A", "B"), 50, replace = TRUE), length = rnorm(n = 50, mean = 0, sd = 1), weight = runif(50, min=0, max=1), color = sample(c("red", "orange", "yellow", "green", "blue"), 50,  replace = TRUE), size = sample(c("big", "small"), 50, replace = TRUE))

# Data preparation for stacked barchart


library(tidyverse)  # attaches dplyr, tidyr and ggplot2, so library(dplyr) is not needed

group_color_size_df <- mydf %>%
  select(group, color, size) %>%
  mutate(color = factor(color),
         group=factor(group),
         size=factor(size))

# Plot faceted stacked barchart

stacked_barchart <- group_color_size_df %>%
  ggplot(aes(x = group)) +
  # position_fill() rescales each bar to 100%, so the bars show proportions
  geom_bar(aes(fill = size), width = .35, position = position_fill(reverse = TRUE)) +
  # let the scales package format the proportion axis as percentages
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Group-Size relation") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.text.y = element_blank(),
        axis.line = element_line(colour = "black")) +
  scale_fill_discrete(name = "Size", labels = c("big", "small")) +
  coord_flip() +
  facet_grid(group ~ ., switch = "y", scales = "free", space = "free")

stacked_barchart

You can accordingly plot the color variable against the group.
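For instance, the same idea applied to the color variable might look like the sketch below. It is a simplified variant rather than a copy of the chart above: the theme tweaks and the facet are left out, and the object name `color_barchart` is only illustrative.

```r
library(tidyverse)

set.seed(16)
mydf <- data.frame(
  ID = 1:50,
  group = sample(c("A", "B"), 50, replace = TRUE),
  length = rnorm(50),
  weight = runif(50),
  color = sample(c("red", "orange", "yellow", "green", "blue"), 50, replace = TRUE),
  size = sample(c("big", "small"), 50, replace = TRUE)
)

color_barchart <- mydf %>%
  mutate(group = factor(group), color = factor(color)) %>%
  ggplot(aes(x = group, fill = color)) +
  # position_fill() rescales each group's bar to 100%
  geom_bar(width = .35, position = position_fill(reverse = TRUE)) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Group-Color relation", y = "proportion", fill = "Color") +
  coord_flip()

color_barchart
```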

Now for your continuous variables, the boxplot is a good idea. You only need to use spread() (tidyr package) on the "group" variable to create two columns, "A" and "B":

# Data wrangling for boxplot

length_per_group <- mydf %>%
  select(group, length, weight) %>%
  spread(., group, length) %>%
  select(A,B)

Here you won't need a facet; just draw a boxplot, since the columns "A" and "B" now each contain the "length" data for one group. Then you can replace length with weight and repeat the process to box-plot the "weight" variable.
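The boxplot step itself is not shown in this answer, so here is one way it could be done, as a sketch only. It assumes base R's `boxplot()`, which draws one box per column of a data frame and ignores the `NA` cells that `spread()` leaves behind (each row holds a value in either "A" or "B", not both). The last line shows an alternative that skips the reshaping and maps `group` to the x axis directly.

```r
library(tidyverse)

set.seed(16)
mydf <- data.frame(
  ID = 1:50,
  group = sample(c("A", "B"), 50, replace = TRUE),
  length = rnorm(50),
  weight = runif(50),
  color = sample(c("red", "orange", "yellow", "green", "blue"), 50, replace = TRUE),
  size = sample(c("big", "small"), 50, replace = TRUE)
)

# Same wrangling as in the answer: one column per group
length_per_group <- mydf %>%
  select(group, length, weight) %>%
  spread(group, length) %>%
  select(A, B)

# Base R: one box per column, NA cells are dropped
boxplot(length_per_group, main = "length", ylab = "length")

# Alternative on the original data, without spread()
ggplot(mydf, aes(x = group, y = length)) +
  geom_boxplot()
```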

I hope this helps. Once you have given it a try, let us know if more help is needed.

Answer 3:

What if you made the plots separately and then pieced them together in a grid?

set.seed(16)
mydf <- data.frame(ID = 1:50, group = sample(c("A", "B"), 50, replace = TRUE), length = rnorm(n = 50, mean = 0, sd = 1), weight = runif(50, min=0, max=1), color = sample(c("red", "orange", "yellow", "green", "blue"), 50,  replace = TRUE), size = sample(c("big", "small"), 50, replace = TRUE))


mydf


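library(tidyverse)
library(cowplot)

# Boxplots for the continuous variables, one panel per variable.
# pivot_longer() from tidyr replaces reshape::melt(), so the reshape package is not needed.
plot_continuous <- mydf %>%
    pivot_longer(cols = c(length, weight), names_to = "variable", values_to = "value") %>%
    ggplot(aes(x = group, y = value)) +
    geom_boxplot() +
    facet_wrap(~variable)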

plot_color <- mydf %>%
    count(group, color) %>%
    ggplot(aes(x = group, y = n)) +
    geom_col(aes(fill = color), position = "dodge") +
    ggtitle("Color")

plot_size <- mydf %>%
    count(group, size) %>%
    ggplot(aes(x = group, y = n)) +
    geom_col(aes(fill = size), position = "dodge") +
    ggtitle("Size")



plot_grid(plot_continuous, plot_color, plot_size, ncol = 2)
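To avoid writing one plot per variable by hand, the same "separate plots, then a grid" idea can be wrapped in a loop. The sketch below is one possible way to do that, not part of the original answer. `plot_one()` is a hypothetical helper that picks a boxplot or a dodged bar chart from the column's type, and the `.data` pronoun lets ggplot2 look a column up by its name as a string.

```r
library(tidyverse)
library(cowplot)

set.seed(16)
mydf <- data.frame(
  ID = 1:50,
  group = sample(c("A", "B"), 50, replace = TRUE),
  length = rnorm(50),
  weight = runif(50),
  color = sample(c("red", "orange", "yellow", "green", "blue"), 50, replace = TRUE),
  size = sample(c("big", "small"), 50, replace = TRUE)
)

# Every column except the ID and the grouping column
vars <- setdiff(names(mydf), c("ID", "group"))

# Hypothetical helper: choose the plot type from the column's type
plot_one <- function(v) {
  if (is.numeric(mydf[[v]])) {
    ggplot(mydf, aes(x = group, y = .data[[v]])) +
      geom_boxplot() +
      labs(title = v, y = v)
  } else {
    ggplot(mydf, aes(x = group, fill = .data[[v]])) +
      geom_bar(position = "dodge") +
      labs(title = v, fill = v)
  }
}

plots <- lapply(vars, plot_one)

# cowplot arranges the list of plots in a grid
plot_grid(plotlist = plots, ncol = 2)
```

For ~150 variables a single grid would be too crowded, so in practice the list could be split into chunks and each chunk saved to its own page or file.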

Source: https://stackoverflow.com/questions/59556286/using-ggplot2-facet-grid-to-explore-large-dataset-with-continuous-and-categorica

Tags

ggplot2

data-visualization

frame