Adding sample size to a box plot at the min or max of the facet in ggplot

Submitted by 吃可爱长大的小学妹 on 2021-01-01 08:07:14

Question


There are plenty of explanations, including this good one, of how to label box plots with sample size. All of them seem to use max(x) or median(x) to position the sample size.

I'm wondering if there is an easy way to position the labels at the top or bottom of the plot, especially when using the scales = "free_y" argument in facet_grid, where ggplot picks the minimum and maximum of the axis automatically for each facet.

The reason is that I am creating multiple facets where the distributions are narrow and the facets are small. It would be easier to read the sample size if it were positioned at the top or bottom of the plot...but I'd like to use "free_y" because there are meaningful differences in some facets that are obscured by the facets that have much larger spans in the data.

Using a slightly modified example from the linked post:

library(ggplot2)

# function for number of observations
# (returns a data frame, which stat_summary accepts as fun.data output)
give.n <- function(x){
  return(data.frame(y = median(x)*1.05, label = length(x)))
  # experiment with the multiplier to find the perfect position
}

# function for mean labels
mean.n <- function(x){
  return(data.frame(y = median(x)*0.97, label = round(mean(x),2)))
  # experiment with the multiplier to find the perfect position
}

# plot (fun.data alone determines the label position and text)
ggplot(mtcars, aes(factor(cyl), mpg, label=rownames(mtcars))) +
  geom_boxplot(fill = "grey80", colour = "#3366FF") +
  stat_summary(fun.data = give.n, geom = "text") +
  stat_summary(fun.data = mean.n, geom = "text", colour = "red") +
  facet_grid(cyl~., scales="free_y")

Given this setup, how could I find the min or max of the y axis for each facet and position the sample size there, instead of at the median, min or max of each box-and-whisker?

EDIT

I'm updating the question with information from R.S.'s answer below. It's still not answered yet, but their suggestion provides a solution for where to find this information.

ggplot_build(gg)$layout$panel_ranges[[order(levels(factor(mtcars$cyl)))[1]]]$y.range[1]

gives the minimum of the y range for the first level of factor(mtcars$cyl). So, by my logic, we need to build the plot without the stat_summary statements, then find the sample size and the minimum of the y range inside the give.n function. After that, we can add the stat_summary statement to the plot, like below:

# plot
gg = ggplot(mtcars, aes(factor(cyl), mpg, label=rownames(mtcars))) +
  geom_boxplot(fill = "grey80", colour = "#3366FF") +
  facet_grid(cyl~., scale="free_y")

# function for number of observations 
give.n <- function(x){
  return(c(y = ggplot_build(gg)$layout$panel_ranges[[order(levels(factor(mtcars$cyl)))[x]]]$y.range[1], label = length(x))) 
  # experiment with the multiplier to find the perfect position
}

gg +
  stat_summary(fun.data = give.n, geom = "text", fun.y = "median")

But...the above code doesn't work because I don't really understand what the give.n function is iterating over. Replacing [[x]] with any of 1:3 plots all the sample sizes at the minimum for that facet, so that is progress.

Here is the plot using [[2]], so all sample sizes are plotted at 17.62, the minimum value of the range for the second facet.
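On the question of what give.n iterates over: stat_summary calls fun.data once per group, passing the vector of y values for that group. So x inside give.n is a numeric vector of mpg values, not a panel index, which is why indexing the panel list with [[x]] cannot work. The following is a minimal base-R sketch of that grouping, not ggplot2's internal code; give.n.demo and calls are hypothetical names used only for illustration.

```r
# Hypothetical illustration: mimic the per-group calls with base R.
# Each call receives the mpg values of one cyl group as the vector x.
give.n.demo <- function(x) {
  data.frame(n_values = length(x), first_value = x[1])
}

# split() produces one mpg vector per cyl level (4, 6, 8)
per_group <- lapply(split(mtcars$mpg, mtcars$cyl), give.n.demo)
calls <- do.call(rbind, per_group)
print(calls)
```

The n_values column matches the sample sizes that end up as labels, one per cyl group.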


Answer 1:


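You can examine the structure of the built plot using ggplot_build; in particular, the x and y panel ranges are stored in layout. (The code here uses layout$panel_ranges from ggplot2 2.2.x; from ggplot2 3.0.0 onward the same information is stored in layout$panel_params.) Assign your plot to an object and look at the structure: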

gg <- ggplot(mtcars, aes(factor(cyl), mpg, label=rownames(mtcars))) +
  geom_boxplot(fill = "grey80", colour = "#3366FF") +
  stat_summary(fun.data = give.n, geom = "text") +
  stat_summary(fun.data = mean.n, geom = "text", colour = "red") +
  facet_grid(cyl~., scales="free_y")

ggplot_build(gg)

In particular you will be interested in:

  ggplot_build(gg)$layout$panel_ranges

The y limits of the three panels are given as c(ymin, ymax) and stored under:

 ggplot_build(gg)$layout$panel_ranges[[1]]$y.range
 ggplot_build(gg)$layout$panel_ranges[[2]]$y.range
 ggplot_build(gg)$layout$panel_ranges[[3]]$y.range
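As a hedged sketch for newer ggplot2 releases, the same ranges can be collected into one matrix. This assumes ggplot2 3.0.0 or later, where the list is named panel_params, and falls back to panel_ranges otherwise. The names p_demo, built, params and y_ranges are hypothetical and not part of the original answer.

```r
library(ggplot2)

# A stripped-down version of the plot, without the text layers
p_demo <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot() +
  facet_grid(cyl ~ ., scales = "free_y")

built <- ggplot_build(p_demo)

# Assumption: panel_params in ggplot2 >= 3.0.0, panel_ranges in 2.2.x
params <- built$layout$panel_params
if (is.null(params)) params <- built$layout$panel_ranges

# One row per panel, columns are the lower and upper y limit
y_ranges <- t(vapply(params, function(p) p$y.range, numeric(2)))
colnames(y_ranges) <- c("ymin", "ymax")
print(y_ranges)
```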

Edit, in response to a comment asking how to incorporate this layout information into the plot: here we calculate the summary statistics per cyl group with dplyr and store them in a separate data frame to pass to ggplot2, instead of using stat_summary.

 library(dplyr)
 gg.summary <- group_by(mtcars, cyl) %>% summarise(mean=mean(mpg), median=median(mpg), length=length(mpg))

Extract the lower y limit of each panel and add it to the summary data frame. The summary data frame is grouped by cyl, which is the variable we are faceting on:

 gg.summary$panel.ylim <- sapply(order(levels(factor(mtcars$cyl))), function(x) ggplot_build(gg)$layout$panel_ranges[[x]]$y.range[1])
 # # A tibble: 3 x 5
 # cyl     mean median length panel.ylim
 # <dbl>    <dbl>  <dbl>  <int>      <dbl>
 # 1     4 26.66364   26.0     11     20.775
 # 2     6 19.74286   19.7      7     17.620
 # 3     8 15.10000   15.2     14      9.960

Use it in ggplot. I believe this is the plot you want:

 gg + geom_text(data=gg.summary, aes(x=factor(cyl), y=panel.ylim, label=paste("n =",length))) +
   geom_text(data=gg.summary, aes(x=factor(cyl), y=median*0.97, label=format(median, nsmall=2)))
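A different technique, not the one used in this answer, avoids reading the panel ranges altogether: ggplot2 places infinite positions at the panel edge, so mapping y to -Inf (or Inf) pins a label to the bottom (or top) of each facet even with free scales. The sketch below assumes that behaviour, and the vjust value is a guess to keep the text inside the panel; counts and p are hypothetical names.

```r
library(ggplot2)
library(dplyr)

# Sample size per cyl group; count() adds a column named n
counts <- count(mtcars, cyl)

p <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot(fill = "grey80", colour = "#3366FF") +
  # y = -Inf puts the label on the lower panel edge;
  # a negative vjust nudges the text upward so it is not clipped
  geom_text(data = counts,
            aes(x = factor(cyl), y = -Inf, label = paste("n =", n)),
            vjust = -0.5) +
  facet_grid(cyl ~ ., scales = "free_y")

print(p)
```

Using y = Inf with a positive vjust (for example 1.5) should place the labels at the top of each facet instead.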



Source: https://stackoverflow.com/questions/42822273/adding-sample-size-to-a-box-plot-at-the-min-or-max-of-the-facet-in-ggplot
