I created a plot with ggplot2. It's about milk protein content. I have two groups and 4 treatments. I want to show the interaction between group and treatment, means and er
TL;DR: Look at the bottom.
Consider these figures:
ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() +
theme_classic()
This is your basic plot. Now you have to consider the Y-axis.
ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() +
theme_classic() +
scale_y_continuous(limits = c(0,NA), expand = c(0,0))
This is the least misleading way of emphasizing that there is a zero floor to the data, even if there are no actual points below a certain value. Percent milk protein is a good example: negative values are impossible and you may want to emphasize that, even though no observations are near zero.
Starting the axis at zero also compresses the part of the Y axis that the data actually occupy, so the differences between observations look smaller. If that is something you want to emphasize, it can be good. But if the natural range of the data is narrow, including the zero (and the resulting empty space) is misleading. For example, if milk protein is always between 2.6% and 2.7%, then zero is not a true floor for the data; it is just as impossible as -50%.
ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() +
theme_classic() +
scale_y_continuous(limits = c(0,NA), expand = c(0,0)) +
theme(axis.line.y = element_blank()) +
annotate(geom = "segment", x = -Inf, xend = -Inf, y = -Inf, yend = Inf)
There are many reasons not to include a broken Y axis. Many people consider a break placed inside the range of the data unethical or misleading. But this particular break sits at the outer limit, beyond the actual data. I think the rules can be bent a bit for that.
The first step is to remove the automatic Y axis line and draw it in "by hand" using annotate. Notice that the figure looks identical to the previous one. If your theme of choice uses a lot of different line sizes, you're gonna have a bad time matching them by hand.
ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() +
theme_classic() +
scale_y_continuous(limits = c(3.5,NA), expand = c(0,0),
breaks = c(3.5, 4:7)) +
theme(axis.line.y = element_blank()) +
annotate(geom = "segment", x = -Inf, xend = -Inf, y = -Inf, yend = Inf)
Now you can consider where the actual data begin and where a good spot for the break is. You have to check by hand, e.g. with min(iris$Sepal.Length), and consider where the tick marks will go. This is a personal judgment call.
I found that the lowest value was 4.3. I knew I wanted the break to be below the minimum, and I wanted the break to be about 0.5 units long. So I chose to put a tick mark at 3.5, and then one at each integer afterwards, with breaks = c(3.5, 4:7).
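As a rough sketch of that check, the tick positions can be derived from the data instead of being typed in. The names y_min, first_tick, break_len, fake_zero and y_breaks are made up for this illustration, and the 0.5-unit gap is just the judgment call described above, not a rule.

```r
# Smallest observed value; the axis break has to sit below this.
y_min <- min(iris$Sepal.Length)        # 4.3 for the built-in iris data

# Assumed choices: the lowest real tick is the integer at or below the
# minimum, and 0.5 units are reserved for the break.
first_tick <- floor(y_min)
break_len  <- 0.5
fake_zero  <- first_tick - break_len

# Tick positions to pass to scale_y_continuous(breaks = ...)
last_tick <- floor(max(iris$Sepal.Length))
y_breaks  <- c(fake_zero, first_tick:last_tick)
y_breaks
```

For iris this reproduces the hand-picked c(3.5, 4:7), but for other data you would still want to eyeball the result.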
ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() +
theme_classic() +
scale_y_continuous(limits = c(3.5,NA), expand = c(0,0),
breaks = c(3.5, 4:7), labels = c(0, 4:7)) +
theme(axis.line.y = element_blank()) +
annotate(geom = "segment", x = -Inf, xend = -Inf, y = -Inf, yend = Inf)
Now we need to relabel the 3.5 tick to be a fake zero with labels = c(0, 4:7).
ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() +
theme_classic() +
scale_y_continuous(limits = c(3.5,NA), expand = c(0,0),
breaks = c(3.5, 4:7), labels = c(0, 4:7)) +
theme(axis.line.y = element_blank()) +
annotate(geom = "segment", x = -Inf, xend = -Inf, y = -Inf, yend = Inf) +
annotate(geom = "segment", x = -Inf, xend = -Inf, y = 3.5, yend = 4,
linetype = "dashed", color = "white")
Now we draw a white dashed line over the manually-drawn axis line, going from our fake zero (y=3.5) to the lowest true tick mark (y=4). The white dashes cover parts of the black axis line, so that stretch of the axis reads as broken.
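If you need this on more than one plot, the steps above could be bundled up. This is only a sketch: broken_y_axis and its arguments are hypothetical names, not part of ggplot2. It relies on the fact that a list of scales, themes and layers can be added to a plot with +, and it assumes a white plot background so the white dashes read as gaps.

```r
library(ggplot2)

# Hypothetical helper (not part of ggplot2): returns the scale, theme and
# annotation steps as a list that can be added to a plot with `+`.
broken_y_axis <- function(fake_zero, first_tick, last_tick) {
  ticks <- first_tick:last_tick
  list(
    # Start the axis at the fake zero and relabel that tick as 0
    scale_y_continuous(limits = c(fake_zero, NA), expand = c(0, 0),
                       breaks = c(fake_zero, ticks),
                       labels = c(0, ticks)),
    # Drop the automatic axis line and redraw it by hand
    theme(axis.line.y = element_blank()),
    annotate(geom = "segment", x = -Inf, xend = -Inf, y = -Inf, yend = Inf),
    # White dashed overlay between the fake zero and the first real tick
    annotate(geom = "segment", x = -Inf, xend = -Inf,
             y = fake_zero, yend = first_tick,
             linetype = "dashed", color = "white")
  )
}

p <- ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() +
  theme_classic() +
  broken_y_axis(fake_zero = 3.5, first_tick = 4, last_tick = 7)
p
```

Add it after the complete theme (here theme_classic()), otherwise the theme will put the automatic axis line back.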
Consider that the grammar of graphics is a mature philosophy; that is to say, each element has thoughtful reasoning behind it. The fact that this is finicky to do is for good reasons, and you need to consider whether your own reasons are sufficient weight on the other side.