Remove outliers fully from multiple boxplots made with ggplot2 in R and display the boxplots in expanded format

痞子三分冷 submitted on 2020-01-11 15:38:11

Question


I have some data here [in a .txt file] which I read into a data frame df,

df <- read.table("data.txt", header=T,sep="\t")

I remove the negative values in the column x (since I need only positive values) of the df using the following code,

yp <- subset(df, x>0)

Now I want to plot multiple box plots in the same layer. I first melt the data frame df; the resulting plot contains several outliers, as shown below.

library(ggplot2)
library(reshape2)  # melt() (assumed to come from reshape2)
library(scales)    # trans_breaks(), trans_format(), math_format()
library(grid)      # unit()

# Melt the data frame df into long format
df_mlt <- melt(df, id=names(df)[1])
# Plot the boxplots
plt_wool <- ggplot(subset(df_mlt, value > 0), aes(x=ID1,y=value)) + 
  geom_boxplot(aes(color=factor(ID1))) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x))) +
  theme_bw() +
  theme(legend.text=element_text(size=14), legend.title=element_text(size=14))+
  theme(axis.text=element_text(size=20)) +
  theme(axis.title=element_text(size=20,face="bold")) +
  labs(x = "x", y = "y",colour="legend" ) +
  annotation_logticks(sides = "rl") +
  theme(panel.grid.minor = element_blank()) +
  guides(title.hjust=0.5) +
  theme(plot.margin=unit(c(0,1,0,0),"mm"))
plt_wool

Next I need a plot without any outliers. To get one, I first compute the lower and upper whisker bounds with the following code, as suggested here:

sts <- boxplot.stats(yp$x)$stats

To remove the outliers, I add the upper and lower whisker limits as below:

p1 = plt_wool + coord_cartesian(ylim = c(sts*1.05,sts/1.05))

The resulting plot is shown below. The line of code above removes most of the top outliers, but all the bottom outliers remain. Could someone please suggest how to remove all the outliers completely from this plot? Thanks.


Answer 1:


A minimal reproducible example:

library(ggplot2)
p <- ggplot(mtcars, aes(factor(cyl), mpg))
p + geom_boxplot()

Not plotting outliers:

p + geom_boxplot(outlier.shape=NA)
#Warning message:
#Removed 3 rows containing missing values (geom_point).

(I prefer to get this warning, because a year from now with a long script it would remind me that I did something special there. If you want to avoid it use Sven's solution.)




Answer 2:


Based on suggestions by @Sven Hohenstein, @Roland and @lukeA, I have solved the problem of displaying multiple boxplots in expanded form without outliers.

First, plot the box plots without outliers by using outlier.colour=NA in geom_boxplot():

plt_wool <- ggplot(subset(df_mlt, value > 0), aes(x=ID1,y=value)) + 
  geom_boxplot(aes(color=factor(ID1)),outlier.colour = NA) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x))) +
  theme_bw() +
  theme(legend.text=element_text(size=14), legend.title=element_text(size=14))+
  theme(axis.text=element_text(size=20)) +
  theme(axis.title=element_text(size=20,face="bold")) +
  labs(x = "x", y = "y",colour="legend" ) +
  annotation_logticks(sides = "rl") +
  theme(panel.grid.minor = element_blank()) +
  guides(title.hjust=0.5) +
  theme(plot.margin=unit(c(0,1,0,0),"mm"))

Then compute the lower and upper whiskers using boxplot.stats(), as in the code below. Since I only take positive values into account, I select them with the condition in subset().

yp <- subset(df, x>0)             # Choosing only +ve values in col x
sts <- boxplot.stats(yp$x)$stats  # Compute lower and upper whisker limits
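One caveat worth checking here: ggplot2 applies scale transformations before it computes statistics, so with scale_y_log10() the boxes and whiskers are computed on log10 values, and limits taken from the raw values need not match the whiskers that are drawn. The sketch below uses made-up log-normal data (an assumption, standing in for column x) and base R only to compare the two sets of numbers.

```r
# Hypothetical positive, right-skewed data standing in for the asker's column x.
set.seed(1)
x <- rlnorm(200, meanlog = 2, sdlog = 1)

# Five-number summary computed on the raw values.
raw_sts <- boxplot.stats(x)$stats

# Five-number summary computed on log10 values, then back-transformed,
# which mimics what a boxplot on a log10 y scale is based on.
log_sts <- 10^boxplot.stats(log10(x))$stats

# Rows: raw vs. log10-based; columns: lower whisker, lower hinge,
# median, upper hinge, upper whisker.
print(rbind(raw = raw_sts, log10 = log_sts))
```

If the two rows differ noticeably for your data, limits derived from the log10-based row are likely the better fit for a plot that uses scale_y_log10().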

Now, to get a fully expanded view of the multiple boxplots, modify the y-axis limits of the plot inside coord_cartesian() as below:

p1 = plt_wool + coord_cartesian(ylim = c(sts[2]/2,max(sts)*1.05))

Note: The y limits should be adjusted for each specific case. Here ymin is half of sts[2], which is the lower hinge (first quartile) rather than the lower whisker (sts[1]).
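To make the choice of limits less ad hoc, the whisker ends can be computed per group and their overall range passed to coord_cartesian(). This is a minimal sketch on the built-in mtcars data (an assumption, not the asker's data); the names per_group and ylim_all are made up for illustration, and the plotting part only runs if ggplot2 is installed.

```r
# boxplot.stats()$stats returns five numbers in this order:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
sts <- boxplot.stats(mtcars$mpg)$stats
names(sts) <- c("lower_whisker", "lower_hinge", "median",
                "upper_hinge", "upper_whisker")
print(sts)

# The same five numbers per group: a 5-row matrix, one column per level of cyl.
per_group <- sapply(split(mtcars$mpg, mtcars$cyl),
                    function(v) boxplot.stats(v)$stats)

# y limits from the lowest lower whisker (row 1) to the
# highest upper whisker (row 5) across all groups.
ylim_all <- range(per_group[c(1, 5), ])

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p1 <- ggplot(mtcars, aes(factor(cyl), mpg)) +
    geom_boxplot(outlier.shape = NA) +   # do not draw outlier points
    coord_cartesian(ylim = ylim_all)     # zoom without dropping data
  print(p1)
}
```

Because coord_cartesian() only zooms, the boxes are still computed from all the data; only the visible range changes.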

The resulting plot is below,




Answer 3:


You can make the outliers invisible with the argument outlier.colour = NA:

geom_boxplot(aes(color = factor(ID1)), outlier.colour = NA)



Answer 4:


ggplot(df_mlt, aes(x = ID1, y = value)) + 
  geom_boxplot(outlier.size = NA) + 
  coord_cartesian(ylim = range(boxplot(df_mlt$value, plot=FALSE)$stats)*c(.9, 1.1))



Answer 5:


Another way to exclude outliers is to compute the thresholds beyond which you consider a point an outlier, then set the y-axis limits to those thresholds.

For example, if your upper and lower limits are Q3 + 1.5 IQR and Q1 - 1.5 IQR, then you may use:

upper.limit <- quantile(x)[4] + 1.5*IQR(x)
lower.limit <- quantile(x)[2] - 1.5*IQR(x)

Then put limits on the y-axis range:

p + coord_cartesian(ylim=c(lower.limit, upper.limit))  # p is your existing ggplot object
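A self-contained version of this idea might look like the sketch below. It uses mtcars$mpg as stand-in data (an assumption) and keeps the names upper.limit and lower.limit from above; q, iqr and outliers are illustrative names. Note that these fences can differ slightly from what boxplot.stats() reports, since that function works from hinges and places each whisker at the most extreme data point inside the fence.

```r
x <- mtcars$mpg  # stand-in data; replace with your own numeric vector

# First and third quartiles and the interquartile range.
q   <- quantile(x, probs = c(0.25, 0.75))
iqr <- IQR(x)

# Fences at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
lower.limit <- unname(q[1] - 1.5 * iqr)
upper.limit <- unname(q[2] + 1.5 * iqr)

# Points outside the fences are the ones the zoom would hide.
outliers <- x[x < lower.limit | x > upper.limit]
print(c(lower = lower.limit, upper = upper.limit))
print(outliers)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mtcars, aes(x = "", y = mpg)) +
    geom_boxplot(outlier.shape = NA) +
    coord_cartesian(ylim = c(lower.limit, upper.limit))
  print(p)
}
```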


Source: https://stackoverflow.com/questions/21533158/remove-outliers-fully-from-multiple-boxplots-made-with-ggplot2-in-r-and-display
