问题
I would like to add shaded standard error curves to geom_density
using ggplot2
. My code looks like this:
data.plot <- data.frame(x = c(rnorm(100, mean = 0, sd = 5),
rnorm(100, mean = 1, sd =2 )),
g = factor(c(rep(1, 100),
rep(2,100))))
ggplot(data.plot, aes(x, linetype = g)) + geom_density()
I couldn't find a tutorial or examples to do so. Thank you.
回答1:
The best solution is with bootstrapping, as mentioned in the comments. I'll use the classic iris data, focusing on the density of Sepal.Length for each Species.
Generate samples
library(dplyr)
library(broom)
library(tidyr)
library(ggplot2)
data_frame(bs = 1:1000) %>% group_by(bs) %>%
mutate(data = list(iris %>% group_by(Species) %>%
sample_frac(size = 1, replace = T)))
# A tibble: 1,000 x 2 # Groups: bs [1,000] bs data <int> <list> 1 1 <tibble [150 x 5]> 2 2 <tibble [150 x 5]> 3 3 <tibble [150 x 5]> 4 4 <tibble [150 x 5]> 5 5 <tibble [150 x 5]> 6 6 <tibble [150 x 5]> 7 7 <tibble [150 x 5]> 8 8 <tibble [150 x 5]> 9 9 <tibble [150 x 5]> 10 10 <tibble [150 x 5]> # ... with 990 more rows
So I just made 1000 bootstrap replicates of the original data, taking the same number of rows out of each group as its original sample size, with replacement. Now I have to unnest
to access the data in the nested data column.
Compute within-sample density
densities.within <-
data_frame(bs = 1:1000) %>% group_by(bs) %>%
mutate(data = list(iris %>% group_by(Species) %>%
sample_frac(size = 1, replace = T))) %>%
unnest() %>%
group_by(bs, Species) %>%
do(tidy(density(.$Sepal.Length,
from = min(iris$Sepal.Length),
to = max(iris$Sepal.Length),
n = 128)))
# A tibble: 384,000 x 4 # Groups: bs, Species [30,000] bs Species x y <int> <fctr> <dbl> <dbl> 1 1 setosa 4.300000 0.2395786 2 1 setosa 4.328346 0.2821128 3 1 setosa 4.356693 0.3235939 4 1 setosa 4.385039 0.3632449 5 1 setosa 4.413386 0.4010378 6 1 setosa 4.441732 0.4375189 7 1 setosa 4.470079 0.4734727 8 1 setosa 4.498425 0.5095333 9 1 setosa 4.526772 0.5459280 10 1 setosa 4.555118 0.5824587 # ... with 383,990 more rows
So we expanded the data to its long form, then took the density of each group's Sepal.Length within Species within bs. We had to supply a manual from =
and to =
since the min and max within each bootstrap might differ (and set a lower n =
than the default 512 just to save time).
To simplify the S3: density
object that's generated, we use broom::tidy
. This is the computationally intensive step, so we'll save this object as densities.within.
Summarize densities into quantiles
This results in columns named x and y, but we'll rename these to match our data. Then we'll figure out: for the densities calculated at each computed possible Sepal.Length, what is the lower end of the CI, the median, and the upper end of the CI? We'll use quantile
to get these specific values of the computed densities.
densities.qtiles <-
densities.within %>%
rename(Sepal.Length = x, dens = y) %>%
ungroup() %>%
group_by(Species, Sepal.Length) %>%
summarise(q05 = quantile(dens, 0.025),
q50 = quantile(dens, 0.5),
q95 = quantile(dens, 0.975))
# A tibble: 384 x 5 # Groups: Species [?] Species Sepal.Length q05 q50 q95 <fctr> <dbl> <dbl> <dbl> <dbl> 1 setosa 4.300000 0.05730022 0.2355335 0.4426299 2 setosa 4.328346 0.08177850 0.2734463 0.4970097 3 setosa 4.356693 0.09863062 0.3114570 0.5505578 4 setosa 4.385039 0.12459033 0.3430645 0.5884523 5 setosa 4.413386 0.15049699 0.3705389 0.6207344 6 setosa 4.441732 0.17494889 0.4006335 0.6418923 7 setosa 4.470079 0.19836510 0.4258006 0.6655006 8 setosa 4.498425 0.21106857 0.4555755 0.6971370 9 setosa 4.526772 0.23399070 0.4813130 0.7244413 10 setosa 4.555118 0.24863090 0.5108057 0.7708114 # ... with 374 more rows
Visualize and compare
ggplot(densities.qtiles, aes(Sepal.Length, q50)) +
facet_wrap(~Species, nrow = 2) +
geom_histogram(data = iris,
aes(Sepal.Length, ..density..),
colour = "black", fill = "white",
binwidth = 0.25, boundary = 0) +
geom_ribbon(aes(ymin = q05, ymax = q95), alpha = 0.5, fill = "grey50") +
stat_density(data = iris,
aes(Sepal.Length, ..density.., color = "raw density"),
size = 2, geom = "line") +
geom_line(size = 1.5, aes(color = "bootstrapped")) +
scale_color_manual(values = c("red", "black")) +
labs(y = "density") +
theme(legend.position = c(0.5,0),
legend.justification = c(0,0))
I included histogram and density layers for the original data as well for comparison. You can see that the median and raw densities are very close with 1000 bootstrap samples.
来源:https://stackoverflow.com/questions/45908120/add-shaded-standard-error-curves-to-geom-density-in-ggplot2