R: ECDF over histogram

℡╲_俬逩灬. Submitted on 2019-12-03 16:35:53
symbolrush

Also a bit late, but here is another solution that extends @Christoph's solution with a second y-axis.

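par(mar = c(5, 5, 2, 5))  # widen the right margin for the second axis
set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
  dt,
  breaks = seq(0, 100, 1),
  xlim = c(0, 100))

par(new = TRUE)  # draw the next plot on top of the histogram

ec <- ecdf(dt)
# Empty plot with the same x and y limits as the histogram, so both layers line up
plot(x = h$mids, y = ec(h$mids) * max(h$counts), type = "n",
     xlim = c(0, 100), ylim = c(0, max(h$counts)),
     axes = FALSE, xlab = NA, ylab = NA)
lines(x = h$mids, y = ec(h$mids) * max(h$counts), col = 'red')
# Right-hand axis: ticks at 0%, 10%, ..., 100% of the tallest bar, labelled 0 to 1
axis(4, at = seq(from = 0, to = max(h$counts), length.out = 11), labels = seq(0, 1, 0.1), col = 'red', col.axis = 'red')
mtext(side = 4, line = 3, 'Cumulative Density', col = 'red')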

The trick is this: instead of adding a line to the existing plot, you draw a second plot on top of it, which is why par(new = TRUE) is needed. The second plot is drawn without its own axes, and the right-hand y-axis is added afterwards; otherwise it would be drawn over the y-axis on the left.
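
The right-hand axis lines up with the red curve because every ECDF value p in [0, 1] is drawn at height p * max(h$counts), and the tick for p is placed at that same height. A minimal sketch of that mapping, using plot = FALSE so nothing is drawn; to_counts and to_prob are hypothetical helper names, not part of the answer above:

```r
set.seed(15)
dt <- rnorm(500, 50, 10)

# Bin the data without drawing anything
h  <- hist(dt, breaks = seq(0, 100, 1), plot = FALSE)
ec <- ecdf(dt)

top <- max(h$counts)                 # height that an ECDF value of 1 is mapped to

# Hypothetical helpers: convert between the two y-scales
to_counts <- function(p) p * top     # ECDF value -> left-axis units
to_prob   <- function(y) y / top     # left-axis units -> ECDF value

probs <- seq(0, 1, 0.1)              # labels for the right-hand axis
ticks <- to_counts(probs)            # heights at which those labels are drawn

curve_y <- to_counts(ec(h$mids))     # y-values of the red line
```

Passing ticks as at and probs as labels to axis(4, ...) reproduces the secondary axis used above.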

Credits go here (@tim_yates Answer) and there.

vpipkt

There are two ways to go about this. One is to ignore the different scales and use relative frequency in your histogram, which results in a harder-to-read histogram. The second is to alter the scale of one element or the other.
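
To make the scale mismatch concrete, the relative frequencies can be computed from the bin counts and compared with the ECDF, which always tops out at 1. A small sketch with made-up data (the sample and all variable names are illustrative only):

```r
set.seed(1)
x <- rnorm(200)                        # illustrative data

h <- hist(x, plot = FALSE)             # bin without drawing
rel_freq <- h$counts / sum(h$counts)   # relative frequency per bin; sums to 1
ec <- ecdf(x)

max(rel_freq)                          # tallest bar, below 1
max(ec(x))                             # the ECDF always reaches 1

# Second option: shrink the curve so its maximum matches the tallest bar
scale_factor  <- max(rel_freq)
ecdf_rescaled <- ec(h$mids) * scale_factor
```

On a single shared axis the bars occupy only the bottom part of the plot, which is why the rescaling route is usually easier to read.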

I suspect this question will soon become interesting to you, particularly @hadley 's answer.

ggplot2 single scale

Here is a solution in ggplot2. I am not sure you will be satisfied with the outcome, though, because the CDF and the histogram (count or relative) are on quite different visual scales. Note that this solution expects the data in a data frame called mydata, with the variable of interest in column x.

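library(ggplot2)
set.seed(27272)
mydata <- data.frame(x = rexp(333, rate = 4) + rnorm(333))

# geom_histogram bins the continuous x; after_stat() needs ggplot2 >= 3.3.0
# (older versions use ..count.. instead)
ggplot(mydata, aes(x)) +
  geom_histogram(aes(y = after_stat(count / sum(count)))) +
  stat_ecdf(color = "red")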

base R multi scale

Here I rescale the empirical CDF so that its maximum is no longer 1 but the height of the tallest bar in the density histogram.

h  <- hist(mydata$x, freq = FALSE)
ec <- ecdf(mydata$x)
# knots(ec) are the sorted unique data values; evaluating ec there also works with ties
lines(x = knots(ec),
      y = ec(knots(ec)) * max(h$density),
      col = 'red')

You can try a ggplot approach with a second axis:

library(ggplot2)
library(dplyr)
library(tibble)

set.seed(15)
a <- rnorm(500, 50, 10)

# evaluate the ECDF on a grid of 30 equal steps across the data range
binsize <- 30
df <- tibble(x = seq(min(a), max(a), diff(range(a)) / binsize)) %>%
  mutate(Ecdf = ecdf(a)(x),
         Ecdf_scaled = Ecdf * max(a))  # max(a) is only a scale factor; sec_axis() below undoes it
# plot
ggplot() +
  geom_histogram(aes(a), bins = binsize) +
  geom_line(data = df, aes(x = x, y = Ecdf_scaled), color = 2, size = 2) +
  scale_y_continuous(name = "Count", sec.axis = sec_axis(trans = ~ . / max(a), name = "Ecdf"))

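As already pointed out, this is problematic because the plots you want to merge have such different y-scales. When the density and the ECDF happen to be of similar magnitude, as with uniform data on [0, 1], you can try

set.seed(15)
mydata <- runif(50)
hist(mydata, freq = FALSE)
lines(ecdf(mydata))

which draws the ECDF directly over the density histogram.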

Although a bit late, here is another version that works with preset bins:

set.seed(15)
dt <- rnorm(500, 50, 10)
h <- hist(
    dt,
    breaks = seq(0, 100, 1),
    xlim = c(0, 100))
ec <- ecdf(dt)
lines(x = h$mids, y = ec(h$mids) * max(h$counts), col = 'red')
lines(x = c(0, 100), y = c(1, 1) * max(h$counts), col = 'red', lty = 3)  # indicates 100%
# x-position of the bin midpoint whose ECDF value is closest to 0.9
x90 <- h$mids[which.min(abs(ec(h$mids) - 0.9))]
lines(x = c(x90, x90), y = c(0, max(h$counts)), col = 'black', lty = 3)  # indicates where 90% is reached

(Only the second y-axis is not working yet...)
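
One way the missing second axis could be added to this version is to keep drawing in the histogram's own coordinate system and place the right-hand ticks at fractions of max(h$counts). This is a sketch under that assumption, not the poster's code; the wider right margin and the axis label text are choices made here for illustration:

```r
op <- par(mar = c(5, 5, 2, 5))          # leave room for the right-hand axis
set.seed(15)
dt <- rnorm(500, 50, 10)
h  <- hist(dt, breaks = seq(0, 100, 1), xlim = c(0, 100))
ec <- ecdf(dt)
top <- max(h$counts)                    # height that an ECDF value of 1 is mapped to

# ECDF rescaled to the count axis, drawn in the histogram's coordinates
lines(h$mids, ec(h$mids) * top, col = "red")

# Ticks at 0%, 10%, ..., 100% of the tallest bar, labelled 0 to 1
at <- seq(0, top, length.out = 11)
axis(4, at = at, labels = seq(0, 1, 0.1), col = "red", col.axis = "red")
mtext("Cumulative proportion", side = 4, line = 3, col = "red")
par(op)                                 # restore the previous margins
```

Because the curve and the axis both use the same factor, no second plot is layered on top and the x-axes cannot drift apart.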
