Force R to plot histogram as probability (relative frequency)


I am having trouble plotting a histogram as a pdf (probability).

I want the sum of all the pieces to equal an area of one so it's easier to compare across datasets. For

5 answers

孤独总比滥情好

2021-01-31 06:09
To answer the request to plot probabilities rather than densities:
```
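# Compute the histogram without drawing it
h <- hist(vec, breaks = 100, plot = FALSE)
# Replace raw counts with relative frequencies (they now sum to 1)
h$counts <- h$counts / sum(h$counts)
plot(h)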
```
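As a minimal sketch of the same idea with made-up data (the seed, sample size, and axis label are assumptions for illustration only), the rescaled bar heights can be checked to sum to 1. This relies on the bins having equal width, which is what `breaks = 20` produces here:

```r
set.seed(1)                        # assumed seed, for reproducibility only
vec <- rnorm(1000)                 # stand-in data; replace with your own vector

pdf(NULL)                          # null device: nothing is written to disk
                                   # (drop this and dev.off() to see the plot)
h <- hist(vec, breaks = 20, plot = FALSE)
h$counts <- h$counts / sum(h$counts)   # counts -> relative frequencies

# freq = TRUE makes plot() draw h$counts, which now holds proportions
plot(h, freq = TRUE, ylab = "Relative frequency", main = "Histogram of vec")
invisible(dev.off())

total <- sum(h$counts)             # 1, up to floating-point error
```

Note that the bar heights here are proportions, not densities, so the total area is 1 only if the bin width happens to be 1.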
遥遥无期

2021-01-31 06:11
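The default number of breaks comes from Sturges' formula, roughly log2(N) + 1, where N is 6 million in your case, so it should suggest about 24 bins (hist() then passes that suggestion to pretty(), so the actual count can differ slightly). If you're only seeing 4 breaks, that could be because you have xlim in your call. This doesn't change the underlying histogram; it only affects which part of it is plotted. If you do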
```
h <- hist(data[,1], freq=FALSE, breaks=800)
sum(h$density * diff(h$breaks))
```
you should get a result of 1.
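
The two claims above can be checked with a small sketch. The data here are simulated (an assumption, not the asker's data), and `nclass.Sturges()` from grDevices is used to show what the default rule suggests for 6 million values:

```r
# Sturges' rule depends only on the number of observations
n_bins <- nclass.Sturges(seq_len(6e6))   # ceiling(log2(6e6) + 1)

set.seed(1)                              # assumed seed, illustration only
x <- rnorm(10000)                        # simulated stand-in data

pdf(NULL)                                # null device so no file is written
h_full <- hist(x)                        # whole range plotted
h_zoom <- hist(x, xlim = c(-1, 1))       # only part of the range plotted
invisible(dev.off())

# xlim changes the view, not the computed histogram
same_breaks <- identical(h_full$breaks, h_zoom$breaks)
same_counts <- identical(h_full$counts, h_zoom$counts)
```

Both comparisons should come out `TRUE`, since `xlim` is only handed on to the plotting step.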

The density of your data is related to its units of measurement; therefore you want to make sure that "no bin height should be above 1.0" is actually meaningful. For example, suppose we have a bunch of measurements in feet. We plot the histogram of the measurements as a density. We then convert all the measurements to inches (by multiplying by 12) and do another density-histogram. The height of the density will be 1/12th of the original even though the data is essentially the same. Similarly, you could make your bin heights all less than 1 by multiplying all your numbers by 15.

Does the value 1.0 have some kind of significance?
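
To make the feet-versus-inches point concrete, here is a sketch with made-up measurements (the uniform data and the break points are assumptions chosen so the bins line up exactly after conversion):

```r
set.seed(1)                                # assumed seed, illustration only
feet   <- runif(1000, min = 0, max = 10)   # hypothetical measurements in feet
inches <- feet * 12                        # same data, different unit

# Matching bins: 1-foot bins versus 12-inch bins
h_ft <- hist(feet,   breaks = seq(0, 10,  by = 1),  plot = FALSE)
h_in <- hist(inches, breaks = seq(0, 120, by = 12), plot = FALSE)

# Densities in inches are 1/12 of the densities in feet ...
scaled_back <- h_in$density * 12

# ... while the total area stays 1 in both units
area_ft <- sum(h_ft$density * diff(h_ft$breaks))
area_in <- sum(h_in$density * diff(h_in$breaks))
```

`scaled_back` should match `h_ft$density`, which is why a bar height above or below 1.0 says nothing on its own.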
予麋鹿

2021-01-31 06:17

I observed that, in a histogram, density = relative frequency / corresponding bin width.

Example 1 (default, equal-width bins):
```
nums = c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)
h2 = hist(nums, plot = FALSE)
rf2 = h2$counts / sum(h2$counts)   # relative frequencies
d2 = rf2 / diff(h2$breaks)         # divided by bin widths

h2$density
# [1] 0.06 0.00 0.02 0.01 0.01
d2
# [1] 0.06 0.00 0.02 0.01 0.01
```

Example 2 (unequal bin widths):
```
nums = c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)
h3 = hist(nums, plot = FALSE, breaks = c(1, 30, 40, 50))
rf3 = h3$counts / sum(h3$counts)   # relative frequencies
d3 = rf3 / diff(h3$breaks)         # divided by bin widths

h3$density
# [1] 0.02758621 0.01000000 0.01000000
d3
# [1] 0.02758621 0.01000000 0.01000000
```

耶瑟儿～

2021-01-31 06:17

R has a bug or something. If you have discrete data in a data.frame (with 1 column), and call hist(DF,freq=FALSE) on it, the relative densities will be wrong (summing to >1). This shouldn't happen as far as I can tell.

The solution is to call unlist() on the object first. This fixes the plot. (I changed the text too, data from http://www.electionstudies.org/studypages/anes_timeseries_2012/anes_timeseries_2012.htm)
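
The workaround can be sketched as follows with made-up data (the `score` column and the sampled values are hypothetical, not the ANES data). The sketch also computes both the plain sum of the densities and the area under the bars: for discrete data with bins narrower than one unit, the former can exceed 1 while the area still equals 1, which might account for the effect described above.

```r
set.seed(1)                                  # assumed seed, illustration only
# Hypothetical one-column data frame of discrete scores
DF <- data.frame(score = sample(1:7, 1000, replace = TRUE))

v <- unlist(DF, use.names = FALSE)           # plain integer vector
                                             # (DF[[1]] would work as well)
h <- hist(v, plot = FALSE)

sum_density <- sum(h$density)                     # not necessarily 1
area        <- sum(h$density * diff(h$breaks))    # this is the quantity that is 1
```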


爱一瞬间的悲伤

2021-01-31 06:28

Are you sure? This is working for me:

```
> vec <- rnorm(6000000)
> h <- hist(vec, breaks = 800, freq = FALSE)
> sum(h$density)
[1] 100
> unique(zapsmall(diff(h$breaks)))
[1] 0.01
```

Multiply the last two results (the sum of the densities and the bin width) and you get a total area of 1. Remember that the bin width is important here.

This is with

```
> sessionInfo()
R version 3.0.1 RC (2013-05-11 r62732)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_3.0.1
```
