I am having trouble plotting a histogram as a pdf (probability)
I want the sum of all the pieces to equal an area of one so it's easier to compare across datasets. For
To answer the request to plot probabilities rather than densities:
h <- hist(vec, breaks = 100, plot=FALSE)
h$counts=h$counts/sum(h$counts)
plot(h)
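A minimal sketch of the same idea on simulated data, checking that the rescaled bar heights really sum to 1. The vector below is a hypothetical stand-in for the real data, and the ylab label and the pdf(NULL) null device are my additions so the snippet also runs non-interactively:

```r
set.seed(1)
vec <- rnorm(1000)                       # hypothetical stand-in for the real data

h <- hist(vec, breaks = 20, plot = FALSE)
h$counts <- h$counts / sum(h$counts)     # bar heights become bin probabilities
total <- sum(h$counts)                   # should be 1 up to rounding

pdf(NULL)                                # null device so nothing is written to disk
plot(h, ylab = "Probability")            # equal-width bins, so plot() draws h$counts
invisible(dev.off())
```

Note that this relies on the bins having equal width; with unequal breaks, plot() draws h$density by default instead of h$counts.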
The default number of breaks is around log2(N) + 1 (Sturges' rule),
where N is 6 million in your case, so it should be about 24 (hist() treats this as a suggestion, so the actual count can differ slightly). If you're only seeing 4 breaks, that could be because you have xlim
in your call. This doesn't change the underlying histogram; it only affects which part of it is plotted. If you do
h <- hist(data[,1], freq=FALSE, breaks=800)
sum(h$density * diff(h$breaks))
you should get a result of 1.
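To illustrate the point about xlim, here is a small sketch on simulated data (x is a hypothetical stand-in, and the break count and limits are arbitrary). It compares the object returned with and without xlim and then checks the area:

```r
set.seed(1)
x <- rnorm(10000)                        # hypothetical stand-in data

pdf(NULL)                                # null device so nothing is written to disk
h_full <- hist(x, breaks = 50, freq = FALSE)
h_zoom <- hist(x, breaks = 50, freq = FALSE, xlim = c(-1, 1))
invisible(dev.off())

# xlim only changes the visible window, not the binning
same_bins <- identical(h_full$breaks, h_zoom$breaks) &&
  identical(h_full$counts, h_zoom$counts)

# total area under the density histogram
area <- sum(h_full$density * diff(h_full$breaks))
```

If same_bins is TRUE and area is 1, the histogram itself is fine and only the plotted range differs.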
The density of your data is related to its units of measurement; therefore you want to make sure that "no bin height should be above 1.0" is actually meaningful. For example, suppose we have a bunch of measurements in feet. We plot the histogram of the measurements as a density. We then convert all the measurements to inches (by multiplying by 12) and do another density-histogram. The height of the density will be 1/12th of the original even though the data is essentially the same. Similarly, you could make your bin heights all less than 1 by multiplying all your numbers by 15.
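The feet-versus-inches argument can be checked directly. This is only a sketch: the measurements are simulated, and the break points are chosen by me so that both histograms use the same bins expressed in different units:

```r
set.seed(42)
feet   <- runif(500, min = 4, max = 7)   # hypothetical measurements in feet
inches <- feet * 12                      # the same measurements in inches

breaks_ft <- seq(4, 7, by = 0.5)
h_ft <- hist(feet,   breaks = breaks_ft,      plot = FALSE)
h_in <- hist(inches, breaks = breaks_ft * 12, plot = FALSE)

# same counts per bin, but each bin is 12 times wider in inches
ratio   <- h_in$density / h_ft$density
area_ft <- sum(h_ft$density * diff(h_ft$breaks))
area_in <- sum(h_in$density * diff(h_in$breaks))
```

Every element of ratio should come out as 1/12, while both areas stay at 1, so the bar heights alone say nothing without the units.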
Does the value 1.0 have some kind of significance?
I observed that, in a histogram, density = relative frequency / corresponding bin width.
Example 1:
nums = c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)
h2 = hist(nums, plot=FALSE)
rf2 = h2$counts / sum(h2$counts)
d2 = rf2 / diff(h2$breaks)
h2$density
[1] 0.06 0.00 0.02 0.01 0.01
d2
[1] 0.06 0.00 0.02 0.01 0.01
Example 2:
nums = c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)
h3 = hist(nums, plot=FALSE, breaks=c(1,30,40,50))
rf3 = h3$counts / sum(h3$counts)
d3 = rf3 / diff(h3$breaks)
h3$density
[1] 0.02758621 0.01000000 0.01000000
d3
[1] 0.02758621 0.01000000 0.01000000
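As a follow-up sketch using the same numbers as Example 2, the relation above also explains why the bar heights need not sum to 1 while the total area does (the variable names here are my own):

```r
nums <- c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)
h <- hist(nums, breaks = c(1, 30, 40, 50), plot = FALSE)

widths     <- diff(h$breaks)             # bin widths: 29, 10, 10
height_sum <- sum(h$density)             # sum of heights, not a probability
area       <- sum(h$density * widths)    # each term is that bin's relative frequency
```

Here area is 0.8 + 0.1 + 0.1 = 1, whereas height_sum is well below 1 because the bins are wider than one unit.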
This looks like a bug in R, or at least surprising behavior. If you have discrete data in a data.frame (with 1 column) and call hist(DF,freq=FALSE) on it, the relative densities will be wrong (summing to >1). As far as I can tell, this shouldn't happen.
The solution is to call unlist() on the object first. This fixes the plot. (I changed the text too, data from http://www.electionstudies.org/studypages/anes_timeseries_2012/anes_timeseries_2012.htm)
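A minimal sketch of the unlist() step, using a made-up one-column data.frame (the column name and values are hypothetical, not the ANES data):

```r
# hypothetical discrete data in a one-column data.frame
DF <- data.frame(score = c(1, 2, 2, 3, 3, 3, 4, 4, 5))

x <- unlist(DF)                          # flatten to a plain numeric vector
h <- hist(x, plot = FALSE)

# the total area should be 1 whatever bins hist() picks
area <- sum(h$density * diff(h$breaks))
```

Individual bar heights can still exceed 1 when the bins are narrower than one unit; it is the area, not the heights, that should total 1.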
Are you sure? This is working for me:
> vec <- rnorm(6000000)
>
> h <- hist(vec, breaks = 800, freq = FALSE)
> sum(h$density)
[1] 100
> unique(zapsmall(diff(h$breaks)))
[1] 0.01
Multiply the last two results (100 * 0.01) and you get a total probability of 1. Remember that the bin width is important here.
This is with
> sessionInfo()
R version 3.0.1 RC (2013-05-11 r62732)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.0.1