Force R to plot histogram as probability (relative frequency)

面向向阳花 2021-01-31 05:40

I am having trouble plotting a histogram as a pdf (probability)

I want the sum of all the pieces to equal an area of one so it's easier to compare across datasets. For

5 answers
  • 2021-01-31 06:09

    To answer the request to plot probabilities rather than densities:

    h <- hist(vec, breaks = 100, plot = FALSE)
    h$counts <- h$counts / sum(h$counts)
    plot(h)
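
    A self-contained variant of the snippet above, as a sketch: `vec` here is stand-in data from rnorm(), and the axis label and explicit `freq = TRUE` are additions that are not part of the original answer. By default plot() draws `counts` from a histogram object only when the bins are equally wide, which is the case here.

    ```r
    set.seed(1)
    vec <- rnorm(1000)                       # hypothetical sample data

    h <- hist(vec, breaks = 100, plot = FALSE)
    h$counts <- h$counts / sum(h$counts)     # counts now hold relative frequencies

    total <- sum(h$counts)                   # should be 1 up to rounding

    # freq = TRUE asks plot() to draw h$counts rather than h$density
    plot(h, freq = TRUE, ylab = "Relative frequency",
         main = "Histogram of vec (relative frequency)")
    ```

    With unequal bin widths, bar heights taken from relative frequencies no longer give areas that sum to 1, so the density scale is usually the safer choice there.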
    
  • 2021-01-31 06:11

    The default number of bins comes from Sturges' rule, roughly log2(N) + 1; with N of about 6 million that works out to around 24 (R then rounds the breaks to "pretty" values, so the actual count can differ). If you're only seeing 4 breaks, that could be because you have xlim in your call. xlim doesn't change the underlying histogram; it only affects which part of it is plotted. If you do

    h <- hist(data[,1], freq=FALSE, breaks=800)
    sum(h$density * diff(h$breaks))
    

    you should get a result of 1.
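
    The two points above (the default bin count and the effect of xlim) can be checked with a small sketch. The data `x` is a made-up stand-in for `data[,1]`, and `breaks = 50` and the xlim window are arbitrary choices for illustration.

    ```r
    set.seed(7)
    x <- rnorm(10000)                        # stand-in for data[, 1]

    # Sturges' rule: bin count suggested for 6 million observations
    sturges_bins <- ceiling(log2(6e6) + 1)

    h_full <- hist(x, breaks = 50, plot = FALSE)
    h_zoom <- hist(x, breaks = 50, freq = FALSE, xlim = c(-1, 1))

    # xlim only restricts the plotted window; bins and densities are unchanged
    same_breaks  <- identical(h_full$breaks, h_zoom$breaks)
    same_density <- identical(h_full$density, h_zoom$density)

    # Total area under the density histogram
    area <- sum(h_zoom$density * diff(h_zoom$breaks))
    ```

    Both comparisons should come out TRUE, and `area` should be 1 up to floating-point rounding.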


    The density of your data is related to its units of measurement; therefore you want to make sure that "no bin height should be above 1.0" is actually meaningful. For example, suppose we have a bunch of measurements in feet. We plot the histogram of the measurements as a density. We then convert all the measurements to inches (by multiplying by 12) and do another density-histogram. The height of the density will be 1/12th of the original even though the data is essentially the same. Similarly, you could make your bin heights all less than 1 by multiplying all your numbers by 15.

    Does the value 1.0 have some kind of significance?
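
    The feet-versus-inches argument can be checked numerically. This is a sketch with made-up data: `feet`, the 0 to 10 range, and the one-foot bins are all assumptions chosen for illustration.

    ```r
    set.seed(42)
    feet   <- runif(1000, min = 0, max = 10)   # hypothetical measurements in feet
    inches <- feet * 12                        # same measurements in inches

    breaks_ft <- seq(0, 10, by = 1)
    h_ft <- hist(feet,   breaks = breaks_ft,      plot = FALSE)
    h_in <- hist(inches, breaks = breaks_ft * 12, plot = FALSE)

    # Same observations per bin, but density heights shrink by a factor of 12
    heights_scaled <- isTRUE(all.equal(h_in$density, h_ft$density / 12))

    # Either way the total area stays 1
    area_ft <- sum(h_ft$density * diff(h_ft$breaks))
    area_in <- sum(h_in$density * diff(h_in$breaks))
    ```

    The bar heights depend on the units, while the area does not, so a height above 1.0 is not by itself a sign of a problem.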

  • 2021-01-31 06:17

    I observed that in a histogram, density = relative frequency / corresponding bin width.

    Example 1:

    > nums <- c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)
    > h2 <- hist(nums, plot = FALSE)
    > rf2 <- h2$counts / sum(h2$counts)
    > d2 <- rf2 / diff(h2$breaks)
    > h2$density
    [1] 0.06 0.00 0.02 0.01 0.01
    > d2
    [1] 0.06 0.00 0.02 0.01 0.01

    Example 2:

    > nums <- c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)
    > h3 <- hist(nums, plot = FALSE, breaks = c(1, 30, 40, 50))
    > rf3 <- h3$counts / sum(h3$counts)
    > d3 <- rf3 / diff(h3$breaks)
    > h3$density
    [1] 0.02758621 0.01000000 0.01000000
    > d3
    [1] 0.02758621 0.01000000 0.01000000

  • 2021-01-31 06:17

    R has a bug or something. If you have discrete data in a data.frame (with 1 column), and call hist(DF,freq=FALSE) on it, the relative densities will be wrong (summing to >1). This shouldn't happen as far as I can tell.

    The solution is to call unlist() on the object first. This fixes the plot. (I changed the text too; data from http://www.electionstudies.org/studypages/anes_timeseries_2012/anes_timeseries_2012.htm)
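
    A minimal sketch of the unlist() approach. `DF`, its `score` column, and the half-unit breaks are made up for illustration and are not the election-study data. With bins narrower than one unit, the bar heights add up to more than 1 even though the total area is still 1, which may be part of what looked wrong.

    ```r
    # Hypothetical one-column data frame of discrete responses
    DF <- data.frame(score = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5))

    x <- unlist(DF)                          # flatten to a plain numeric vector
    h <- hist(x, breaks = seq(0.5, 5.5, by = 0.5), freq = FALSE)

    height_sum <- sum(h$density)                     # sum of bar heights
    area       <- sum(h$density * diff(h$breaks))    # heights times bin widths
    ```

    For a single column, `DF[["score"]]` or `DF$score` extracts the same values without the names that unlist() attaches.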

  • 2021-01-31 06:28

    Are you sure? This is working for me:

    > vec <- rnorm(6000000)
    > 
    > h <- hist(vec, breaks = 800, freq = FALSE)
    > sum(h$density)
    [1] 100
    > unique(zapsmall(diff(h$breaks)))
    [1] 0.01
    

    Multiply the last two results (the sum of the densities, 100, and the bin width, 0.01) and you get a total area of 1. Remember that the bin width is important here.

    This is with

    > sessionInfo()
    R version 3.0.1 RC (2013-05-11 r62732)
    Platform: x86_64-unknown-linux-gnu (64-bit)
    
    locale:
     [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
     [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
     [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
     [7] LC_PAPER=C                 LC_NAME=C                 
     [9] LC_ADDRESS=C               LC_TELEPHONE=C            
    [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    loaded via a namespace (and not attached):
    [1] tools_3.0.1
    