pylab.hist(data, normed=1). Normalization seems to work incorrectly

感情败类 2020-12-01 07:40

I'm trying to create a histogram with the argument normed=1.

For instance:

import pylab

data = [1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1]
pylab.hist(data, normed=1)

7 answers
  • 2020-12-01 07:54

    I think you are confusing bin heights with bin contents. You need to add up the contents of each bin, i.e. height * width, over all bins. That sum should equal 1.
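
    A minimal sketch of that check, using the data from the question. It assumes NumPy is available and uses density=True, the keyword that replaced normed in later releases; the variable names are illustrative only.

```python
import numpy as np

# Data from the question.
data = [1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1]

# density=True returns bar heights (densities), not probabilities.
heights, edges = np.histogram(data, bins=10, density=True)
widths = np.diff(edges)

# Sum of the heights alone: not 1 here, because the bins are 0.41 wide.
sum_of_heights = heights.sum()

# Sum of height * width over all bins: the area under the histogram.
area = np.sum(heights * widths)

print(sum_of_heights)
print(area)
```

    Only the second number is expected to come out as 1 (up to rounding).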

  • 2020-12-01 08:04

    See my other post for how to make the sum of all bins in a histogram equal to one: https://stackoverflow.com/a/16399202/1542814

    Copy & Paste:

    import numpy as np
    import matplotlib.pyplot as plt
    weights = np.ones_like(myarray) / float(len(myarray))
    plt.hist(myarray, weights=weights)
    

    where myarray contains your data
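
    To see what these weights do without drawing anything, the sketch below passes them to np.histogram, which takes a weights argument with the same meaning. Each sample then contributes 1/N instead of 1, so the bin values are fractions of the data set. The data are taken from the question and are only an illustration.

```python
import numpy as np

myarray = np.array([1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1])

# Each sample gets weight 1/N instead of 1.
weights = np.ones_like(myarray) / float(len(myarray))

# Bin values are now the fraction of samples falling in each bin.
fractions, edges = np.histogram(myarray, bins=10, weights=weights)

print(fractions)
print(fractions.sum())
```

    Here the bin values themselves add up to one, regardless of the bin width.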

  • 2020-12-01 08:04

    I had the same problem, and while solving it another one came up: how to plot the normalised bin frequencies as percentages, with ticks on rounded values. I'm posting it here in case it is useful to anyone. In my example I chose 10% (0.1) as the maximum value for the y axis, and 10 steps (one from 0% to 1%, one from 1% to 2%, and so on). The trick is to set the ticks at the data counts (the values in the output list n of plt.hist), which are then transformed into percentages using the FuncFormatter class. Here's what I did:

    import matplotlib.pyplot as plt
    from matplotlib.ticker import FuncFormatter

    # `data` is assumed to be a sequence holding your samples
    fig, ax = plt.subplots()

    # The required parameters
    num_steps = 10
    max_percentage = 0.1
    num_bins = 40

    # Calculate the maximum value on the y axis and the yticks (in raw counts)
    max_val = max_percentage * len(data)
    step_size = max_val / num_steps
    yticks = [x * step_size for x in range(num_steps + 1)]
    ax.set_yticks(yticks)
    plt.ylim(0, max_val)

    # Run the histogram method
    n, bins, patches = plt.hist(data, num_bins)

    # Label the counts on the y axis as percentages of all samples
    def to_percentage(y, pos):
        return str(round(y / float(len(data)) * 100.0, 2)) + '%'

    plt.gca().yaxis.set_major_formatter(FuncFormatter(to_percentage))

    plt.show()
    

    Plots

    [Figure: before normalisation. The y axis unit is the number of samples within the bin intervals on the x axis.]

    [Figure: after normalisation. The y axis unit is the frequency of the bin values as a percentage of all the samples.]

  • 2020-12-01 08:08

    Your expectations are wrong.

    The sum of each bin's height times its width equals one. Or, as you said correctly, the integral has to be one, not the function you are integrating.

    It's like this: a probability (as in "the probability that the person is between 20 and 40 years old is ...%") is the integral ("from 20 to 40 years old") over the probability density. The bin height shows the probability density, whereas width times height shows the probability that a given point falls in this bin (you integrate the function, assumed constant at the height of the bin, from the beginning of the bin to its end). The height itself is a density and not a probability. It is a probability per unit width, which can of course be higher than one.

    Simple example: imagine a probability density function on the interval from 0 to 1 that has the value 0 from 0 to 0.9. What could the function possibly be between 0.9 and 1? Try integrating it: for the total area to be 1, the function has to be higher than 1 there (10 on average).
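
    The thought experiment can be checked numerically. The sketch below assumes samples that all lie between 0.9 and 1 (evenly spaced values chosen purely for illustration) and bins them over the range 0 to 1 with NumPy.

```python
import numpy as np

# Illustrative samples, all strictly inside (0.9, 1.0).
samples = np.linspace(0.91, 0.99, 1000)

# Ten bins of width 0.1 covering [0, 1].
heights, edges = np.histogram(samples, bins=10, range=(0.0, 1.0), density=True)

# Area under the histogram: height * width summed over all bins.
area = np.sum(heights * np.diff(edges))

print(heights[-1])  # density in the last bin, about 10
print(area)         # total area, about 1
```

    The last bar is far taller than 1, yet the area under the histogram is still 1.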

    Btw: from a rough guess, the sum of height times width of your hist seems to yield roughly 1, doesn't it?

  • 2020-12-01 08:09

    According to the documentation for normed: "If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function." This is from the numpy doc, but it should be the same for pylab.

    In []: data= array([1,1,2,3,3,3,3,3,4,5.1])
    In []: counts, bins= histogram(data, normed= True)
    In []: counts
    Out[]: array([ 0.488,  0.,  0.244,  0.,  1.22,  0.,  0.,  0.244,  0.,  0.244])
    In []: sum(counts* diff(bins))
    Out[]: 0.99999999999999989
    

    So the normalization is simply done, according to the documentation, like this:

    In []: counts, bins= histogram(data, normed= False)
    In []: counts
    Out[]: array([2, 0, 1, 0, 5, 0, 0, 1, 0, 1])
    In []: counts_n= counts/ sum(counts* diff(bins))
    In []: counts_n
    Out[]: array([ 0.488,  0.,  0.244,  0.,  1.22 ,  0.,  0.,  0.244,  0.,  0.244])
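
    The same comparison as a standalone script, for readers on a NumPy release that no longer accepts normed. This is a sketch that swaps in density=True and repeats the manual normalization from the session above; the name density for the second result is illustrative.

```python
import numpy as np

data = np.array([1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1])

# Raw counts per bin (10 equal-width bins by default).
counts, bins = np.histogram(data)

# Manual normalization: divide by the total area of the unnormalized bars.
counts_n = counts / np.sum(counts * np.diff(bins))

# Built-in normalization.
density, _ = np.histogram(data, density=True)

print(counts)
print(np.allclose(counts_n, density))
```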
    
  • 2020-12-01 08:10

    There is also numpy.histogram. If you set density=True, the output will be normalized.

    normed : bool, optional

    This keyword is deprecated in Numpy 1.6 due to confusing/buggy behavior. It will be removed in Numpy 2.0. Use the density keyword instead. If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that this latter behavior is known to be buggy with unequal bin widths; use density instead.

    density : bool, optional

    If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. Overrides the normed keyword if given.
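
    The note about unity-width bins can be illustrated with the data from the question. The bin edges below are assumptions picked for the example: one set with width 1 and one with width 0.5 over the same range.

```python
import numpy as np

data = [1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1]

# Bins of width 1: each height equals the fraction of samples in the bin.
unit_edges = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
unit_heights, _ = np.histogram(data, bins=unit_edges, density=True)

# Bins of width 0.5 over the same range: heights are fraction / 0.5.
half_edges = np.arange(0.5, 6.0, 0.5)
half_heights, _ = np.histogram(data, bins=half_edges, density=True)

print(unit_heights.sum())  # about 1
print(half_heights.sum())  # about 2
```

    In both cases the heights multiplied by the bin width sum to 1; only with width 1 do the heights alone do so.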
