pylab.hist(data, normed=1). Normalization seems to work incorrect

后端未结

关注

 7  1264

I\'m trying to create a histogram with argument normed=1

For instance:

import pylab

data = ([1,1,2,3,3,3,3,3,4,5.1])    
pylab.hist(data, normed=1)


                      
              相关标签:


      
      
        
          7条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  耶瑟儿～        
                
              
                            
                2020-12-01 07:54
              
            
            
                                                                       
I think you are confusing bin heights with bin contents. You need to add the contents of each bin, i.e. height*width for all bins. That should = 1.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  故里飘歌        
                
              
                            
                2020-12-01 08:04
              
            
            
                                                                       
See my other post for how to make the sum of all bins in a histogram equal to one:
https://stackoverflow.com/a/16399202/1542814

Copy & Paste:

weights = np.ones_like(myarray)/float(len(myarray))
plt.hist(myarray, weights=weights)


where myarray contains your data
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  遇见更好的自我        
                
              
                            
                2020-12-01 08:04
              
            
            
                                                                       
I had the same problem, and while solving it another problem came up: how to plot the the normalised bin frequences as percentages with ticks on rounded values. I'm posting it here in case it is useful for anyone. In my example I chose 10% (0.1) as the maximum value for the y axis, and 10 steps (one from 0% to 1%, one from 1% to 2%, and so on). The trick is to set the ticks at the data counts (which are the output list n of the plt.hist) that will next be transformed into percentages using the FuncFormatter class. Here's what I did:

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

fig, ax = plt.subplots()

# The required parameters
num_steps = 10
max_percentage = 0.1
num_bins = 40

# Calculating the maximum value on the y axis and the yticks
max_val = max_percentage * len(data)
step_size = max_val / num_steps
yticks = [ x * step_size for x in range(0, num_steps+1) ]
ax.set_yticks( yticks )
plt.ylim(0, max_val)

# Running the histogram method
n, bins, patches = plt.hist(data, num_bins)

# To plot correct percentages in the y axis     
to_percentage = lambda y, pos: str(round( ( y / float(len(data)) ) * 100.0, 2)) + '%'
plt.gca().yaxis.set_major_formatter(FuncFormatter(to_percentage))

plt.show()


Plots

Before normalisation: the y axis unit is number of samples within the bin intervals in the x axis:


After normalisation: the y axis unit is frequency of the bin values as a percentage over all the samples

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  [愿得一人]        
                
              
                            
                2020-12-01 08:08
              
            
            
                                                                       
Your expectations are wrong

The sum of the bins height times its width equals to one. Or, as you said correctly, the integral has to be one, not the function you are integrating about.

It's like this: probability (as in "the probability that the person is between 20 and 40 years old is ...%") is the integral ("from 20 to 40 years old") over the probability density. The bins height shows the probability density whereas the width times height shows the probability (you integrate the constant assumed function, height of bin, from beginning of bin to end of bin) for a certain point to be in this bin. The height itself is the density and not a probability. It is a probability per width which can be higher then one of course.

Simple example: imagine a probability density function from 0 to 1 that has value 0 from 0 to 0.9. What could the function possibly be between 0.9 and 1? If you integrate over it, try it out. It will be higher then 1.

Btw: from a rough guess, the sum of height times width of your hist seems to yield roughly 1, doesn't it?
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  栀梦        
                
              
                            
                2020-12-01 08:09
              
            
            
                                                                       
According to documentation normed:  If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. This is from numpy doc, but should be the same for pylab.

In []: data= array([1,1,2,3,3,3,3,3,4,5.1])
In []: counts, bins= histogram(data, normed= True)
In []: counts
Out[]: array([ 0.488,  0.,  0.244,  0.,  1.22,  0.,  0.,  0.244,  0.,  0.244])
In []: sum(counts* diff(bins))
Out[]: 0.99999999999999989


So simply normalization is done according to the documentation like:

In []: counts, bins= histogram(data, normed= False)
In []: counts
Out[]: array([2, 0, 1, 0, 5, 0, 0, 1, 0, 1])
In []: counts_n= counts/ sum(counts* diff(bins))
In []: counts_n
Out[]: array([ 0.488,  0.,  0.244,  0.,  1.22 ,  0.,  0.,  0.244,  0.,  0.244])

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  后悔当初        
                
              
                            
                2020-12-01 08:10
              
            
            
                                                                       
There is also numpy.histogram. If you set density=True, the output will be normalized.


  normed : bool, optional
  
  This keyword is deprecated in Numpy 1.6 due to confusing/buggy behavior. It will be removed in Numpy 2.0. Use the density keyword instead. If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that this latter behavior is known to be buggy with unequal bin widths; use density instead.
  
  density : bool, optional
  
  If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. Overrides the normed keyword if given.

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     1
2
下一页
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复