I'm trying to create a histogram with argument normed=1
For instance:
import pylab
data = ([1,1,2,3,3,3,3,3,4,5.1])
pylab.hist(data, normed=1)
I think you are confusing bin heights with bin contents. You need to add up the contents of each bin, i.e. height*width for all bins. That sum should equal 1.
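As a quick check of that claim, here is a minimal sketch using the data from the question. It assumes a NumPy version where np.histogram accepts density=True (the replacement for normed), and the variable names are mine:

```python
import numpy as np

data = [1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1]

# density=True returns bin heights (a probability density), not probabilities
heights, edges = np.histogram(data, bins=10, density=True)
widths = np.diff(edges)

# height * width is the content of each bin; the contents add up to 1
area = float(np.sum(heights * widths))

# the heights alone do not add up to 1, because each bin is narrower than 1
sum_of_heights = float(np.sum(heights))

print(area, sum_of_heights)
```

With ten bins of width 0.41, the heights alone add up to well over 1, while the height*width products add up to 1.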
See my other post for how to make the sum of all bins in a histogram equal to one: https://stackoverflow.com/a/16399202/1542814
Copy & Paste:
weights = np.ones_like(myarray)/float(len(myarray))
plt.hist(myarray, weights=weights)
where myarray contains your data
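To see what the weights do without drawing anything, here is a small sketch that passes the same weights to np.histogram. The sample array is the data from the question, used here only as an assumption for illustration:

```python
import numpy as np

# Example data (taken from the question); replace with your own array
myarray = np.array([1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1])

# Every sample contributes 1/N instead of 1 to its bin
weights = np.ones_like(myarray) / float(len(myarray))

# Same binning as a 10-bin histogram, but each bin now holds a fraction of the samples
fractions, edges = np.histogram(myarray, bins=10, weights=weights)
total = float(fractions.sum())

print(fractions, total)
```

Here the bin values themselves add up to 1, regardless of the bin width.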
I had the same problem, and while solving it another problem came up: how to plot the normalised bin frequencies as percentages with ticks on rounded values. I'm posting it here in case it is useful for anyone. In my example I chose 10% (0.1) as the maximum value for the y axis, and 10 steps (one from 0% to 1%, one from 1% to 2%, and so on). The trick is to set the ticks at the data counts (which are the output list n of plt.hist), which will then be transformed into percentages using the FuncFormatter class. Here's what I did:
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
fig, ax = plt.subplots()
# The required parameters (data is assumed to be your sequence of samples, defined beforehand)
num_steps = 10
max_percentage = 0.1
num_bins = 40
# Calculating the maximum value on the y axis and the yticks
max_val = max_percentage * len(data)
step_size = max_val / num_steps
yticks = [ x * step_size for x in range(0, num_steps+1) ]
ax.set_yticks( yticks )
plt.ylim(0, max_val)
# Running the histogram method
n, bins, patches = plt.hist(data, num_bins)
# To plot correct percentages in the y axis
to_percentage = lambda y, pos: str(round( ( y / float(len(data)) ) * 100.0, 2)) + '%'
plt.gca().yaxis.set_major_formatter(FuncFormatter(to_percentage))
plt.show()
Before normalisation: the y axis unit is number of samples within the bin intervals in the x axis:
After normalisation: the y axis unit is frequency of the bin values as a percentage over all the samples
Your expectations are wrong
The sum of each bin's height times its width equals one. Or, as you said correctly, the integral has to be one, not the function you are integrating.
It's like this: probability (as in "the probability that the person is between 20 and 40 years old is ...%") is the integral ("from 20 to 40 years old") over the probability density. The bin's height shows the probability density, whereas the width times the height shows the probability for a certain point to be in this bin (you integrate the function, assumed constant at the height of the bin, from the beginning of the bin to its end). The height itself is the density and not a probability. It is a probability per width, which can of course be higher than one.
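To make the difference between height and probability concrete, here is a minimal pure-Python sketch that computes the densities by hand. The helper name density_histogram is made up for illustration, equal-width bins are assumed, and the data is the list from the question:

```python
def density_histogram(values, num_bins):
    """Return (counts, heights, width) for equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value belongs to the last bin, so clamp the index
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    # Density = fraction of samples in the bin, divided by the bin width
    heights = [c / (len(values) * width) for c in counts]
    return counts, heights, width


data = [1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1]
counts, heights, width = density_histogram(data, 10)

# Probability of landing in a bin = height * width
probabilities = [h * width for h in heights]

print(max(heights))        # a density, larger than 1 here
print(sum(probabilities))  # the probabilities add up to 1 (up to rounding)
```

The tallest bin holds half of the samples, so its probability is 0.5, but because the bin is only 0.41 wide its height is about 1.22.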
Simple example: imagine a probability density function on the interval from 0 to 1 that has value 0 from 0 to 0.9. What could the function possibly be between 0.9 and 1? Try integrating over it: for the integral to be 1, the function has to be higher than 1 there.
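Worked out in a few lines, assuming the simplest case where the density is constant on the interval from 0.9 to 1 (the function name is hypothetical):

```python
def constant_density_height(lo, hi):
    """Height a constant density needs on [lo, hi] so that its integral is 1."""
    return 1.0 / (hi - lo)


lo, hi = 0.9, 1.0
height = constant_density_height(lo, hi)

# Integral of a constant function = height * width
area = height * (hi - lo)

print(height, area)
```

The height comes out at about 10, far above 1, while the area (the actual probability) is still 1.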
Btw: from a rough guess, the sum of height times width of your hist seems to yield roughly 1, doesn't it?
According to the documentation for normed: "If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function." This is from the numpy doc, but it should be the same for pylab.
In []: data= array([1,1,2,3,3,3,3,3,4,5.1])
In []: counts, bins= histogram(data, normed= True)
In []: counts
Out[]: array([ 0.488, 0., 0.244, 0., 1.22, 0., 0., 0.244, 0., 0.244])
In []: sum(counts* diff(bins))
Out[]: 0.99999999999999989
So the normalization is simply done according to the documentation, like this:
In []: counts, bins= histogram(data, normed= False)
In []: counts
Out[]: array([2, 0, 1, 0, 5, 0, 0, 1, 0, 1])
In []: counts_n= counts/ sum(counts* diff(bins))
In []: counts_n
Out[]: array([ 0.488, 0., 0.244, 0., 1.22 , 0., 0., 0.244, 0., 0.244])
There is also numpy.histogram. If you set density=True, the output will be normalized.
normed : bool, optional
This keyword is deprecated in Numpy 1.6 due to confusing/buggy behavior. It will be removed in Numpy 2.0. Use the density keyword instead. If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that this latter behavior is known to be buggy with unequal bin widths; use density instead.
density : bool, optional
If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. Overrides the normed keyword if given.
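The note about bin widths can be checked with a short sketch. The unequal bin edges below are my own choice for illustration, and the data is the list from the question:

```python
import numpy as np

data = [1, 1, 2, 3, 3, 3, 3, 3, 4, 5.1]
edges = [1, 2, 3, 6]  # bin widths 1, 1 and 3 (chosen for illustration)

counts, _ = np.histogram(data, bins=edges)
heights, _ = np.histogram(data, bins=edges, density=True)
widths = np.diff(edges)

# The integral (height * width summed over bins) is 1 ...
area = float(np.sum(heights * widths))

# ... but the plain sum of the heights is not, since the last bin is 3 wide
height_sum = float(np.sum(heights))

print(counts, heights, area, height_sum)
```

The last bin contains 7 of the 10 samples, but its height is only 7/30 because that count is spread over a width of 3.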