Plotting a histogram with overlaid PDF

清歌不尽 2021-01-21 04:19

This is a follow-up to my previous couple of questions. Here's the code I'm playing with:

import pandas as pd
import matplotlib.pyplot as plt
import scipy.stat         


        
1 answer
  • 2021-01-21 04:42

    You should plot the histogram with density=True if you hope to compare it to a true PDF. Otherwise your normalization (amplitude) will be off.

    Also, you need to specify the x-values (as an ordered array) when you plot the pdf:

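    import numpy as np
    from scipy import stats

    fig, ax = plt.subplots()

    # Drop the -999 sentinel values and NaNs once, so the histogram and the fit use the same data
    data = df2.loc[df2[column] > -999, column].dropna()
    data.hist(alpha=0.5, density=True, ax=ax)

    # Fit a normal distribution; param is (loc, scale)
    param = stats.norm.fit(data)
    x = np.linspace(data.min(), data.max(), 100)  # ordered x-values

    ax.plot(x, stats.norm.pdf(x, *param), color='r')
    plt.show()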
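
To see why density=True matters, here is a minimal sketch that checks the normalization numerically. It uses a synthetic normal sample (an assumption for illustration, not the data from the question) and np.histogram instead of the pandas plotting call, so that the bar heights can be inspected directly.

```python
import numpy as np
from scipy import stats

# Synthetic sample, purely for illustration (not the asker's data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=1000)

# Raw counts: the bar heights sum to the sample size
counts, edges = np.histogram(sample, bins=20)
widths = np.diff(edges)

# density=True rescales the heights so that the total bar area is 1
density, _ = np.histogram(sample, bins=edges, density=True)
area_counts = np.sum(counts * widths)    # sample size times the bin width
area_density = np.sum(density * widths)  # 1, like the integral of a PDF

# Fitted normal PDF at the bin centres, on the same scale as `density`
loc, scale = stats.norm.fit(sample)
centres = 0.5 * (edges[:-1] + edges[1:])
pdf_at_centres = stats.norm.pdf(centres, loc, scale)

print(f"area with raw counts:   {area_counts:.1f}")
print(f"area with density=True: {area_density:.3f}")
print(f"peak density {density.max():.3f} vs peak PDF {pdf_at_centres.max():.3f}")
```

With raw counts the bar area scales with the sample size and the bin width, so a PDF drawn on the same axes would look almost flat next to the bars. With density=True both curves share the same vertical scale.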
    


    As an aside, a histogram isn't always the best way to compare a continuous variable with a distribution. (Your sample data are discrete, but the link uses a continuous variable.) The choice of bins can distort the apparent shape of the histogram, which may lead to incorrect inference. The ECDF is often a better illustration of the distribution of a continuous variable, because it requires no binning choice:

    def ECDF(data):
        """Return the sorted values and their cumulative fractions, ignoring NaNs."""
        x = np.sort(data.dropna())
        n = len(x)
        y = np.arange(1, n + 1) / n
        return x, y

    fig, ax = plt.subplots()

    # Drop the -999 sentinel values and NaNs before plotting and fitting
    data = df2.loc[df2[column] > -999, column].dropna()
    ax.plot(*ECDF(data), marker='o')

    param = stats.norm.fit(data)
    x = np.linspace(data.min(), data.max(), 100)  # ordered x-values

    ax.plot(x, stats.norm.cdf(x, *param), color='r')
    plt.show()
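
The point about bin choice can be checked without plotting. The following is a rough sketch on a synthetic sample (an assumption for illustration, not the data from the question). The lowercase helper ecdf is a NumPy-only variant of the function above. The sketch prints where the tallest histogram bin falls for several bin counts, then computes the ECDF once and reports its largest gap to the fitted normal CDF as a rough measure of agreement.

```python
import numpy as np
from scipy import stats

def ecdf(values):
    """Return sorted non-NaN values and the fraction of points <= each one."""
    x = np.sort(values[~np.isnan(values)])
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Synthetic sample with one missing value, purely for illustration
rng = np.random.default_rng(1)
sample = np.append(rng.normal(loc=0.0, scale=1.0, size=500), np.nan)
clean = sample[~np.isnan(sample)]

# The span of the tallest histogram bin depends on the bin count ...
for bins in (5, 20, 80):
    counts, edges = np.histogram(clean, bins=bins)
    k = np.argmax(counts)
    print(f"{bins:>2} bins: tallest bin spans [{edges[k]:.2f}, {edges[k + 1]:.2f}]")

# ... whereas the ECDF is fully determined by the data
x, y = ecdf(sample)
loc, scale = stats.norm.fit(clean)
max_gap = np.max(np.abs(y - stats.norm.cdf(x, loc, scale)))
print(f"largest gap between ECDF and fitted normal CDF: {max_gap:.3f}")
```

Because the ECDF has no free parameter, two people looking at the same data always get the same curve, which is not true of a histogram.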
    
