Question
I have fitted some distributions for sample data with the following code:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import norm

samp = norm.rvs(loc=0, scale=1, size=150)  # (example) sample values

figprops = dict(figsize=(8., 7. / 1.618), dpi=128)
adjustprops = dict(left=0.1, bottom=0.1, right=0.97, top=0.93, wspace=0.2, hspace=0.2)
fig = plt.figure(**figprops)
fig.subplots_adjust(**adjustprops)
ax = fig.add_subplot(1, 1, 1)
ax.hist(samp, bins=10, density=True, alpha=0.6, color='grey', label='Data')
xmin, xmax = plt.xlim()

# Distributions.
dist_names = ['beta', 'norm', 'gumbel_l']
x = np.linspace(xmin, xmax, 100)
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(samp)
    ax.plot(x, dist(*param).pdf(x), linewidth=4, label=dist_name)

ax.legend(fontsize=14)
plt.savefig('example.png')
How do I order the distribution names in the legend from best fit (top) to worst fit automatically? I generate the random variables in a loop, so the best fit may differ from one iteration to the next.
Answer 1:
Well, you could use the Kolmogorov-Smirnov (K-S) test to compute, e.g., a p-value for each fit and sort by it.
Modifying your loop:
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(samp)
    x = np.linspace(xmin, xmax, 100)
    ax.plot(x, dist(*param).pdf(x), linewidth=4, label=dist_name)
    ks = scipy.stats.kstest(samp, dist_name, args=param)
    print((dist_name, ks))
You could get output like
('beta', KstestResult(statistic=0.033975289251035434, pvalue=0.9951529119440156))
('norm', KstestResult(statistic=0.03164417055025992, pvalue=0.9982475331007705))
('gumbel_l', KstestResult(statistic=0.113229070386386, pvalue=0.039394595923043355))
which tells you normal and beta are pretty good fits, but gumbel should come last. Sorting by either the p-value or the statistic is easy to add.
Your results may differ, depending on the RNG's initial state.
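A minimal sketch of that sorting step, reusing the sample and distribution names from the question (with a fixed seed so the run is reproducible): each fit's K-S p-value is collected first, then the curves are plotted from highest to lowest p-value, so the legend lists them best fit first.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, for saving to file only
import matplotlib.pyplot as plt
import scipy.stats

samp = scipy.stats.norm.rvs(loc=0, scale=1, size=150, random_state=1)

fig, ax = plt.subplots()
ax.hist(samp, bins=10, density=True, alpha=0.6, color='grey', label='Data')
xmin, xmax = ax.get_xlim()
x = np.linspace(xmin, xmax, 100)

# First pass: fit each candidate and record its K-S p-value.
dist_names = ['beta', 'norm', 'gumbel_l']
results = []
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(samp)
    ks = scipy.stats.kstest(samp, dist_name, args=param)
    results.append((ks.pvalue, dist_name, param))

# Second pass: plot from best fit (highest p-value) to worst,
# so the legend entries appear in that order.
for pvalue, dist_name, param in sorted(results, reverse=True):
    dist = getattr(scipy.stats, dist_name)
    ax.plot(x, dist(*param).pdf(x), linewidth=4, label=dist_name)

ax.legend(fontsize=14)
plt.savefig('example.png')
```

Sorting on `ks.statistic` (ascending) instead of `ks.pvalue` (descending) would give the same ordering in practice, since both measure the same CDF discrepancy.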
UPDATE
Concerning the claimed non-applicability of the K-S test as a goodness-of-fit estimate, I strongly disagree. I see no scientific reason NOT to use it, and I have used it myself to good effect.
Typically, you have a black box generating your random data, say, some measurements of network delays.
In general, such data could be described by a mixture of Gammas; you do your fit using some kind of quadratic utility function and get back a set of parameters.
Then you use K-S, or any other empirical-vs-theoretical-distribution method, to estimate how good the fit is. As long as the K-S method was not itself used to make the fit, it is a perfectly good way to evaluate it.
You basically have one black box generating data and another black box fitting it, and you want to know how well the fit matches the data. K-S will do that job.
And the statement "it is commonly used as a test for normality to see if your data is normally distributed" is completely off, in my humble opinion. K-S measures the maximum discrepancy between CDFs; it does not care about normality and is far more universal.
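As a small illustration of that universality, here is a hypothetical check with decidedly non-normal data: exponential samples, fitted once with the correct family and once with a normal, then compared via `scipy.stats.kstest`. The K-S machinery works identically for both; only the quality of the fit differs.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(42)
# Data from a non-normal source: an exponential distribution.
samp = scipy.stats.expon.rvs(scale=2.0, size=500, random_state=rng)

# Fit the correct family and a wrong one, then compare K-S results.
expon_params = scipy.stats.expon.fit(samp)
norm_params = scipy.stats.norm.fit(samp)

ks_expon = scipy.stats.kstest(samp, 'expon', args=expon_params)
ks_norm = scipy.stats.kstest(samp, 'norm', args=norm_params)

print(ks_expon)  # small statistic: good fit
print(ks_norm)   # large statistic: normal is a poor model here
```

The exponential fit yields a small K-S statistic and a large p-value, while the normal fit is flagged as poor, exactly the CDF-vs-CDF comparison described above, with no normality assumption anywhere.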
Source: https://stackoverflow.com/questions/61276051/evaluate-the-goodness-of-a-distributional-fits