Question
I am trying to fit distributions. The fitting is finished, but I need a measure to choose the best model. Many papers use the Kolmogorov-Smirnov (KS) test. I tried to implement it, and I am getting very low p-values.
The implementation:
# Imports assumed by this snippet
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import genextreme as gev

# Histogram plot (h1 and out_threshold1 come from my data)
binwidth = np.arange(0, int(out_threshold1), 1)
n1, bins1, patches = plt.hist(h1, bins=binwidth, density=True, facecolor='#023d6b', alpha=0.5, histtype='bar')
# x-grid for evaluating the fitted pdfs
lnspc = np.linspace(bins1.min(), bins1.max(), 1000)

# Fitting
gevfit4 = gev.fit(h1)
pdf_gev4 = gev.pdf(lnspc, *gevfit4)
plt.plot(lnspc, pdf_gev4, label="GEV")
logfit4 = stats.lognorm.fit(h1)
pdf_lognorm4 = stats.lognorm.pdf(lnspc, *logfit4)
plt.plot(lnspc, pdf_lognorm4, label="LogNormal")
weibfit4 = stats.weibull_min.fit(h1)
pdf_weib4 = stats.weibull_min.pdf(lnspc, *weibfit4)
plt.plot(lnspc, pdf_weib4, label="Weibull")
burr12fit4 = stats.burr12.fit(h1)
pdf_burr124 = stats.burr12.pdf(lnspc, *burr12fit4)
plt.plot(lnspc, pdf_burr124, label="Burr")
genparetofit4 = stats.genpareto.fit(h1)
pdf_genpareto4 = stats.genpareto.pdf(lnspc, *genparetofit4)
plt.plot(lnspc, pdf_genpareto4, label="Gen-Pareto")

# KS test: compare the sample against each fitted CDF
print(stats.kstest(h1, lambda k: stats.genpareto.cdf(k, *genparetofit4)))
print(stats.kstest(h1, lambda k: stats.lognorm.cdf(k, *logfit4)))
print(stats.kstest(h1, lambda k: gev.cdf(k, *gevfit4)))
print(stats.kstest(h1, lambda k: stats.weibull_min.cdf(k, *weibfit4)))
print(stats.kstest(h1, lambda k: stats.burr12.cdf(k, *burr12fit4)))
After this runs, I get values like:
KstestResult(statistic=0.065689774346523788, pvalue=2.3778862070128568e-20)
KstestResult(statistic=0.063434691987405312, pvalue=5.2567851875784095e-19)
KstestResult(statistic=0.065047355887551062, pvalue=5.8076254324909468e-20)
KstestResult(statistic=0.25249534411299968, pvalue=8.3670183092211739e-295)
KstestResult(statistic=0.068528435880779559, pvalue=4.1395594967775799e-22)
Are these values reasonable? Is it still possible to choose the best model? Is the best-fitting model the one with the smallest KS statistic?
EDIT:
I plotted the CDFs for two of the fitted distributions.
They seem to fit very well, but I still get those small p-values.
Answer 1:
Check the AIC criterion for each fit; the smallest AIC indicates the best fit. Judging from your KS statistics, the lognormal fits best (it has the smallest statistic). Note, though, that there are reasons why people do not recommend the KS test when the parameters were estimated from the same sample.
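The AIC suggestion can be sketched as follows. This is a minimal illustration using synthetic stand-in data (the original h1 is not available), computing AIC = 2k − 2·log L from scipy's fitted parameters:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for h1 (the asker's data is not available).
rng = np.random.default_rng(0)
h1 = stats.lognorm.rvs(0.6, scale=10.0, size=5000, random_state=rng)

def aic(dist, data):
    """AIC = 2k - 2*logL, where k counts the fitted parameters."""
    params = dist.fit(data)
    loglik = np.sum(dist.logpdf(data, *params))
    return 2 * len(params) - 2 * loglik

candidates = {"lognorm": stats.lognorm,
              "weibull_min": stats.weibull_min,
              "genpareto": stats.genpareto}
scores = {name: aic(dist, h1) for name, dist in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```

Since the stand-in sample really is lognormal, the lognormal should get the lowest AIC; with real data the ranking is whatever the likelihoods say.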
Answer 2:
The p-values from kstest assume that the parameters of the distribution are known. They are not appropriate when the parameters are estimated from the data. However, as far as I understand, the p-values should then be too large (the test becomes conservative), whereas here they are very small.
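One standard workaround for the estimated-parameters problem is a parametric bootstrap (Lilliefors-style): simulate from the fitted model, refit on each simulated sample, and compare the observed KS statistic against that null distribution. A sketch with synthetic stand-in data:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for h1; here the lognormal really is the true model.
rng = np.random.default_rng(1)
h1 = stats.lognorm.rvs(0.5, scale=8.0, size=1000, random_state=rng)

params = stats.lognorm.fit(h1)
d_obs = stats.kstest(h1, lambda x: stats.lognorm.cdf(x, *params)).statistic

# Parametric bootstrap: REFIT on each simulated sample so the null
# distribution of the KS statistic accounts for the estimation step.
n_boot = 100
d_boot = np.empty(n_boot)
for i in range(n_boot):
    sim = stats.lognorm.rvs(*params, size=len(h1), random_state=rng)
    p_sim = stats.lognorm.fit(sim)
    d_boot[i] = stats.kstest(sim, lambda x: stats.lognorm.cdf(x, *p_sim)).statistic

p_value = np.mean(d_boot >= d_obs)
print(d_obs, p_value)
```

Unlike the p-value kstest reports, this bootstrap p-value is calibrated for the fact that the parameters were fitted to the same sample.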
From the histogram plot, it looks like there are some regions that none of the distributions match well. Additionally, there might be some rounding in the data or bunching at some discrete values.
If the sample size is large enough, then even small deviations from the hypothesized distribution will result in rejection of the hypothesis that the distribution matches the data.
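This sample-size effect is easy to demonstrate with synthetic data: mix a mild second component into an otherwise lognormal sample and, at n = 50,000, the KS statistic stays small while the p-value collapses, much like the results above:

```python
import numpy as np
from scipy import stats

# 90% lognormal "bulk" plus a 10% second component: the deviation from a
# single lognormal is mild, but n = 50,000 makes the KS test reject anyway.
rng = np.random.default_rng(2)
n = 50_000
bulk = stats.lognorm.rvs(0.5, scale=10.0, size=int(n * 0.9), random_state=rng)
bump = stats.lognorm.rvs(0.2, scale=25.0, size=int(n * 0.1), random_state=rng)
h1 = np.concatenate([bulk, bump])

params = stats.lognorm.fit(h1)
res = stats.kstest(h1, lambda x: stats.lognorm.cdf(x, *params))
print(res.statistic, res.pvalue)  # modest statistic, vanishing p-value
```

A small statistic with a tiny p-value therefore signals a large sample plus minor misfit, not necessarily a uselessly bad model.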
To use the KS test as a model-selection criterion, we can simply compare the KS statistics (or p-values) and choose the candidate that matches best, in this case the lognormal. That gives the best-fitting distribution among the set tested, even though it still deviates to some extent from the "true" distribution that generated the data.
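That selection rule, pick the candidate with the smallest KS statistic, can be sketched like this (synthetic stand-in data; with a truly lognormal sample, lognorm should come out on top):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for h1.
rng = np.random.default_rng(3)
h1 = stats.lognorm.rvs(0.6, scale=10.0, size=5000, random_state=rng)

candidates = {"genpareto": stats.genpareto,
              "lognorm": stats.lognorm,
              "weibull_min": stats.weibull_min}
ks = {}
for name, dist in candidates.items():
    params = dist.fit(h1)
    # Bind dist/params as default args so each lambda keeps its own fit.
    ks[name] = stats.kstest(h1, lambda x, d=dist, p=params: d.cdf(x, *p)).statistic

best = min(ks, key=ks.get)
print(sorted(ks.items(), key=lambda kv: kv[1]))
```

The ranking by statistic is meaningful even when the absolute p-values are not, which is exactly the situation in the question.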
Source: https://stackoverflow.com/questions/51305126/kolmogorov-smirnov-test-for-the-fitting-goodness-in-python