Fitting empirical distribution to theoretical ones with Scipy (Python)?

醉话见心 2020-11-22 05:28

INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g. [0,0,0,0,...,1,1,1,1,...,2,2,2,2,...,47,47,47,...]

9 answers
  •  再見小時候
    2020-11-22 05:52

    With OpenTURNS, I would use the BIC criterion to select the distribution that best fits such data. A distribution with more parameters can more easily be fitted closely to the data, and the BIC penalizes that extra flexibility, so it does not give too much advantage to distributions with more parameters. Moreover, the Kolmogorov-Smirnov test may not make sense in this case, because a small error in the measured values will have a huge impact on the p-value.
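
    To make the trade-off concrete, here is a minimal standard-library sketch of the textbook BIC, k * ln(n) - 2 * ln(L), applied to a few distributions whose maximum-likelihood fits have closed forms. The synthetic sample, the helper names and the choice of candidate distributions are all assumptions made for illustration, and OpenTURNS may report its BIC score with a different normalization.

```python
import math
import random
import statistics


def bic(log_likelihood, n_params, n_obs):
    """Textbook BIC: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def normal_loglik(xs):
    # MLE of a normal: sample mean and population (biased) standard deviation.
    n = len(xs)
    sigma = statistics.pstdev(xs)
    return -0.5 * n * (math.log(2.0 * math.pi * sigma ** 2) + 1.0)


def laplace_loglik(xs):
    # MLE of a Laplace: median and mean absolute deviation around it.
    n = len(xs)
    med = statistics.median(xs)
    scale = sum(abs(x - med) for x in xs) / n
    return -n * (math.log(2.0 * scale) + 1.0)


def uniform_loglik(xs):
    # MLE of a uniform: the bounds are the sample minimum and maximum.
    return -len(xs) * math.log(max(xs) - min(xs))


def exponential_loglik(xs):
    # MLE of an exponential (valid for non-negative data): rate = 1 / mean.
    return -len(xs) * (math.log(statistics.fmean(xs)) + 1.0)


# Hypothetical data: a synthetic, positive, roughly bell-shaped sample.
random.seed(0)
sample = [random.gauss(24.0, 2.0) for _ in range(2000)]
n = len(sample)

# Each entry pairs the maximized log-likelihood with the parameter count.
scores = {
    "Normal": bic(normal_loglik(sample), 2, n),
    "Laplace": bic(laplace_loglik(sample), 2, n),
    "Uniform": bic(uniform_loglik(sample), 2, n),
    "Exponential": bic(exponential_loglik(sample), 1, n),
}
best_name = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:12s} BIC = {score:.1f}")
print("Best =", best_name)
```

    The exponential model has one parameter fewer, so it pays a smaller penalty, but the penalty only matters when the likelihoods are close; a badly fitting model still loses.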

    To illustrate the process, I load the El-Nino data, which contains 732 monthly temperature measurements from 1950 to 2010:

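    import statsmodels.api as sm
    # Load the El-Nino table (one row per year, one column per month)
    dta = sm.datasets.elnino.load_pandas().data
    dta['YEAR'] = dta.YEAR.astype(int).astype(str)
    # Flatten the year-by-month table into a single monthly series
    dta = dta.set_index('YEAR').T.unstack()
    data = dta.values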
    

    It is easy to get the 30 built-in factories of continuous univariate distributions with the GetContinuousUniVariateFactories static method. Once this is done, the BestModelBIC static method returns the best model and the corresponding BIC score.

    import openturns as ot
    sample = ot.Sample([[p] for p in data]) # data reshaping: one point per row
    tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
    best_model, best_bic = ot.FittingTest.BestModelBIC(sample,
                                                       tested_factories)
    print("Best=",best_model)
    

    which prints:

    Best= Beta(alpha = 1.64258, beta = 2.4348, a = 18.936, b = 29.254)
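
    To help read these parameters, the sketch below evaluates a four-parameter Beta density by hand, checks that it integrates to 1, and computes its mean. It assumes the common parameterization with shape parameters alpha and beta on the support [a, b]; that this matches the convention behind the printed result is an assumption, and the helper names are hypothetical.

```python
import math

# Parameters copied from the printed result above.
ALPHA, BETA, A, B = 1.64258, 2.4348, 18.936, 29.254


def beta_pdf(x, alpha=ALPHA, beta=BETA, a=A, b=B):
    """Density of a Beta(alpha, beta) distribution rescaled to [a, b]."""
    if x <= a or x >= b:
        return 0.0
    # log of the Beta function B(alpha, beta), via log-gamma for stability.
    log_beta_fn = (math.lgamma(alpha) + math.lgamma(beta)
                   - math.lgamma(alpha + beta))
    log_pdf = ((alpha - 1.0) * math.log(x - a)
               + (beta - 1.0) * math.log(b - x)
               - log_beta_fn
               - (alpha + beta - 1.0) * math.log(b - a))
    return math.exp(log_pdf)


def beta_mean(alpha=ALPHA, beta=BETA, a=A, b=B):
    """Mean of the rescaled Beta: a + (b - a) * alpha / (alpha + beta)."""
    return a + (b - a) * alpha / (alpha + beta)


def integrate_midpoint(f, lo, hi, steps=20000):
    """Plain midpoint rule; enough for a bounded density on [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (i + 0.5) * width) for i in range(steps))


total = integrate_midpoint(beta_pdf, A, B)
mean = beta_mean()
print(f"integral of the density = {total:.4f}")
print(f"mean temperature        = {mean:.2f}")
```

    Under that assumption, a and b bound the fitted temperatures, and the two shape parameters control how the mass is spread between those bounds.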
    

    To compare the fit to the histogram graphically, I use the drawPDF method of the best distribution.

    import openturns.viewer as otv
    graph = ot.HistogramFactory().build(sample).drawPDF()
    bestPDF = best_model.drawPDF()
    bestPDF.setColors(["blue"])
    graph.add(bestPDF)
    graph.setTitle("Best BIC fit")
    name = best_model.getImplementation().getClassName()
    graph.setLegends(["Histogram",name])
    graph.setXTitle("Temperature (°C)")
    otv.View(graph)
    

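    This produces:

    [Figure: "Best BIC fit", the histogram of the data with the PDF of the fitted Beta distribution overlaid in blue; x-axis: Temperature (°C)]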

    More details on this topic are presented in the BestModelBIC doc. It would be possible to include SciPy distributions through the SciPyDistribution class, or even ChaosPy distributions through ChaosPyDistribution, but I guess that the current script fulfills most practical purposes.
