Fitting empirical distribution to theoretical ones with Scipy (Python)?

Asked 2020-11-22 05:28

INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g. [0,0,0,0,...,1,1,1,1,...,2,2,2,2,...,47,47,47,...]

9 Answers
  • 2020-11-22 05:45

    AFAICU, your distribution is discrete (and nothing but discrete). Therefore just counting the frequencies of different values and normalizing them should be enough for your purposes. So, an example to demonstrate this:

    In []: from numpy import asarray, bincount
    In []: values = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
    In []: counts = asarray(bincount(values), dtype=float)
    In []: cdf = counts.cumsum() / counts.sum()
    

    Thus, the probability of seeing values higher than 1 is, according to the complementary cumulative distribution function (CCDF), simply:

    In []: 1- cdf[1]
    Out[]: 0.40000000000000002
    

    Please note that the CCDF is closely related to the survival function (sf), but it's also defined for discrete distributions, whereas the sf is defined only for continuous distributions.
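
    The same empirical CCDF can also be wrapped in a scipy.stats.rv_discrete object, whose sf() method then gives P(X > x) directly. A minimal sketch (assuming NumPy and SciPy):

    import numpy as np
    from scipy import stats

    values = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
    counts = np.bincount(values)
    support = np.arange(len(counts))
    # Build a discrete distribution from the empirical frequencies
    empirical = stats.rv_discrete(values=(support, counts / counts.sum()))

    print(empirical.sf(1))  # P(X > 1) = 0.4, matching 1 - cdf[1] above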

  • 2020-11-22 05:49

    This sounds like a probability density estimation problem to me.

    import numpy as np
    from scipy.stats import gaussian_kde

    occurences = [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47]  # the full data list from the question
    values = np.arange(0, 48)
    kde = gaussian_kde(np.asarray(occurences, dtype=float))
    p = kde(values)
    p = p / p.sum()
    print("P(x>=1) = %f" % p[1:].sum())
    

    Also see http://jpktd.blogspot.com/2009/03/using-gaussian-kernel-density.html.
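
    If you want the tail probability directly from the fitted KDE rather than from a normalized grid, gaussian_kde also has an integrate_box_1d method. A minimal sketch with made-up stand-in data:

    import numpy as np
    from scipy.stats import gaussian_kde

    # Hypothetical stand-in for the questioner's 30,000 observations
    data = np.random.randint(0, 48, size=30000).astype(float)
    kde = gaussian_kde(data)
    # Integrate the estimated density over [1, inf) to get P(X >= 1)
    print("P(x>=1) = %f" % kde.integrate_box_1d(1, np.inf))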

  • 2020-11-22 05:50

    Distribution Fitting with Sum of Square Error (SSE)

    This is an update and modification of Saullo's answer that uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the fitted distribution's PDF and the data's histogram.

    Example Fitting

    Using the El Niño dataset from statsmodels, the distributions are fit and error is determined. The distribution with the least error is returned.

    All Distributions

    [figure: all fitted distributions overlaid on the data histogram]

    Best Fit Distribution

    [figure: the best-fit distribution overlaid on the data histogram]

    Example Code

    %matplotlib inline
    
    import warnings
    import numpy as np
    import pandas as pd
    import scipy.stats as st
    import statsmodels as sm
    import matplotlib
    import matplotlib.pyplot as plt
    
    matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
    matplotlib.style.use('ggplot')
    
    # Create models from data
    def best_fit_distribution(data, bins=200, ax=None):
        """Model data by finding best fit distribution to data"""
        # Get histogram of original data
        y, x = np.histogram(data, bins=bins, density=True)
        x = (x + np.roll(x, -1))[:-1] / 2.0
    
        # Distributions to check (note: frechet_r and frechet_l were removed in newer SciPy releases)
        DISTRIBUTIONS = [
            st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime,st.bradford,st.burr,st.cauchy,st.chi,st.chi2,st.cosine,
            st.dgamma,st.dweibull,st.erlang,st.expon,st.exponnorm,st.exponweib,st.exponpow,st.f,st.fatiguelife,st.fisk,
            st.foldcauchy,st.foldnorm,st.frechet_r,st.frechet_l,st.genlogistic,st.genpareto,st.gennorm,st.genexpon,
            st.genextreme,st.gausshyper,st.gamma,st.gengamma,st.genhalflogistic,st.gilbrat,st.gompertz,st.gumbel_r,
            st.gumbel_l,st.halfcauchy,st.halflogistic,st.halfnorm,st.halfgennorm,st.hypsecant,st.invgamma,st.invgauss,
            st.invweibull,st.johnsonsb,st.johnsonsu,st.ksone,st.kstwobign,st.laplace,st.levy,st.levy_l,st.levy_stable,
            st.logistic,st.loggamma,st.loglaplace,st.lognorm,st.lomax,st.maxwell,st.mielke,st.nakagami,st.ncx2,st.ncf,
            st.nct,st.norm,st.pareto,st.pearson3,st.powerlaw,st.powerlognorm,st.powernorm,st.rdist,st.reciprocal,
            st.rayleigh,st.rice,st.recipinvgauss,st.semicircular,st.t,st.triang,st.truncexpon,st.truncnorm,st.tukeylambda,
            st.uniform,st.vonmises,st.vonmises_line,st.wald,st.weibull_min,st.weibull_max,st.wrapcauchy
        ]
    
        # Best holders
        best_distribution = st.norm
        best_params = (0.0, 1.0)
        best_sse = np.inf
    
        # Estimate distribution parameters from data
        for distribution in DISTRIBUTIONS:
    
            # Try to fit the distribution
            try:
                # Ignore warnings from data that can't be fit
                with warnings.catch_warnings():
                    warnings.filterwarnings('ignore')
    
                    # fit dist to data
                    params = distribution.fit(data)
    
                    # Separate parts of parameters
                    arg = params[:-2]
                    loc = params[-2]
                    scale = params[-1]
    
                    # Calculate fitted PDF and error with fit in distribution
                    pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                    sse = np.sum(np.power(y - pdf, 2.0))
    
                    # if an axis was passed in, add this fitted PDF to the plot
                    try:
                        if ax is not None:
                            pd.Series(pdf, x).plot(ax=ax)
                    except Exception:
                        pass
    
                    # identify if this distribution is better
                    if best_sse > sse > 0:
                        best_distribution = distribution
                        best_params = params
                        best_sse = sse
    
            except Exception:
                pass
    
        return (best_distribution.name, best_params)
    
    def make_pdf(dist, params, size=10000):
        """Generate distributions's Probability Distribution Function """
    
        # Separate parts of parameters
        arg = params[:-2]
        loc = params[-2]
        scale = params[-1]
    
        # Get sane start and end points of distribution
        start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
        end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)
    
        # Build PDF and turn into pandas Series
        x = np.linspace(start, end, size)
        y = dist.pdf(x, loc=loc, scale=scale, *arg)
        pdf = pd.Series(y, x)
    
        return pdf
    
    # Load data from statsmodels datasets
    data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())
    
    # Plot for comparison
    plt.figure(figsize=(12,8))
    ax = data.plot(kind='hist', bins=50, density=True, alpha=0.5, color=plt.rcParams['axes.prop_cycle'].by_key()['color'][1])
    # Save plot limits
    dataYLim = ax.get_ylim()
    
    # Find best fit distribution
    best_fit_name, best_fit_params = best_fit_distribution(data, 200, ax)
    best_dist = getattr(st, best_fit_name)
    
    # Update plots
    ax.set_ylim(dataYLim)
    ax.set_title(u'El Niño sea temp.\n All Fitted Distributions')
    ax.set_xlabel(u'Temp (°C)')
    ax.set_ylabel('Frequency')
    
    # Make PDF with best params 
    pdf = make_pdf(best_dist, best_fit_params)
    
    # Display
    plt.figure(figsize=(12,8))
    ax = pdf.plot(lw=2, label='PDF', legend=True)
    data.plot(kind='hist', bins=50, density=True, alpha=0.5, label='Data', legend=True, ax=ax)
    
    param_names = (best_dist.shapes + ', loc, scale').split(', ') if best_dist.shapes else ['loc', 'scale']
    param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_fit_params)])
    dist_str = '{}({})'.format(best_fit_name, param_str)
    
    ax.set_title(u'El Niño sea temp. with best fit distribution \n' + dist_str)
    ax.set_xlabel(u'Temp. (°C)')
    ax.set_ylabel('Frequency')
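
    If you only need the winning distribution programmatically (no plots), a minimal usage sketch, assuming the functions and data defined above:

    # Fit without plotting, then draw a few samples from the winner
    best_fit_name, best_fit_params = best_fit_distribution(data, bins=200)
    best_dist = getattr(st, best_fit_name)
    samples = best_dist.rvs(*best_fit_params[:-2], loc=best_fit_params[-2],
                            scale=best_fit_params[-1], size=5)
    print(best_fit_name, samples)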
    
  • 2020-11-22 05:50

    While many of the above answers are completely valid, no one seems to answer your question completely, specifically the part:

    I don't know if I am right, but to determine probabilities I think I need to fit my data to a theoretical distribution that is the most suitable to describe my data. I assume that some kind of goodness of fit test is needed to determine the best model.

    The parametric approach

    This is the process you're describing: using some theoretical distribution and fitting its parameters to your data. There are some excellent answers here on how to do this.

    The non-parametric approach

    However, it's also possible to use a non-parametric approach to your problem, which means you do not assume any underlying distribution at all.

    By using the so-called empirical distribution function, which equals Fn(x) = (1/n) * SUM( I[Xi <= x] ), i.e. the proportion of observations at or below x.

    As was pointed out in one of the answers above, what you're interested in is the complementary CDF (i.e. the survival function), which equals 1 - F(x).

    It can be shown that the empirical distribution function converges to whatever 'true' CDF generated your data.

    Furthermore, it is straightforward to construct a 1-alpha confidence band via the Dvoretzky-Kiefer-Wolfowitz inequality:

    L(x) = max{ Fn(x) - en, 0 }
    U(x) = min{ Fn(x) + en, 1 }
    en = sqrt( (1/(2n)) * log(2/alpha) )
    

    Then P( L(x) <= F(x) <= U(x) for all x ) >= 1-alpha.
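
    A minimal sketch of that band (assuming NumPy), evaluated at the sorted observations:

    import numpy as np

    def dkw_band(data, alpha=0.05):
        """1-alpha confidence band for the ECDF, per the formulas above."""
        data = np.sort(np.asarray(data))
        n = len(data)
        fn = np.arange(1, n + 1) / n                 # Fn at each sorted point
        en = np.sqrt(np.log(2 / alpha) / (2 * n))
        return data, np.maximum(fn - en, 0), np.minimum(fn + en, 1)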

    I'm quite surprised that PierrOz's answer has 0 votes, while it's a completely valid answer to the question, using a non-parametric approach to estimating F(x).

    Note that the issue you mention of P(X>=x)=0 for any x>47 simply reflects a personal preference that might lead you to choose the parametric approach over the non-parametric one. Both approaches, however, are completely valid solutions to your problem.

    For more details and proofs of the above statements, I would recommend having a look at 'All of Statistics: A Concise Course in Statistical Inference' by Larry A. Wasserman, an excellent book on both parametric and non-parametric inference.

    EDIT: Since you specifically asked for some Python examples, it can be done using numpy:

    import numpy as np
    
    def empirical_cdf(data, x):
        # Fn(x): proportion of observations at or below x
        return np.sum(data <= x) / len(data)

    def p_value(data, x):
        # P(X > x) = 1 - Fn(x)
        return 1 - empirical_cdf(data, x)
    
    # Generate some data for demonstration purposes
    data = np.floor(np.random.uniform(low=0, high=48, size=30000))
    
    print(empirical_cdf(data, 20))
    print(p_value(data, 20)) # This is the value you're interested in
    
  • 2020-11-22 05:52

    With OpenTURNS, I would use the BIC criterion to select the best distribution that fits such data. This is because this criterion does not give too much of an advantage to distributions with more parameters. Indeed, if a distribution has more parameters, it is easier for the fitted distribution to be close to the data. Moreover, the Kolmogorov-Smirnov test may not make sense in this case, because a small error in the measured values will have a huge impact on the p-value.

    To illustrate the process, I load the El-Nino data, which contains 732 monthly temperature measurements from 1950 to 2010:

    import statsmodels.api as sm
    dta = sm.datasets.elnino.load_pandas().data
    dta['YEAR'] = dta.YEAR.astype(int).astype(str)
    dta = dta.set_index('YEAR').T.unstack()
    data = dta.values
    

    It is easy to get the 30 built-in univariate distribution factories with the GetContinuousUniVariateFactories static method. Once done, the BestModelBIC static method returns the best model and the corresponding BIC score.

    import openturns as ot

    sample = ot.Sample([[p] for p in data])  # reshape the data to a 1-column sample
    tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
    best_model, best_bic = ot.FittingTest.BestModelBIC(sample,
                                                       tested_factories)
    print("Best=",best_model)
    

    which prints:

    Best= Beta(alpha = 1.64258, beta = 2.4348, a = 18.936, b = 29.254)
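
    The second return value holds the BIC score of the winner, which can be compared against any specific candidate. A minimal sketch, assuming the objects above (and that FittingTest.BIC behaves as in current OpenTURNS):

    # BIC score of the winning model
    print("BIC =", best_bic)

    # BIC of one specific candidate, e.g. a Normal fit, for comparison
    normal_fit = ot.NormalFactory().build(sample)
    print("Normal BIC =", ot.FittingTest.BIC(sample, normal_fit))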
    

    In order to graphically compare the fit to the histogram, I use the drawPDF method of the best distribution.

    import openturns.viewer as otv
    graph = ot.HistogramFactory().build(sample).drawPDF()
    bestPDF = best_model.drawPDF()
    bestPDF.setColors(["blue"])
    graph.add(bestPDF)
    graph.setTitle("Best BIC fit")
    name = best_model.getImplementation().getClassName()
    graph.setLegends(["Histogram",name])
    graph.setXTitle("Temperature (°C)")
    otv.View(graph)
    

    This produces:

    [figure: histogram of the data with the best BIC fit overlaid]

    More details on this topic are presented in the BestModelBIC doc. It would also be possible to include SciPy distributions via SciPyDistribution, or even ChaosPy distributions via ChaosPyDistribution, but I guess that the current script fulfills most practical purposes.

  • 2020-11-22 05:56

    There are 82 implemented distribution functions in SciPy 0.12.0. You can test how some of them fit your data using their fit() method. Check the code below for more details:


    import matplotlib.pyplot as plt
    import numpy as np
    import scipy.stats

    size = 30000
    x = np.arange(size)
    # Synthetic sample: scaled, rounded von Mises draws standing in for the data
    y = np.round(scipy.stats.vonmises.rvs(5, size=size) * 47).astype(int)
    h = plt.hist(y, bins=range(48))

    dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']

    for dist_name in dist_names:
        dist = getattr(scipy.stats, dist_name)
        param = dist.fit(y)
        pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
        plt.plot(pdf_fitted, label=dist_name)
        plt.xlim(0, 47)
    plt.legend(loc='upper right')
    plt.show()
    

    References:

    - Fitting distributions, goodness of fit, p-value. Is it possible to do this with Scipy (Python)?

    - Distribution fitting with Scipy

    And here is a list of the names of all distribution functions available in Scipy 0.12.0:

    dist_names = [ 'alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr', 'cauchy', 'chi', 'chi2', 'cosine', 'dgamma', 'dweibull', 'erlang', 'expon', 'exponweib', 'exponpow', 'f', 'fatiguelife', 'fisk', 'foldcauchy', 'foldnorm', 'frechet_r', 'frechet_l', 'genlogistic', 'genpareto', 'genexpon', 'genextreme', 'gausshyper', 'gamma', 'gengamma', 'genhalflogistic', 'gilbrat', 'gompertz', 'gumbel_r', 'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'hypsecant', 'invgamma', 'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu', 'ksone', 'kstwobign', 'laplace', 'logistic', 'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke', 'nakagami', 'ncx2', 'ncf', 'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rdist', 'reciprocal', 'rayleigh', 'rice', 'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon', 'truncnorm', 'tukeylambda', 'uniform', 'vonmises', 'wald', 'weibull_min', 'weibull_max', 'wrapcauchy'] 
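
    For whatever SciPy version is installed, that list can also be generated programmatically rather than hard-coded; a minimal sketch:

    import scipy.stats as st

    # Collect every continuous distribution exposed by scipy.stats
    dist_names = [name for name in dir(st)
                  if isinstance(getattr(st, name), st.rv_continuous)]
    print(len(dist_names))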
    