Given a set of random numbers drawn from a continuous univariate distribution, find the distribution

情书的邮戳 2020-12-04 08:27

Given a set of real numbers drawn from an unknown continuous univariate distribution (let's say it is one of beta, Cauchy, chi-square, exponential, F, gamma, Laplace, log-normal, …), how do I find the distribution?

6 Answers
  • 2020-12-04 09:08

    My first approach would be to generate qq plots of the given data against the possible distributions.

    x <- c(15.771062,14.741310,9.081269,11.276436,11.534672,17.980860,13.550017,13.853336,11.262280,11.049087,14.752701,4.481159,11.680758,11.451909,10.001488,11.106817,7.999088,10.591574,8.141551,12.401899,11.215275,13.358770,8.388508,11.875838,3.137448,8.675275,17.381322,12.362328,10.987731,7.600881,14.360674,5.443649,16.024247,11.247233,9.549301,9.709091,13.642511,10.892652,11.760685,11.717966,11.373979,10.543105,10.230631,9.918293,10.565087,8.891209,10.021141,9.152660,10.384917,8.739189,5.554605,8.575793,12.016232,10.862214,4.938752,14.046626,5.279255,11.907347,8.621476,7.933702,10.799049,8.567466,9.914821,7.483575,11.098477,8.033768,10.954300,8.031797,14.288100,9.813787,5.883826,7.829455,9.462013,9.176897,10.153627,4.922607,6.818439,9.480758,8.166601,12.017158,13.279630,14.464876,13.319124,12.331335,3.194438,9.866487,11.337083,8.958164,8.241395,4.289313,5.508243,4.737891,7.577698,9.626720,16.558392,10.309173,11.740863,8.761573,7.099866,10.032640)
    > qqnorm(x)
    

    For more info see link
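    Since only qqnorm is shown above, here is a minimal sketch of the same idea for an arbitrary candidate distribution (gamma is used purely as an illustration, and the x below is a stand-in for the data in the question):

    ```r
    library(MASS)                           # fitdistr (MASS ships with R)
    set.seed(1)
    x <- rgamma(100, shape = 9, rate = 1)   # stand-in for the question's data
    fit <- fitdistr(x, "gamma")             # ML estimates of shape and rate
    theo <- qgamma(ppoints(length(x)),      # theoretical quantiles at the
                   shape = fit$estimate["shape"],  # usual plotting positions
                   rate  = fit$estimate["rate"])
    qqplot(theo, x, xlab = "Theoretical gamma quantiles",
           ylab = "Sample quantiles")
    abline(0, 1)                            # points near this line => plausible fit
    ```

    The same three lines (fit, compute theoretical quantiles, qqplot) work for any distribution with a q-function.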

    Another possibility is based on the fitdistr function in the MASS package. Here are the candidate distributions, ordered by their log-likelihood:

    > library(MASS)
    > fitdistr(x, 't')$loglik
    [1] -252.2659
    Warning message:
    In log(s) : NaNs produced
    > fitdistr(x, 'normal')$loglik
    [1] -252.2968
    > fitdistr(x, 'logistic')$loglik
    [1] -252.2996
    > fitdistr(x, 'weibull')$loglik
    [1] -252.3507
    > fitdistr(x, 'gamma')$loglik
    [1] -255.9099
    > fitdistr(x, 'lognormal')$loglik
    [1] -260.6328
    > fitdistr(x, 'exponential')$loglik
    [1] -331.8191
    Warning messages:
    1: In dgamma(x, shape, scale, log) : NaNs produced
    2: In dgamma(x, shape, scale, log) : NaNs produced
    
  • 2020-12-04 09:12

    You could try using the Kolmogorov-Smirnov tests (ks.test in R).

    If you have time-to-event data, here's software that does a Bayesian chi squared test against a list of common distributions to report the best fit.
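    A minimal sketch of the one-sample test (x is a stand-in for the data in the question; note that estimating the parameters from the same data makes the p-value optimistic, as the ks.test documentation warns):

    ```r
    set.seed(1)
    x <- rnorm(100, mean = 10, sd = 3)   # stand-in for the question's data
    # One-sample K-S test of x against a normal with parameters estimated from x
    ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
    ```

    For a properly calibrated p-value you would need parameters fixed in advance, or a correction such as a parametric bootstrap.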

  • 2020-12-04 09:16

    Another similar approach is using the fitdistrplus package

    library(fitdistrplus)
    

    Loop through the distributions of interest and generate 'fitdist' objects. Use either "mle" for maximum likelihood estimation or "mme" for matching moment estimation, as the fitting method.

    f1<-fitdist(x,"norm",method="mle")
    

    Use bootstrap re-sampling to simulate uncertainty in the parameters of the selected model (here f_best stands for the fitdist object of whichever distribution you selected, e.g. f_best <- f1):

    b_best<-bootdist(f_best)
    print(f_best)
    plot(f_best)
    summary(f_best)
    

    The fitdist method also accepts custom distributions, or distributions from other packages, provided that the corresponding density function dname, distribution function pname, and quantile function qname have been defined (or even just the density function).

    So if you wanted to test the log-likelihood for the inverse normal distribution:

    library(ig)
    fitdist(x,"igt",method="mle",start=list(mu=mean(x),lambda=1))$loglik
    

    You may also find Fitting distributions with R helpful.

  • 2020-12-04 09:24

    As others have pointed out, this might be framed as a model-selection question. Simply using the distribution that fits the data best, without taking the complexity of the distribution into account, is the wrong approach: a more complicated distribution will generally fit better, but it is also more likely to overfit the data.

    You can use the Akaike Information Criterion (AIC) to take the complexity of the distribution into account. This is still unsatisfactory, as you're only considering a limited number of distributions, but it is better than using the log-likelihood alone.

    I use just a few distributions here, but you can check the documentation to find others that could be relevant.

    Using the fitdistrplus you can run:

    library(fitdistrplus)
    
    distributions = c("norm", "lnorm", "exp",
              "cauchy", "gamma", "logis",
              "weibull")
    
    
    # the x vector is defined as in the question
    
    # Plot to see which distributions make sense. This should influence
    # your choice of candidate distributions
    descdist(x, discrete = FALSE, boot = 500)
    
    distr_aic = list()
    distr_fit = list()
    for (distribution in distributions) {
        distr_fit[[distribution]] = fitdist(x, distribution)
        distr_aic[[distribution]] = distr_fit[[distribution]]$aic
    }
    
    > distr_aic
    $norm
    [1] 5032.269
    
    $lnorm
    [1] 5421.815
    
    $exp
    [1] 6602.334
    
    $cauchy
    [1] 5382.643
    
    $gamma
    [1] 5184.17
    
    $logis
    [1] 5047.796
    
    $weibull
    [1] 5058.336
    

    According to our plot and the AIC, it makes sense to use a normal. You can automate this by simply picking the distribution with the minimum AIC. You can check the estimated parameters with

    > distr_fit[['norm']]
    Fitting of the distribution ' norm ' by maximum likelihood 
    Parameters:
         estimate Std. Error
    mean 9.975849 0.09454476
    sd   2.989768 0.06685321
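    Picking the minimum-AIC candidate mentioned above can be automated in one line (AIC values copied from the run above):

    ```r
    distr_aic <- list(norm = 5032.269, lnorm = 5421.815, exp = 6602.334,
                      cauchy = 5382.643, gamma = 5184.17, logis = 5047.796,
                      weibull = 5058.336)
    best <- names(which.min(unlist(distr_aic)))
    best   # "norm"
    ```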
    
  • 2020-12-04 09:30

    I find it hard to imagine a realistic situation where this would be useful. Why not use a non-parametric tool like a kernel density estimate?

  • 2020-12-04 09:31

    (Answer edited to add additional explanation)

    1. You can't really find "the" distribution; the actual distribution from which data are drawn can nearly always* be guaranteed not to be in any "laundry list" provided by any such software. At best you can find "a" distribution (more likely several), one that is an adequate description. Even if you find a great fit there are always an infinity of distributions that are arbitrarily close by. Real data tends to be drawn from heterogeneous mixtures of distributions that themselves don't necessarily have simple functional form.

      * an example where you might hope to is where you know the data were actually generated from exactly one distribution on a list, but such situations are extremely rare.

    2. I don't think just comparing likelihoods is necessarily going to make sense, since some distributions have more parameters than others. AIC might make more sense, except that ...

    3. Attempting to identify a "best fitting" distribution from a list of candidates will tend to produce overfitting, and unless the effect of such model selection is accounted for properly it will lead to overconfidence (a model that looks great in-sample but fails to fit data outside your sample). There are tools for this in R (the package fitdistrplus comes to mind), but as a common practice I would advise against the idea. If you must do it, use holdout samples or cross-validation to obtain models with better generalization error.
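    One sketch of the holdout idea (the distribution names and x are illustrative; each candidate is fit on a training split with MASS::fitdistr, then the log-likelihood is evaluated on the held-out half):

    ```r
    library(MASS)
    set.seed(1)
    x <- rnorm(200, mean = 10, sd = 3)     # stand-in data
    train <- x[1:100]; held <- x[101:200]  # simple 50/50 split
    fit_norm  <- fitdistr(train, "normal")
    fit_logis <- fitdistr(train, "logistic")
    heldout_ll <- c(
      normal   = sum(dnorm(held, fit_norm$estimate["mean"],
                           fit_norm$estimate["sd"], log = TRUE)),
      logistic = sum(dlogis(held, fit_logis$estimate["location"],
                            fit_logis$estimate["scale"], log = TRUE)))
    heldout_ll                             # higher held-out log-likelihood wins
    ```

    Because the held-out half never influenced the parameter estimates, this comparison penalizes overfitting in a way that raw in-sample log-likelihood does not.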
