Given a set of real numbers drawn from an unknown continuous univariate distribution (let's say it is one of beta, Cauchy, chi-square, exponential, F, gamma, Laplace, log-no
My first approach would be to generate Q-Q plots of the given data against each of the candidate distributions.
x <- c(15.771062,14.741310,9.081269,11.276436,11.534672,17.980860,13.550017,13.853336,11.262280,11.049087,14.752701,4.481159,11.680758,11.451909,10.001488,11.106817,7.999088,10.591574,8.141551,12.401899,11.215275,13.358770,8.388508,11.875838,3.137448,8.675275,17.381322,12.362328,10.987731,7.600881,14.360674,5.443649,16.024247,11.247233,9.549301,9.709091,13.642511,10.892652,11.760685,11.717966,11.373979,10.543105,10.230631,9.918293,10.565087,8.891209,10.021141,9.152660,10.384917,8.739189,5.554605,8.575793,12.016232,10.862214,4.938752,14.046626,5.279255,11.907347,8.621476,7.933702,10.799049,8.567466,9.914821,7.483575,11.098477,8.033768,10.954300,8.031797,14.288100,9.813787,5.883826,7.829455,9.462013,9.176897,10.153627,4.922607,6.818439,9.480758,8.166601,12.017158,13.279630,14.464876,13.319124,12.331335,3.194438,9.866487,11.337083,8.958164,8.241395,4.289313,5.508243,4.737891,7.577698,9.626720,16.558392,10.309173,11.740863,8.761573,7.099866,10.032640)
> qqnorm(x)
For more info see link
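Note that qqnorm only compares the data against the normal distribution. For other candidates, a Q-Q plot can be built by hand with qqplot and the candidate's quantile function. Here is a minimal sketch, assuming a gamma candidate with method-of-moments parameter estimates; the simulated x is only a stand-in for the data above, and shape_hat, rate_hat and theo are illustrative names:

```r
# Stand-in data (assumption): replace with the x vector from the question
set.seed(42)
x <- rgamma(100, shape = 11, rate = 1.1)

# Method-of-moments estimates for a gamma candidate
shape_hat <- mean(x)^2 / var(x)
rate_hat  <- mean(x) / var(x)

# Theoretical quantiles at the usual plotting positions
p    <- ppoints(length(x))
theo <- qgamma(p, shape = shape_hat, rate = rate_hat)

# Points close to the reference line suggest an adequate fit
qqplot(theo, x,
       xlab = "Theoretical gamma quantiles",
       ylab = "Sample quantiles")
abline(0, 1)
```

Swapping qgamma for another quantile function (qweibull, qlnorm, ...) gives the corresponding plot for that candidate.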
Another possibility is based on the fitdistr function in the MASS package. Here are the candidate distributions, ordered by their log-likelihood:
> library(MASS)
> fitdistr(x, 't')$loglik
[1] -252.2659
Warning message:
In log(s) : NaNs produced
> fitdistr(x, 'normal')$loglik
[1] -252.2968
> fitdistr(x, 'logistic')$loglik
[1] -252.2996
> fitdistr(x, 'weibull')$loglik
[1] -252.3507
> fitdistr(x, 'gamma')$loglik
[1] -255.9099
> fitdistr(x, 'lognormal')$loglik
[1] -260.6328
> fitdistr(x, 'exponential')$loglik
[1] -331.8191
Warning messages:
1: In dgamma(x, shape, scale, log) : NaNs produced
2: In dgamma(x, shape, scale, log) : NaNs produced
You could try using the Kolmogorov-Smirnov test (ks.test in R).
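A minimal sketch of such a test against a normal candidate follows. The simulated x is a stand-in for the real data (an assumption). Keep in mind that when the parameters are estimated from the same sample, as they are here, the reported p-value is too optimistic, so treat it as a rough guide only:

```r
# Stand-in data (assumption): replace with the x vector from the question
set.seed(42)
x <- rnorm(100, mean = 10, sd = 3)

# One-sample Kolmogorov-Smirnov test against a normal distribution
# whose parameters are estimated from x itself
res <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

res$statistic  # maximum distance between empirical and candidate CDF
res$p.value    # optimistic, because the parameters were fitted to x
```

Other candidates work the same way: pass the name of the cumulative distribution function (for example "pgamma") followed by its parameters.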
If you have time-to-event data, here's software that does a Bayesian chi squared test against a list of common distributions to report the best fit.
A similar approach uses the fitdistrplus package:
library(fitdistrplus)
Loop through the distributions of interest and generate 'fitdist' objects. Use either "mle" (maximum likelihood estimation) or "mme" (moment matching estimation) as the fitting method.
f1 <- fitdist(x, "norm", method = "mle")
Pick the fit you prefer (here the normal fit, purely as an example) and use bootstrap resampling to simulate the uncertainty in the parameters of the selected model
f_best <- f1
b_best <- bootdist(f_best)
print(f_best)
plot(f_best)
summary(f_best)
The fitdist function allows for using custom distributions or distributions from other packages, provided that the corresponding density function dname, the corresponding distribution function pname and the corresponding quantile function qname have been defined (or even just the density function).
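As an illustration of the dname/pname/qname convention, here is a sketch that defines a Laplace distribution by hand and fits it with fitdist. The function names dlaplace, plaplace and qlaplace, their location/scale parameterization, and the simulated x are all assumptions made for this example, not something fitdistrplus provides:

```r
library(fitdistrplus)

# Hand-written Laplace density, distribution and quantile functions
dlaplace <- function(x, location, scale) {
  exp(-abs(x - location) / scale) / (2 * scale)
}
plaplace <- function(q, location, scale) {
  z <- (q - location) / scale
  0.5 + 0.5 * sign(z) * (1 - exp(-abs(z)))
}
qlaplace <- function(p, location, scale) {
  location - scale * sign(p - 0.5) * log(1 - 2 * abs(p - 0.5))
}

# Stand-in data (assumption): replace with the x vector from the question
set.seed(42)
x <- rnorm(100, mean = 10, sd = 3)

# Custom distributions need explicit starting values
start <- list(location = median(x),
              scale    = mean(abs(x - median(x))))
fit <- fitdist(x, "laplace", method = "mle", start = start)

fit$estimate
fit$loglik
```

The starting values used here are the closed-form maximum likelihood estimates for the Laplace distribution, so the optimizer should not need to move far from them.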
So if you wanted to test the log-likelihood for the inverse normal distribution:
library(ig)
fitdist(x,"igt",method="mle",start=list(mu=mean(x),lambda=1))$loglik
You may also find Fitting distributions with R helpful.
As others have pointed out, this can be framed as a model selection question. It is a mistake to choose the distribution that fits the data best without taking into account the complexity of the distribution: a more complicated distribution will generally fit better, but it is also more likely to overfit the data.
You can use the Akaike Information Criterion (AIC) to take the complexity of the distribution into account. This is still unsatisfactory, since you are only considering a limited number of distributions, but it is better than using the log-likelihood alone.
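To make the penalty concrete: AIC = 2k - 2 log L, where k is the number of estimated parameters and log L is the maximized log-likelihood. The sketch below applies this to the fitdistr log-likelihoods reported earlier in this thread; the parameter counts are my reading of what fitdistr estimates for each family (three for 't', one for 'exponential', two for the rest):

```r
# Log-likelihoods as reported by fitdistr earlier in the thread
loglik <- c(t = -252.2659, normal = -252.2968, logistic = -252.2996,
            weibull = -252.3507, gamma = -255.9099,
            lognormal = -260.6328, exponential = -331.8191)

# Number of estimated parameters per family (assumed, see lead-in)
k <- c(t = 3, normal = 2, logistic = 2, weibull = 2,
       gamma = 2, lognormal = 2, exponential = 1)

# AIC = 2k - 2 log L; smaller is better
aic <- 2 * k - 2 * loglik
sort(aic)
```

The t distribution has the highest log-likelihood but pays for its extra parameter, so it no longer comes out on top once the penalty is applied.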
I use just a few distributions, but you can check the documentation to find others that could be relevant.
Using the fitdistrplus package, you can run:
library(fitdistrplus)
distributions = c("norm", "lnorm", "exp",
"cauchy", "gamma", "logis",
"weibull")
# the x vector is defined as in the question
# Plot to see which distributions make sense. This should influence
# your choice of candidate distributions
descdist(x, discrete = FALSE, boot = 500)
distr_aic = list()
distr_fit = list()
for (distribution in distributions) {
    distr_fit[[distribution]] = fitdist(x, distribution)
    distr_aic[[distribution]] = distr_fit[[distribution]]$aic
}
> distr_aic
$norm
[1] 5032.269
$lnorm
[1] 5421.815
$exp
[1] 6602.334
$cauchy
[1] 5382.643
$gamma
[1] 5184.17
$logis
[1] 5047.796
$weibull
[1] 5058.336
According to the plot and the AIC, it makes sense to use a normal distribution. You can automate this by picking the distribution with the minimum AIC. You can check the estimated parameters with
> distr_fit[['norm']]
Fitting of the distribution ' norm ' by maximum likelihood
Parameters:
     estimate Std. Error
mean 9.975849 0.09454476
sd   2.989768 0.06685321
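Picking the minimum-AIC distribution automatically, as mentioned above, can be sketched as follows. To keep the snippet self-contained, distr_aic is rebuilt here from the values printed above; in practice you would use the list produced by the loop. The names best and delta are illustrative:

```r
# AIC values copied from the output above
distr_aic <- list(norm = 5032.269, lnorm = 5421.815, exp = 6602.334,
                  cauchy = 5382.643, gamma = 5184.17,
                  logis = 5047.796, weibull = 5058.336)

aic_values <- unlist(distr_aic)

# Name of the distribution with the smallest AIC
best <- names(which.min(aic_values))
best

# AIC differences relative to the best candidate, smallest first
delta <- sort(aic_values - min(aic_values))
delta
```

Looking at the differences rather than only the winner shows how close the runners-up are, which is useful when several candidates fit about equally well.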
I find it hard to imagine a realistic situation where this would be useful. Why not use a non-parametric tool like a kernel density estimate?
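For reference, a kernel density estimate needs only base R. This is a minimal sketch with the default Gaussian kernel and default bandwidth; the simulated x is a stand-in for the real data (an assumption):

```r
# Stand-in data (assumption): replace with the x vector from the question
set.seed(42)
x <- rnorm(100, mean = 10, sd = 3)

# Kernel density estimate with the default kernel and bandwidth
d <- density(x)
d$bw  # bandwidth that was selected

# Overlay the estimate on a histogram of the data
hist(x, freq = FALSE, main = "Kernel density estimate")
lines(d)
```

The bw argument of density controls the amount of smoothing if the default looks too rough or too smooth.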
You can't really find "the" distribution; the actual distribution from which the data are drawn can nearly always* be guaranteed not to be in any "laundry list" provided by such software. At best you can find "a" distribution (more likely several) that is an adequate description. Even if you find a great fit, there is always an infinity of distributions arbitrarily close to it. Real data tend to be drawn from heterogeneous mixtures of distributions that themselves don't necessarily have a simple functional form.
* An exception is when you know the data were actually generated from exactly one distribution on the list, but such situations are extremely rare.
I don't think just comparing likelihoods is necessarily going to make sense, since some distributions have more parameters than others. AIC might make more sense, except that ...
Attempting to identify a "best fitting" distribution from a list of candidates will tend to produce overfitting and, unless the effect of such model selection is accounted for properly, will lead to overconfidence (a model that looks great but doesn't actually fit data outside your sample). There are such possibilities in R (the package fitdistrplus comes to mind), but as a common practice I would advise against the idea. If you must do it, use holdout samples or cross-validation to obtain models with better generalization error.
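A rough sketch of what cross-validation could look like here: fit each candidate on the training folds with MASS::fitdistr and score it by the log-likelihood of the held-out fold. The simulated x, the choice of three candidates and the names fold, heldout_loglik and cv_loglik are all assumptions made for illustration:

```r
library(MASS)

# Stand-in data (assumption): replace with your own positive-valued sample
set.seed(1)
x <- rgamma(200, shape = 10, rate = 1)

# Randomly assign each observation to one of k folds
k    <- 5
fold <- sample(rep(seq_len(k), length.out = length(x)))

# For each candidate: fit on the training part, score the held-out part
heldout_loglik <- list(
  normal = function(train, test) {
    est <- fitdistr(train, "normal")$estimate
    sum(dnorm(test, mean = est[["mean"]], sd = est[["sd"]], log = TRUE))
  },
  lognormal = function(train, test) {
    est <- fitdistr(train, "lognormal")$estimate
    sum(dlnorm(test, meanlog = est[["meanlog"]],
               sdlog = est[["sdlog"]], log = TRUE))
  },
  exponential = function(train, test) {
    est <- fitdistr(train, "exponential")$estimate
    sum(dexp(test, rate = est[["rate"]], log = TRUE))
  }
)

# Total held-out log-likelihood over all folds; larger is better
cv_loglik <- sapply(heldout_loglik, function(score) {
  sum(sapply(seq_len(k), function(i) score(x[fold != i], x[fold == i])))
})
sort(cv_loglik, decreasing = TRUE)
```

Because every observation is scored by a model that never saw it, extra parameters only help if they improve the fit on new data.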