Given a set of random numbers drawn from a continuous univariate distribution, find the distribution

情书的邮戳 2020-12-04 08:27

Given a set of real numbers drawn from an unknown continuous univariate distribution (let's say it is one of beta, Cauchy, chi-square, exponential, F, gamma, Laplace, log-normal, …), how do I find the distribution?

6 Answers
  • 2020-12-04 09:08

    My first approach would be to generate qq plots of the given data against the possible distributions.

    x <- c(15.771062,14.741310,9.081269,11.276436,11.534672,17.980860,13.550017,13.853336,11.262280,11.049087,14.752701,4.481159,11.680758,11.451909,10.001488,11.106817,7.999088,10.591574,8.141551,12.401899,11.215275,13.358770,8.388508,11.875838,3.137448,8.675275,17.381322,12.362328,10.987731,7.600881,14.360674,5.443649,16.024247,11.247233,9.549301,9.709091,13.642511,10.892652,11.760685,11.717966,11.373979,10.543105,10.230631,9.918293,10.565087,8.891209,10.021141,9.152660,10.384917,8.739189,5.554605,8.575793,12.016232,10.862214,4.938752,14.046626,5.279255,11.907347,8.621476,7.933702,10.799049,8.567466,9.914821,7.483575,11.098477,8.033768,10.954300,8.031797,14.288100,9.813787,5.883826,7.829455,9.462013,9.176897,10.153627,4.922607,6.818439,9.480758,8.166601,12.017158,13.279630,14.464876,13.319124,12.331335,3.194438,9.866487,11.337083,8.958164,8.241395,4.289313,5.508243,4.737891,7.577698,9.626720,16.558392,10.309173,11.740863,8.761573,7.099866,10.032640)
    > qqnorm(x)
    

    For more info see link
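    Since only qqnorm is shown above, here is a minimal sketch of the same idea for an arbitrary candidate distribution (gamma is used purely as an illustration, and the x below is a stand-in for the data in the question):

    ```r
    library(MASS)                           # fitdistr (MASS ships with R)
    set.seed(1)
    x <- rgamma(100, shape = 9, rate = 1)   # stand-in for the question's data
    fit <- fitdistr(x, "gamma")             # ML estimates of shape and rate
    theo <- qgamma(ppoints(length(x)),      # theoretical quantiles at the
                   shape = fit$estimate["shape"],  # usual plotting positions
                   rate  = fit$estimate["rate"])
    qqplot(theo, x, xlab = "Theoretical gamma quantiles",
           ylab = "Sample quantiles")
    abline(0, 1)                            # points near this line => plausible fit
    ```

    The same three lines (fit, compute theoretical quantiles, qqplot) work for any distribution with a q-function.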

    Another possibility is based on the fitdistr function in the MASS package. Here are the candidate distributions, ordered by their log-likelihood:

    > library(MASS)
    > fitdistr(x, 't')$loglik
    [1] -252.2659
    Warning message:
    In log(s) : NaNs produced
    > fitdistr(x, 'normal')$loglik
    [1] -252.2968
    > fitdistr(x, 'logistic')$loglik
    [1] -252.2996
    > fitdistr(x, 'weibull')$loglik
    [1] -252.3507
    > fitdistr(x, 'gamma')$loglik
    [1] -255.9099
    > fitdistr(x, 'lognormal')$loglik
    [1] -260.6328
    > fitdistr(x, 'exponential')$loglik
    [1] -331.8191
    Warning messages:
    1: In dgamma(x, shape, scale, log) : NaNs produced
    2: In dgamma(x, shape, scale, log) : NaNs produced
    
  • 2020-12-04 09:12

    You could try using the Kolmogorov-Smirnov tests (ks.test in R).

    If you have time-to-event data, here's software that does a Bayesian chi squared test against a list of common distributions to report the best fit.
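    A minimal sketch of the one-sample test (x is a stand-in for the data in the question; note that estimating the parameters from the same data makes the p-value optimistic, as the ks.test documentation warns):

    ```r
    set.seed(1)
    x <- rnorm(100, mean = 10, sd = 3)   # stand-in for the question's data
    # One-sample K-S test of x against a normal with parameters estimated from x
    ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
    ```

    For a properly calibrated p-value you would need parameters fixed in advance, or a correction such as a parametric bootstrap.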

  • 2020-12-04 09:16

    Another similar approach is using the fitdistrplus package

    library(fitdistrplus)
    

    Loop through the distributions of interest and generate 'fitdist' objects. Use either "mle" for maximum likelihood estimation or "mme" for matching moment estimation, as the fitting method.

    f1<-fitdist(x,"norm",method="mle")
    

    Use bootstrap re-sampling to simulate uncertainty in the parameters of the selected model (here f_best stands for the fitdist object of whichever distribution you selected, e.g. f_best <- f1):

    b_best<-bootdist(f_best)
    print(f_best)
    plot(f_best)
    summary(f_best)
    

    The fitdist method also accepts custom distributions, or distributions from other packages, provided that the corresponding density function dname, distribution function pname, and quantile function qname have been defined (or even just the density function).

    So if you wanted to test the log-likelihood for the inverse normal distribution:

    library(ig)
    fitdist(x,"igt",method="mle",start=list(mu=mean(x),lambda=1))$loglik
    

    You may also find Fitting distributions with R helpful.

  • 2020-12-04 09:24

    As others have pointed out, this might be framed as a model-selection question. Simply using the distribution that fits the data best, without taking the complexity of the distribution into account, is the wrong approach: a more complicated distribution will generally fit better, but it is also more likely to overfit the data.

    You can use the Akaike Information Criterion (AIC) to take the complexity of the distribution into account. This is still unsatisfactory, as you're only considering a limited number of distributions, but it is better than using the log-likelihood alone.

    I use just a few distributions here, but you can check the documentation to find others that could be relevant.

    Using the fitdistrplus you can run:

    library(fitdistrplus)
    
    distributions = c("norm", "lnorm", "exp",
              "cauchy", "gamma", "logis",
              "weibull")
    
    
    # the x vector is defined as in the question
    
    # Plot to see which distributions make sense. This should influence
    # your choice of candidate distributions
    descdist(x, discrete = FALSE, boot = 500)
    
    distr_aic = list()
    distr_fit = list()
    for (distribution in distributions) {
        distr_fit[[distribution]] = fitdist(x, distribution)
        distr_aic[[distribution]] = distr_fit[[distribution]]$aic
    }
    
    > distr_aic
    $norm
    [1] 5032.269
    
    $lnorm
    [1] 5421.815
    
    $exp
    [1] 6602.334
    
    $cauchy
    [1] 5382.643
    
    $gamma
    [1] 5184.17
    
    $logis
    [1] 5047.796
    
    $weibull
    [1] 5058.336
    

    According to our plot and the AIC, it makes sense to use a normal. You can automate this by simply picking the distribution with the minimum AIC. You can check the estimated parameters with

    > distr_fit[['norm']]
    Fitting of the distribution ' norm ' by maximum likelihood 
    Parameters:
         estimate Std. Error
    mean 9.975849 0.09454476
    sd   2.989768 0.06685321
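    Picking the minimum-AIC candidate mentioned above can be automated in one line (AIC values copied from the run above):

    ```r
    distr_aic <- list(norm = 5032.269, lnorm = 5421.815, exp = 6602.334,
                      cauchy = 5382.643, gamma = 5184.17, logis = 5047.796,
                      weibull = 5058.336)
    best <- names(which.min(unlist(distr_aic)))
    best   # "norm"
    ```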
    
  • 2020-12-04 09:30

    I find it hard to imagine a realistic situation where this would be useful. Why not use a non-parametric tool like a kernel density estimate?

  • 2020-12-04 09:31

    (Answer edited to add additional explanation)

    1. You can't really find "the" distribution; the actual distribution from which data are drawn can nearly always* be guaranteed not to be in any "laundry list" provided by any such software. At best you can find "a" distribution (more likely several), one that is an adequate description. Even if you find a great fit there are always an infinity of distributions that are arbitrarily close by. Real data tends to be drawn from heterogeneous mixtures of distributions that themselves don't necessarily have simple functional form.

      * an example where you might hope to is where you know the data were actually generated from exactly one distribution on a list, but such situations are extremely rare.

    2. I don't think just comparing likelihoods is necessarily going to make sense, since some distributions have more parameters than others. AIC might make more sense, except that ...

    3. Attempting to identify a "best fitting" distribution from a list of candidates will tend to produce overfitting, and unless the effect of such model selection is accounted for properly it will lead to overconfidence (a model that looks great in-sample but fails to fit data outside your sample). There are tools for this in R (the package fitdistrplus comes to mind), but as a common practice I would advise against the idea. If you must do it, use holdout samples or cross-validation to obtain models with better generalization error.
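    One sketch of the holdout idea (the distribution names and x are illustrative; each candidate is fit on a training split with MASS::fitdistr, then the log-likelihood is evaluated on the held-out half):

    ```r
    library(MASS)
    set.seed(1)
    x <- rnorm(200, mean = 10, sd = 3)     # stand-in data
    train <- x[1:100]; held <- x[101:200]  # simple 50/50 split
    fit_norm  <- fitdistr(train, "normal")
    fit_logis <- fitdistr(train, "logistic")
    heldout_ll <- c(
      normal   = sum(dnorm(held, fit_norm$estimate["mean"],
                           fit_norm$estimate["sd"], log = TRUE)),
      logistic = sum(dlogis(held, fit_logis$estimate["location"],
                            fit_logis$estimate["scale"], log = TRUE)))
    heldout_ll                             # higher held-out log-likelihood wins
    ```

    Because the held-out half never influenced the parameter estimates, this comparison penalizes overfitting in a way that raw in-sample log-likelihood does not.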
