I have searched around and to my surprise it seems that this question has not been answered.
I have a Numpy array containing 10000 values from measurements. I have plott
Testing if a large sample coming from measurements fits a given distribution is usually tricky, because any departure from the distribution will be identified by the test as an outlier, and make the test reject the distribution.
This is why I generally use the QQ-plot for this purpose. This is a graphical tool where the X-axis plots the quantiles of the data and the Y-axis plots the quantiles of the fitted distribution. The graphical analysis lets you focus on the part of the distribution that matters for the specific study: central dispersion, lower tail or upper tail.
To do this, I use the DrawQQplot function.
import openturns as ot
import numpy as np
# s is the NumPy array of measurements from the question
sample = ot.Sample(s, 1)
tested_distribution = ot.NormalFactory().build(sample)
QQ_plot = ot.VisualTest.DrawQQplot(sample, tested_distribution)
This produces the following plot.
The QQ-plot validates the distribution if the points are on the test line. In the current situation, the fit is excellent, although we notice that the extreme quantiles of the data do not fit so well (as we might expect, given the low probability density of these events).
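For readers without OpenTURNS, the coordinates behind such a QQ-plot can be sketched by hand with NumPy and SciPy. This is only an illustration, not what DrawQQplot does internally: the helper name `qq_points`, the plotting positions (i - 0.5)/n and the synthetic sample `s` are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

def qq_points(data, dist):
    """Return (data quantiles, model quantiles) for a QQ-plot.

    Hypothetical helper: uses the plotting positions (i - 0.5) / n,
    which is one common convention among several.
    """
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    probs = (np.arange(1, n + 1) - 0.5) / n
    return x, dist.ppf(probs)

# Synthetic stand-in for the measurements (assumption, not the real data)
rng = np.random.default_rng(42)
s = rng.normal(loc=0.0, scale=0.1, size=10_000)

# Fit a normal distribution by matching the mean and standard deviation
fitted = stats.norm(loc=s.mean(), scale=s.std(ddof=1))
data_q, model_q = qq_points(s, fitted)

# Points close to the line y = x indicate a good fit.
# The tails usually deviate more than the central part.
central = slice(500, 9_500)
max_dev_central = np.max(np.abs(data_q[central] - model_q[central]))
max_dev_all = np.max(np.abs(data_q - model_q))
print(f"max deviation, central 90%: {max_dev_central:.4f}")
print(f"max deviation, all points:  {max_dev_all:.4f}")
```

Plotting `data_q` against `model_q` together with the diagonal gives the same kind of picture as above, and restricting the comparison to a slice of the sorted points is one way to concentrate on the central part or on one tail.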
Just to see what happens, I tried the BetaFactory, which is obviously a wrong choice here!
tested_distribution = ot.BetaFactory().build(sample)
QQ_plot = ot.VisualTest.DrawQQplot(sample, tested_distribution)
This produces:
The QQ-plot is now clear: the fit is acceptable in the central area, but cannot be accepted for quantiles lower than -0.2 or greater than 0.2. Notice that the Beta distribution, with its 4 parameters, is flexible enough to do a good job of fitting the data in the [-0.2, 0.2] interval.
With a large sample size, I would rather use a KernelSmoothing than a histogram. This is more accurate, i.e. closer to the true, unknown PDF (in terms of AMISE error, the kernel smoothing can reach 1/n^{4/5} instead of 1/n^{2/3} for the histogram), and it yields a continuous distribution (your distribution seems continuous). If the sample is really large, binning can be activated, which reduces the CPU cost.
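As a rough illustration of kernel smoothing, here is a sketch that swaps in SciPy's `gaussian_kde` for the OpenTURNS KernelSmoothing class, so it is not the same implementation and has no binning option. The synthetic sample `s` and the grid size are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the measurements (assumption, not the real data)
rng = np.random.default_rng(0)
s = rng.normal(loc=0.0, scale=0.1, size=10_000)

# Gaussian kernel density estimate; the bandwidth defaults to Scott's rule
kde = stats.gaussian_kde(s)

# Evaluate the smooth estimate and a fitted normal pdf on the same grid
grid = np.linspace(s.min(), s.max(), 400)
kde_pdf = kde(grid)
normal_pdf = stats.norm.pdf(grid, loc=s.mean(), scale=s.std(ddof=1))

# Largest gap between the two curves, relative to the height of the normal pdf
relative_gap = np.max(np.abs(kde_pdf - normal_pdf)) / normal_pdf.max()

# Probability mass of the estimate over the observed range (close to 1)
area = kde.integrate_box_1d(s.min(), s.max())
print(f"relative gap: {relative_gap:.3f}, mass on data range: {area:.4f}")
```

Unlike a histogram, the estimate is a continuous curve, so it can be compared with a candidate pdf at any point without choosing bin edges.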
Assuming you have used the test correctly, my guess is that you have a small deviation from a normal distribution and because your sample size is so large, even small deviations will lead to a rejection of the null hypothesis of a normal distribution.
One possibility is to visually inspect your data by plotting a normed histogram with a large number of bins and the pdf with loc=data.mean() and scale=data.std().
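A minimal sketch of that visual check, assuming NumPy, SciPy and (for the plot only) matplotlib. The function names and the synthetic `data` array are hypothetical, chosen for the example.

```python
import numpy as np
from scipy import stats

def histogram_vs_normal_pdf(data, bins=100):
    """Normed histogram heights and the fitted normal pdf at the bin centers."""
    data = np.asarray(data, dtype=float)
    density, edges = np.histogram(data, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    pdf = stats.norm.pdf(centers, loc=data.mean(), scale=data.std())
    return centers, density, pdf

def plot_histogram_vs_normal_pdf(data, bins=100):
    """Draw the comparison; matplotlib is imported lazily since it is optional."""
    import matplotlib.pyplot as plt
    centers, density, pdf = histogram_vs_normal_pdf(data, bins)
    width = centers[1] - centers[0]
    plt.bar(centers, density, width=width, alpha=0.5, label="normed histogram")
    plt.plot(centers, pdf, "r-", label="normal pdf")
    plt.legend()
    plt.show()

# Synthetic stand-in for the measurements (assumption, not the real data)
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=0.1, size=10_000)
centers, density, pdf = histogram_vs_normal_pdf(data)
```

Calling `plot_histogram_vs_normal_pdf(data)` shows the bars and the curve together; systematic gaps, for example in the tails or around the peak, are easier to judge this way than from a single p-value.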
There are alternative tests for normality: statsmodels has Anderson-Darling and Lilliefors (Kolmogorov-Smirnov) tests for the case where the distribution parameters are estimated.
However, I expect that the results will not differ much given the large sample size.
The main question is whether you want to test whether your sample comes "exactly" from a normal distribution, or whether you are just interested in whether your sample comes from a distribution that is very close to the normal distribution, close in terms of practical usage.
To elaborate on the last point:
http://jpktd.blogspot.ca/2012/10/tost-statistically-significant.html
http://www.graphpad.com/guides/prism/6/statistics/index.htm?testing_for_equivalence2.htm
As the sample size increases, a hypothesis test gains power, which means that the test will be able to reject the null hypothesis of equality even for smaller and smaller differences. If we keep our significance level fixed, then eventually we will reject tiny differences that we don't really care about.
An alternative type of hypothesis test is one where we want to show that our sample is close to the given point hypothesis, for example that two samples have almost the same mean. The problem is that we have to define what our equivalence region is.
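The effect of sample size on power can be seen in a small simulation. This is only a sketch under assumptions of my choosing: the data are drawn from a Student t distribution with 10 degrees of freedom (symmetric, with slightly heavier tails than the normal), and stats.normaltest is applied at several sample sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Nearly normal population: Student t with 10 degrees of freedom (assumption)
population = stats.t(df=10)

p_values = {}
for n in (100, 1_000, 10_000, 100_000):
    x = population.rvs(size=n, random_state=rng)
    statistic, p_value = stats.normaltest(x)
    p_values[n] = p_value
    print(f"n = {n:>7}: statistic = {statistic:9.2f}, p-value = {p_value:.3g}")
```

The deviation from normality is the same at every sample size; only the amount of data changes. For the largest samples the test rejects decisively, even though a histogram of such data is hard to tell apart from a normal one by eye.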
In the case of goodness of fit tests we need to choose a distance measure and define a threshold for the distance measure between the sample and the hypothesized distribution. I have not found any explanation where intuition would help to choose this distance threshold.
stats.normaltest is based on deviations of skew and kurtosis from those of the normal distribution.
Anderson-Darling is based on an integral of the weighted squared differences between the cdfs.
Kolmogorov-Smirnov is based on the maximum absolute difference between the cdfs.
chisquare for binned data would be based on the weighted sum of squared differences between the observed and expected bin probabilities.
and so on.
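To make two of these distance measures concrete, here is a sketch that computes the Kolmogorov-Smirnov distance by hand and checks it against scipy.stats.kstest, alongside the sample skew and excess kurtosis that stats.normaltest builds on. The helper name `ks_distance` and the synthetic data are assumptions made for the example.

```python
import numpy as np
from scipy import stats

def ks_distance(data, cdf):
    """Maximum absolute difference between the empirical cdf and a model cdf."""
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    f = cdf(x)
    # The empirical cdf jumps at each point, so check both sides of the jump
    d_plus = np.max(np.arange(1, n + 1) / n - f)
    d_minus = np.max(f - np.arange(0, n) / n)
    return max(d_plus, d_minus)

# Synthetic stand-in for the measurements (assumption, not the real data)
rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=0.1, size=10_000)
fitted = stats.norm(loc=data.mean(), scale=data.std(ddof=1))

d_manual = ks_distance(data, fitted.cdf)
d_scipy = stats.kstest(data, fitted.cdf).statistic

# Quantities behind stats.normaltest: both are 0 for an exact normal
skew = stats.skew(data)
excess_kurtosis = stats.kurtosis(data)

print(f"KS distance: {d_manual:.5f} (scipy: {d_scipy:.5f})")
print(f"skew: {skew:.4f}, excess kurtosis: {excess_kurtosis:.4f}")
```

Note that the kstest p-value is not valid here because the parameters were estimated from the same data, which is the situation the Lilliefors variant addresses. Only the distance itself is used in this sketch, and deciding how large a distance is still acceptable remains the open question.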
I only ever tried equivalence testing with binned or discretized data, where I used a threshold from some reference cases which was still rather arbitrary.
In medical equivalence testing there are some predefined standards that specify when two treatments can be considered equivalent, or similarly inferior or superior in the one-sided version.