I'm trying to transform my dataset to a normal distribution.
0 8.298511e-03
1 3.055319e-01
2 6.938647e-02
3 2.904091e-02
4 7.422441e-0
Is the data you are passing to boxcox a 1-dimensional ndarray?
A second option is to use a shift parameter: add shift
(see the link for details) to every element of the ndarray
before passing it to boxcox, then subtract shift
from the elements of the resulting array. (If I have understood the boxcox
algorithm correctly, that could be a solution in your case, too.)
https://docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.stats.boxcox.html
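The shift idea above could be sketched roughly as follows. This is only an illustration under assumptions: the sample data, the variable names, and the value of shift are made up, and a suitable shift depends on the scale of your own data. The sketch covers only the step of shifting before the transform.

```python
import numpy as np
from scipy import stats

# Hypothetical sample data with a single zero, which boxcox would reject.
rng = np.random.default_rng(0)
data = rng.gamma(1.0, 1.0, size=200)
data[10] = 0.0

# Hypothetical shift constant; pick one that suits the scale of your data.
shift = 1e-3

# After shifting, every value is strictly positive, so boxcox accepts it.
transformed, lam = stats.boxcox(data + shift)
```

Note that the estimated λ, and so the transformed values, depend on the shift you choose.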
Rather than the regular boxcox, you can use boxcox1p. It computes the Box-Cox transformation of 1 + x, so a 0 in the data is no longer a problem.
from scipy.special import boxcox1p
boxcox1p(x, lmbda)
For more info check out the docs at https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.boxcox1p.html
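As a small illustration of the boxcox1p suggestion, the sketch below applies it to a made-up array that contains a 0. The array and the value of lmbda are assumptions chosen for the example; unlike scipy.stats.boxcox, boxcox1p does not estimate λ, so you have to supply it.

```python
import numpy as np
from scipy.special import boxcox1p

# Hypothetical data that includes a zero.
x = np.array([0.0, 0.5, 1.0, 4.0])

# Hypothetical fixed parameter; boxcox1p does not estimate it for you.
lmbda = 0.5

# For lmbda != 0 this computes ((1 + x)**lmbda - 1) / lmbda, so 0 maps to 0.
y = boxcox1p(x, lmbda)
```

If you also want λ estimated from the data, one option is to call scipy.stats.boxcox(x + 1), which returns both the transformed values and the fitted λ.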
Your data contains the value 0 (at index 134). When boxcox
says the data must be positive, it means strictly positive.
What is the meaning of your data? Does 0 make sense? Is that 0 actually a very small number that was rounded down to 0?
You could simply discard that 0. Alternatively, you could do something like the following. (This amounts to temporarily discarding the 0, and then using -1/λ for the transformed value of 0, where λ is the Box-Cox transformation parameter.)
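First, create some data that contains one 0 (all other values are positive). The session assumes that import numpy as np and from scipy.stats import boxcox have already been run: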
In [13]: np.random.seed(8675309)
In [14]: data = np.random.gamma(1, 1, size=405)
In [15]: data[100] = 0
(In your code, you would replace that with, say, data = df.values.)
Copy the strictly positive data to posdata:
In [16]: posdata = data[data > 0]
Find the optimal Box-Cox transformation, and verify that λ is positive. This work-around doesn't work if λ ≤ 0.
In [17]: bcdata, lam = boxcox(posdata)
In [18]: lam
Out[18]: 0.244049919975582
Make a new array to hold that result, along with the limiting value of the transform of 0 (which is -1/λ):
In [19]: x = np.empty_like(data)
In [20]: x[data > 0] = bcdata
In [21]: x[data == 0] = -1/lam
The following plot shows the histograms of data and x.
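The steps above can be collected into one helper. This is a sketch rather than a definitive implementation: the function name boxcox_with_zeros is hypothetical, and raising an error when λ ≤ 0 is one possible way to handle the case where the work-around does not apply.

```python
import numpy as np
from scipy.stats import boxcox


def boxcox_with_zeros(data):
    """Box-Cox transform that maps zeros to the limiting value -1/lambda.

    Hypothetical helper; only valid when the fitted lambda is positive.
    """
    data = np.asarray(data, dtype=float)
    if np.any(data < 0):
        raise ValueError("data must be non-negative")

    pos = data > 0
    # Fit lambda on the strictly positive values only.
    bcdata, lam = boxcox(data[pos])
    if lam <= 0:
        raise ValueError("work-around requires lambda > 0")

    out = np.empty_like(data)
    out[pos] = bcdata
    out[~pos] = -1.0 / lam  # limit of (x**lam - 1)/lam as x -> 0
    return out, lam


# Same example data as in the session above.
np.random.seed(8675309)
data = np.random.gamma(1, 1, size=405)
data[100] = 0
x, lam = boxcox_with_zeros(data)
```

With a positive λ, the value assigned to the zero is the smallest value in x, because (x**λ - 1)/λ is greater than -1/λ for every x > 0.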