Question
I am trying to standardize some data so that I can apply PCA to it. I am using sklearn.preprocessing.StandardScaler. I am having trouble understanding the difference between setting the parameters "with_mean" and "with_std" to "True" or "False". Here is the documentation for the class:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
Can someone give a more extended explanation?
Thank you very much!
Answer 1:
I have provided more details in this post https://stackoverflow.com/a/50879522/5025009, but let me just explain this here as well.
The standardization of the data (each column/feature/variable individually) involves the following equation:
z = (x - μ) / σ
where μ is the mean and σ is the standard deviation of the column.
Explanation:
If you set with_mean and with_std to False, then the mean μ is set to 0 and the std σ to 1, so the data is left unchanged. This only makes sense if the columns/features already look like they come from a standard normal (Gaussian) distribution, which has 0 mean and 1 std.
If you set with_mean and with_std to True, then you will actually use the true μ and σ of your data. This is the most common approach.
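To make the four combinations concrete, here is a small sketch on a made-up 3x2 array (the data and the `results` dictionary are just illustrative names of mine, not from the documentation). It assumes scikit-learn and NumPy are installed and relies on StandardScaler using the population standard deviation (ddof=0), which is what the scikit-learn docs describe.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two columns with different means and spreads (illustrative only).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

results = {}
for with_mean in (True, False):
    for with_std in (True, False):
        scaler = StandardScaler(with_mean=with_mean, with_std=with_std)
        # fit_transform learns the column statistics and applies them in one step.
        results[(with_mean, with_std)] = scaler.fit_transform(X)
        print(f"with_mean={with_mean}, with_std={with_std}")
        print(results[(with_mean, with_std)])
```

With both flags False the output should equal the input, with only with_mean=True the columns are centered but keep their original spread, and with only with_std=True the columns are divided by their standard deviation without being shifted.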
Answer 2:
A standard scaler is usually used to fit a normal distribution to the data and then calculate the Z-scores. This means that the mean μ and standard deviation σ of the data are calculated first, and then the Z-scores are calculated with z = (x - μ) / σ.
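As a rough check of the formula above, the Z-scores can be computed by hand and compared with what StandardScaler returns with its default settings. This is only a sketch: the array is made up, and `z_manual` / `z_sklearn` are names I picked for illustration. Note that np.std uses ddof=0 by default, which matches the biased estimator that the scikit-learn docs say StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 4 samples, 2 features.
X = np.array([[2.0, 100.0],
              [4.0, 200.0],
              [6.0, 300.0],
              [8.0, 400.0]])

# Manual Z-score, column by column: z = (x - mu) / sigma.
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # population standard deviation (ddof=0)
z_manual = (X - mu) / sigma

# Same computation through StandardScaler (with_mean=True, with_std=True by default).
scaler = StandardScaler()
z_sklearn = scaler.fit_transform(X)

print(np.allclose(z_manual, z_sklearn))
print(scaler.mean_, scaler.scale_)  # the fitted mu and sigma per column
```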
By setting with_mean or with_std to False, we respectively set the mean μ to 0 and the standard deviation σ to 1. If both are set to False, we thus calculate the Z-score with respect to a standard normal distribution, which leaves the data unchanged.
The main use case of setting with_mean to False is processing sparse matrices. Sparse matrices contain a large proportion of zeros, and are therefore stored in a way that the zeros use no (or very little) memory. If we fitted the mean and then calculated the z-score, almost all zeros would be mapped to non-zero values and would therefore use (significant amounts of) memory. For large sparse matrices, that can result in a memory error: the data becomes so large that the matrix no longer fits in memory. By setting μ=0, values that are zero are mapped to zero. The result of the standard scaler is then a sparse matrix with the same shape.
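The sparse case can be sketched as follows. The matrix is a tiny made-up example (real sparse data would be far larger), and `X_sparse`, `Z`, and `centering_error` are names I chose for illustration. It assumes SciPy is installed alongside scikit-learn. As far as I know, scikit-learn does not silently densify sparse input when centering is requested; it raises a ValueError instead, which the last part of the sketch captures.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Tiny made-up sparse matrix in CSR format (mostly zeros).
X_sparse = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0],
                                       [0.0, 0.0, 3.0],
                                       [4.0, 0.0, 0.0],
                                       [0.0, 6.0, 0.0]]))

# Scale by the standard deviation only; zeros stay zeros.
scaler = StandardScaler(with_mean=False)
Z = scaler.fit_transform(X_sparse)
print(sparse.issparse(Z), Z.shape, Z.nnz)

# Asking for centering on sparse input is rejected rather than densified.
try:
    StandardScaler(with_mean=True).fit(X_sparse)
    centering_error = None
except ValueError as exc:
    centering_error = str(exc)
print(centering_error)
```

The scaled output keeps the same shape and the same number of stored non-zero entries as the input, which is the point of skipping the centering step.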
Source: https://stackoverflow.com/questions/57349987/sklearn-standardscaler-differece-between-with-std-false-or-true-and-with-mean