sklearn StandardScaler difference between “with_std=False or True” and “with_mean=False or True”

Question

I am trying to standardize some data so that I can apply PCA to it. I am using sklearn.preprocessing.StandardScaler. I am having trouble understanding the difference between using "True" or "False" for the parameters "with_mean" and "with_std". Here is the documentation for the class:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Can someone give a more extended explanation?

Thank you very much!

Answer 1:

I have provided more details in this post https://stackoverflow.com/a/50879522/5025009, but let me just explain this here as well.

The standardization of the data (each column/feature/variable individually) involves the following equations:

μ = (1/N) Σ x_i

σ = sqrt((1/N) Σ (x_i - μ)²)

z = (x - μ) / σ

Explanation:

If you set with_mean and with_std to False, then the mean μ is set to 0 and the std σ to 1, as if the columns/features came from the standard normal (Gaussian) distribution (which has 0 mean and 1 std). In that case z = (x - 0) / 1 = x, so the data is returned unchanged.

If you set with_mean and with_std to True, then you will actually use the true μ and σ of your data. This is the most common approach.
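As a quick check of the two cases above, here is a minimal sketch, assuming a small made-up array `X` (not from the original question). It compares StandardScaler with both flags True against the z-score computed by hand, and confirms that both flags False leave the data as it is.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: 4 samples, 2 features (values are made up for illustration).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Both True (the defaults): subtract the column mean, divide by the column std.
scaled = StandardScaler(with_mean=True, with_std=True).fit_transform(X)

# The same computation by hand. np.std defaults to ddof=0 (population std),
# which matches what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Both False: no centering and no scaling, so the data comes back unchanged.
untouched = StandardScaler(with_mean=False, with_std=False).fit_transform(X)

print(np.allclose(scaled, manual))
print(np.allclose(untouched, X))
```

Both comparisons should print True. Note that the hand-written version would divide by zero for a constant column, while StandardScaler handles that case by leaving such a column unscaled.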

Answer 2:

A standard scaler is usually used to fit a normal distribution to the data, and then calculate the Z-scores. This means that the mean μ and standard deviation σ of the data are calculated first, and then the Z-scores are calculated with z = (x - μ) / σ.

Setting with_mean to False sets the mean μ to 0 (no centering), and setting with_std to False sets the standard deviation σ to 1 (no scaling). If both are set to False, we thus calculate the Z-score with respect to a standard normal distribution, which leaves the values unchanged.
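To see what each flag does on its own, the following sketch (again on made-up values, with a single feature for readability) switches off one flag at a time and compares the result with the corresponding manual computation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature, four samples (made-up values).
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# with_mean=False: skip centering, still divide by the standard deviation.
scale_only = StandardScaler(with_mean=False).fit_transform(X)

# with_std=False: subtract the mean, skip the division.
center_only = StandardScaler(with_std=False).fit_transform(X)

# Scaled but not centered: same as x / σ.
print(np.allclose(scale_only, X / X.std(axis=0)))

# Centered but not scaled: same as x - μ.
print(np.allclose(center_only, X - X.mean(axis=0)))
```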

The main use case of setting with_mean to False is processing sparse matrices. Sparse matrices contain a large proportion of zeros, and are therefore stored in a way that the zeros use no (or very little) memory. If we fit the mean and then calculate the z-score, it is almost certain that all zeros will be mapped to non-zero values, which then take up (significant amounts of) memory. For large sparse matrices, that can result in a memory error: the data is so large that the matrix no longer fits in memory. With μ=0, values that are zero are mapped to zero, so the result of the standard scaler is a sparse matrix with the same shape.
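The sparse case can be sketched as follows, assuming a tiny made-up CSR matrix from `scipy.sparse` in place of a genuinely large one. With with_mean=False the output stays sparse and keeps the same number of stored non-zero entries. Recent scikit-learn versions also refuse to centre sparse input at all and raise a ValueError, which the last part of the sketch demonstrates.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# A small, mostly-zero matrix in CSR format (values are made up).
X = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0],
                                [0.0, 0.0, 3.0],
                                [4.0, 0.0, 0.0],
                                [0.0, 6.0, 0.0]]))

# with_mean=False only rescales each column, so zeros stay zero.
scaled = StandardScaler(with_mean=False).fit_transform(X)

print(sparse.issparse(scaled))   # the result is still a sparse matrix
print(scaled.shape == X.shape)   # same shape as the input
print(scaled.nnz == X.nnz)       # same number of stored non-zero entries

# Centering would turn the zeros into non-zero values, so scikit-learn
# rejects with_mean=True on sparse input instead of densifying the data.
try:
    StandardScaler(with_mean=True).fit(X)
except ValueError as err:
    print("ValueError:", err)
```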

Source: https://stackoverflow.com/questions/57349987/sklearn-standardscaler-differece-between-with-std-false-or-true-and-with-mean

Tags

python

scikit-learn

pca

decomposition