sklearn StandardScaler result differs from manual result

囚心锁ツ 2021-01-12 17:44

I used the sklearn StandardScaler (mean removal and variance scaling) to scale a dataframe and compared it to a dataframe where I "manually" subtracted the mean and divide

1 answer
  • 2021-01-12 18:13

    scikit-learn's StandardScaler scales by the population standard deviation, as np.std does by default: the sum of squared deviations is divided by the number of observations N. pandas computes the sample standard deviation, where the denominator is N - 1 (see Wikipedia's standard deviation article). The N - 1 denominator is Bessel's correction, which makes the variance estimate unbiased, and it is controlled by the delta degrees of freedom argument (ddof). So by default, NumPy's and scikit-learn's calculations use ddof=0, whereas pandas uses ddof=1 (docs).

    DataFrame.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)

    Return sample standard deviation over requested axis.

    Normalized by N-1 by default. This can be changed using the ddof argument.
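To make the ddof difference concrete, here is a minimal sketch on a small made-up list of numbers (the data and the variable names are illustrative, not taken from the question). It only shows that NumPy and pandas disagree by default and agree once ddof is set explicitly.

```python
import numpy as np
import pandas as pd

# Illustrative values only; they are not from the question's dataframe.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
series = pd.Series(values)

population_std = np.std(values)   # NumPy default, ddof=0: divide by N
sample_std = series.std()         # pandas default, ddof=1: divide by N - 1

print("np.std default:     ", population_std)
print("Series.std default: ", sample_std)

# Passing ddof explicitly makes the two libraries agree in either direction.
print(np.isclose(series.std(ddof=0), population_std))
print(np.isclose(np.std(values, ddof=1), sample_std))
```

The gap between the two shrinks as the number of observations grows, because the ratio of the two estimates is sqrt(N / (N - 1)).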

    If you change your pandas-based calculation to use ddof=0:

    df_standardized_manual = (df - df.mean()) / df.std(ddof=0)
    

    The differences will be practically zero:

            Alcohol    Malic acid           Ash  Alcalinity of ash     Magnesium
    0 -8.215650e-15 -5.551115e-16  3.191891e-15       0.000000e+00  2.220446e-16
    1 -8.715251e-15 -4.996004e-16  3.441691e-15       0.000000e+00  0.000000e+00
    2 -8.715251e-15 -3.955170e-16  2.886580e-15      -5.551115e-17  1.387779e-17
    3 -8.437695e-15 -4.440892e-16  3.164136e-15      -1.110223e-16  1.110223e-16
    4 -8.659740e-15 -3.330669e-16  2.886580e-15       5.551115e-17  2.220446e-16
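The comparison can be reproduced end to end with something like the sketch below. It assumes synthetic random data in place of the question's dataframe (which is not shown in full here), and the column names "a", "b", "c" are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's dataframe (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(loc=10.0, scale=3.0, size=(50, 3)),
    columns=["a", "b", "c"],
)

# StandardScaler returns a NumPy array, so wrap it back into a DataFrame.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df),
    columns=df.columns,
    index=df.index,
)

manual_population = (df - df.mean()) / df.std(ddof=0)  # divide by N
manual_sample = (df - df.mean()) / df.std()            # pandas default, N - 1

max_diff_population = (scaled - manual_population).abs().max().max()
max_diff_sample = (scaled - manual_sample).abs().max().max()

print("max |difference| with ddof=0:", max_diff_population)
print("max |difference| with ddof=1:", max_diff_sample)
```

With ddof=0 the remaining difference should be floating-point noise, as in the table above; with the pandas default every value is off by a constant factor of sqrt(N / (N - 1)).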
    