Difference between numpy var() and pandas var()

旧城冷巷雨未停 posted on 2021-02-10 07:32:51

Question


I recently noticed that numpy.var() and pandas.DataFrame.var() (or pandas.Series.var()) return different values for the same data. Is there a difference between them?

Here is my dataset.


   Country    GDP   Area      Continent
0    India   2.79  3.287           Asia
1      USA  20.54  9.840  North America
2    China  13.61  9.590           Asia

Here is my code:


from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

# Scale only the numeric columns (GDP and Area); Country and Continent are left untouched.
catDf.iloc[:, 1:-1] = ss.fit_transform(catDf.iloc[:, 1:-1])

Now checking the pandas variance:

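# Pandas Variance
# numeric_only=True skips the Country and Continent string columns (needed on pandas 2.x)
print(catDf.var(numeric_only=True))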
print(catDf.iloc[:,1:-1].var())
print(catDf.iloc[:,1].var())
print(catDf.iloc[:,2].var())

The output is

GDP     1.5
Area    1.5
dtype: float64
GDP     1.5
Area    1.5
dtype: float64
1.5000000000000002
1.5000000000000002

I expected the variance to be 1, since I applied StandardScaler to these columns.

And the NumPy variance:

print(catDf.iloc[:,1:-1].values.var())
print(catDf.iloc[:,1].values.var())
print(catDf.iloc[:,2].values.var())

The output is

1.0000000000000002
1.0000000000000002
1.0000000000000002

This seems correct.


Answer 1:


pandas var() uses ddof=1 by default, whereas numpy var() uses ddof=0.

To get the same variance in pandas as you are getting in numpy, pass ddof=0:

catDf.iloc[:,1:-1].var(ddof=0)
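As a minimal sketch of the two defaults side by side, the snippet below rebuilds only the numeric columns of the question's dataset (the name df is an assumption, standing in for the relevant part of catDf) and checks that the results agree once ddof is set to the same value on both sides.

```python
import numpy as np
import pandas as pd

# Numeric columns from the question's dataset, before scaling.
# "df" is a stand-in name for the numeric part of catDf.
df = pd.DataFrame({
    "GDP": [2.79, 20.54, 13.61],
    "Area": [3.287, 9.840, 9.590],
})

pandas_default = df.var()                    # pandas default: ddof=1
numpy_default = df.to_numpy().var(axis=0)    # NumPy default: ddof=0

# Setting ddof explicitly on either side removes the discrepancy.
pandas_population = df.var(ddof=0)
numpy_sample = df.to_numpy().var(axis=0, ddof=1)

print(np.allclose(pandas_population, numpy_default))  # same convention: ddof=0
print(np.allclose(pandas_default, numpy_sample))      # same convention: ddof=1
```

Note that axis=0 is passed to NumPy here to get one variance per column. Calling .values.var() on a two-dimensional slice without an axis, as in the question, returns a single variance over all elements of the flattened array.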

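The discrepancy comes down to the difference between population variance (ddof=0, which divides the sum of squared deviations by n) and sample variance (ddof=1, which divides by n - 1). With only three rows, n / (n - 1) = 3/2, which is exactly the factor of 1.5 in the pandas output.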
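To make the two definitions concrete, here is a small sketch that computes both by hand on the GDP column and cross-checks them against Python's standard statistics module. The variable names are illustrative only and do not come from the question.

```python
import math
import statistics

gdp = [2.79, 20.54, 13.61]  # GDP column from the question, before scaling
n = len(gdp)
mean = sum(gdp) / n
squared_deviations = sum((x - mean) ** 2 for x in gdp)

population_variance = squared_deviations / n        # ddof=0, the NumPy default
sample_variance = squared_deviations / (n - 1)      # ddof=1, the pandas default

# The standard library exposes the two definitions under separate names.
print(math.isclose(population_variance, statistics.pvariance(gdp)))
print(math.isclose(sample_variance, statistics.variance(gdp)))

# The two estimates always differ by the factor n / (n - 1).
print(sample_variance / population_variance)
```

The gap between the two shrinks as n grows, so on large datasets the choice of ddof is barely visible, while on a three-row frame it is large.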

Note that the sklearn StandardScaler documentation explicitly states that it uses a ddof of 0. Because the choice is unlikely to affect model performance (it is only used for scaling), ddof is not exposed as a configurable parameter.
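The effect on the scaled data can be reproduced without sklearn. The sketch below standardises each column by hand with NumPy using ddof=0, as a stand-in for what the answer says StandardScaler does; it is not the library's actual implementation, and the array name x is an assumption.

```python
import numpy as np

# GDP and Area columns from the question, one row per country.
x = np.array([
    [2.79, 3.287],
    [20.54, 9.840],
    [13.61, 9.590],
])

# Manual standardisation with the population standard deviation (ddof=0).
# This mimics the convention described above; it is not sklearn's own code.
scaled = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

print(scaled.var(axis=0, ddof=0))  # approximately 1 per column (NumPy default)
print(scaled.var(axis=0, ddof=1))  # approximately 1.5 per column (pandas default)
```

Data scaled to unit population variance therefore reports a sample variance of n / (n - 1), which for three rows matches the 1.5 that pandas printed in the question.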



Source: https://stackoverflow.com/questions/62938495/difference-between-numpy-var-and-pandas-var
