Question
I recently noticed that numpy.var()
and pandas.DataFrame.var()
or pandas.Series.var()
give different values. I want to know whether there is any difference between them.
Here is my dataset.
Country GDP Area Continent
0 India 2.79 3.287 Asia
1 USA 20.54 9.840 North America
2 China 13.61 9.590 Asia
Here is my code:
from sklearn.preprocessing import StandardScaler

# scale the numeric columns (GDP and Area); iloc[:, 1:-1] skips Country and Continent
ss = StandardScaler()
catDf.iloc[:, 1:-1] = ss.fit_transform(catDf.iloc[:, 1:-1])
Now checking the pandas variance:
# Pandas Variance
print(catDf.var())
print(catDf.iloc[:,1:-1].var())
print(catDf.iloc[:,1].var())
print(catDf.iloc[:,2].var())
The output is
GDP 1.5
Area 1.5
dtype: float64
GDP 1.5
Area 1.5
dtype: float64
1.5000000000000002
1.5000000000000002
Whereas it should be 1, since I have used StandardScaler on it.
And for the numpy variance:
print(catDf.iloc[:,1:-1].values.var())
print(catDf.iloc[:,1].values.var())
print(catDf.iloc[:,2].values.var())
The output is:
1.0000000000000002
1.0000000000000002
1.0000000000000002
Which seems correct.
Answer 1:
pandas var has a ddof of 1 by default; numpy has it at 0.
To get the same variance in pandas as you're getting in numpy, use:
catDf.iloc[:,1:-1].var(ddof=0)
This comes down to the difference between population variance and sample variance: the sample variance divides the sum of squared deviations by n - 1 rather than n, so with n = 3 rows it is 3/2 times the population variance, which is exactly why pandas reports 1.5 where numpy reports 1.0.
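As a minimal sketch of that arithmetic (the toy array here is illustrative, not taken from the question):

import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0])
n = len(x)
dev2 = (x - x.mean()) ** 2  # squared deviations from the mean

print(dev2.sum() / n)        # population variance (ddof=0): 0.666...
print(dev2.sum() / (n - 1))  # sample variance (ddof=1): 1.0
print(np.var(x))             # numpy defaults to ddof=0 -> 0.666...
print(pd.Series(x).var())    # pandas defaults to ddof=1 -> 1.0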
Note that the sklearn StandardScaler docs explicitly mention it uses a ddof of 0, and that, since this is unlikely to affect model performance (it is just for scaling), they haven't exposed it as a configurable parameter.
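Putting it together, a minimal reproduction sketch (the DataFrame is rebuilt from the table in the question):

import pandas as pd
from sklearn.preprocessing import StandardScaler

catDf = pd.DataFrame({
    "Country": ["India", "USA", "China"],
    "GDP": [2.79, 20.54, 13.61],
    "Area": [3.287, 9.840, 9.590],
    "Continent": ["Asia", "North America", "Asia"],
})
catDf.iloc[:, 1:-1] = StandardScaler().fit_transform(catDf.iloc[:, 1:-1])

print(catDf.iloc[:, 1:-1].var())         # pandas default ddof=1 -> 1.5 per column
print(catDf.iloc[:, 1:-1].var(ddof=0))   # ddof=0 -> 1.0, matching numpy
print(catDf.iloc[:, 1:-1].values.var())  # numpy default ddof=0 -> 1.0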
Source: https://stackoverflow.com/questions/62938495/difference-between-numpy-var-and-pandas-var