Question
I recently noticed that numpy.var()
and pandas.DataFrame.var()
or pandas.Series.var()
give different values. I want to know whether there is any difference between them.
Here is my dataset.
Country GDP Area Continent
0 India 2.79 3.287 Asia
1 USA 20.54 9.840 North America
2 China 13.61 9.590 Asia
Here is my code:
from sklearn.preprocessing import StandardScaler

# scale the numeric columns (GDP and Area); iloc[:, 1:-1] skips Country and Continent
ss = StandardScaler()
catDf.iloc[:, 1:-1] = ss.fit_transform(catDf.iloc[:, 1:-1])
Now checking the pandas variance:
# Pandas Variance
print(catDf.var())
print(catDf.iloc[:,1:-1].var())
print(catDf.iloc[:,1].var())
print(catDf.iloc[:,2].var())
The output is
GDP 1.5
Area 1.5
dtype: float64
GDP 1.5
Area 1.5
dtype: float64
1.5000000000000002
1.5000000000000002
Whereas it should be 1, since I have used StandardScaler on it.
And for the numpy variance:
print(catDf.iloc[:,1:-1].values.var())
print(catDf.iloc[:,1].values.var())
print(catDf.iloc[:,2].values.var())
The output is:
1.0000000000000002
1.0000000000000002
1.0000000000000002
Which seems correct.
Answer 1:
pandas var has a ddof of 1 by default; numpy has it at 0.
To get the same variance in pandas as you're getting in numpy, use:
catDf.iloc[:,1:-1].var(ddof=0)
This comes down to the difference between population variance and sample variance: the sample variance divides the sum of squared deviations by n - 1 rather than n, so with n = 3 rows it is 3/2 times the population variance, which is exactly why pandas reports 1.5 where numpy reports 1.0.
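As a minimal sketch of that arithmetic (the toy array here is illustrative, not taken from the question):

import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0])
n = len(x)
dev2 = (x - x.mean()) ** 2  # squared deviations from the mean

print(dev2.sum() / n)        # population variance (ddof=0): 0.666...
print(dev2.sum() / (n - 1))  # sample variance (ddof=1): 1.0
print(np.var(x))             # numpy defaults to ddof=0 -> 0.666...
print(pd.Series(x).var())    # pandas defaults to ddof=1 -> 1.0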
Note that the sklearn StandardScaler docs explicitly mention it uses a ddof of 0, and that, since this is unlikely to affect model performance (it is just for scaling), they haven't exposed it as a configurable parameter.
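Putting it together, a minimal reproduction sketch (the DataFrame is rebuilt from the table in the question):

import pandas as pd
from sklearn.preprocessing import StandardScaler

catDf = pd.DataFrame({
    "Country": ["India", "USA", "China"],
    "GDP": [2.79, 20.54, 13.61],
    "Area": [3.287, 9.840, 9.590],
    "Continent": ["Asia", "North America", "Asia"],
})
catDf.iloc[:, 1:-1] = StandardScaler().fit_transform(catDf.iloc[:, 1:-1])

print(catDf.iloc[:, 1:-1].var())         # pandas default ddof=1 -> 1.5 per column
print(catDf.iloc[:, 1:-1].var(ddof=0))   # ddof=0 -> 1.0, matching numpy
print(catDf.iloc[:, 1:-1].values.var())  # numpy default ddof=0 -> 1.0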
Source: https://stackoverflow.com/questions/62938495/difference-between-numpy-var-and-pandas-var