Question
I was wondering how to calculate skewness and kurtosis correctly in pandas. Pandas returns values for skew() and kurtosis(), but they differ noticeably from the scipy.stats values. Which one should I trust, pandas or scipy.stats?
Here is my code:
import numpy as np
import scipy.stats as stats
import pandas as pd
np.random.seed(100)
x = np.random.normal(size=(20))
kurtosis_scipy = stats.kurtosis(x)
kurtosis_pandas = pd.DataFrame(x).kurtosis()[0]
print(kurtosis_scipy, kurtosis_pandas)
# -0.5270409758168872
# -0.31467107631025604
skew_scipy = stats.skew(x)
skew_pandas = pd.DataFrame(x).skew()[0]
print(skew_scipy, skew_pandas)
# -0.41070929017558555
# -0.44478877631598901
Versions:
import scipy
print(np.__version__, pd.__version__, scipy.__version__)
# 1.11.0 0.20.0 0.19.0
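Answer 1:
Pass bias=False to the scipy functions. scipy.stats.skew and scipy.stats.kurtosis default to bias=True, which gives the biased estimators, whereas pandas applies the bias correction. With bias=False the two libraries agree up to floating-point rounding: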
print(
stats.kurtosis(x, bias=False), pd.DataFrame(x).kurtosis()[0],
stats.skew(x, bias=False), pd.DataFrame(x).skew()[0],
sep='\n'
)
-0.31467107631025515
-0.31467107631025604
-0.4447887763159889
-0.444788776315989
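The two sets of numbers differ only by a small-sample correction that depends on n. As a minimal sketch, assuming the standard adjusted Fisher-Pearson formulas (the helper names unbiased_skew and unbiased_kurtosis are made up for illustration and are not part of scipy or pandas), the default biased scipy values can be converted to the bias-corrected ones:

```python
import numpy as np
import scipy.stats as stats

def unbiased_skew(g1, n):
    """Hypothetical helper: bias-correct a biased sample skewness g1."""
    return np.sqrt(n * (n - 1)) / (n - 2) * g1

def unbiased_kurtosis(g2, n):
    """Hypothetical helper: bias-correct a biased excess kurtosis g2."""
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

# Same data as in the question.
np.random.seed(100)
x = np.random.normal(size=20)
n = len(x)

# Start from scipy's default (biased) estimates and apply the correction.
skew_corrected = unbiased_skew(stats.skew(x), n)
kurt_corrected = unbiased_kurtosis(stats.kurtosis(x), n)

# Both should agree with scipy's own bias=False results up to rounding.
print(np.isclose(skew_corrected, stats.skew(x, bias=False)))
print(np.isclose(kurt_corrected, stats.kurtosis(x, bias=False)))
```

The correction factors shrink toward 1 as n grows, so the gap between the two libraries matters mostly for small samples such as the n = 20 used here.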
Answer 2:
Pandas calculates the unbiased estimator of the population excess kurtosis. See Wikipedia for the formulas: https://www.wikiwand.com/en/Kurtosis
Calculate kurtosis from scratch
import numpy as np
import pandas as pd
import scipy.stats
x = np.array([0, 3, 4, 1, 2, 3, 0, 2, 1, 3, 2, 0,
              2, 2, 3, 2, 5, 2, 3, 999])
n = len(x) # sample size
xbar = x.mean() # sample mean
k2 = x.var(ddof=1) # unbiased sample variance; numpy's default (ddof=0) is biased
sum_term = ((x-xbar)**4).sum() # sum of fourth powers of deviations from the mean
factor = (n+1) * n / (n-1) / (n-2) / (n-3)
second = - 3 * (n-1) * (n-1) / (n-2) / (n-3)
first = factor * sum_term / k2 / k2
G2 = first + second
G2 # 19.998428728659768
Calculate kurtosis using numpy/scipy
scipy.stats.kurtosis(x,bias=False) # 19.998428728659757
Calculate kurtosis using pandas
pd.DataFrame(x).kurtosis() # 19.998429
Similarly, you can also calculate skewness.
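As a minimal sketch of the skewness case, assuming the adjusted Fisher-Pearson coefficient G1 = sqrt(n(n-1))/(n-2) * m3/m2^1.5 (the variable names m2, m3, g1, and G1 are chosen here for illustration), the same data can be checked against both libraries:

```python
import numpy as np
import pandas as pd
import scipy.stats

x = np.array([0, 3, 4, 1, 2, 3, 0, 2, 1, 3, 2, 0,
              2, 2, 3, 2, 5, 2, 3, 999])
n = len(x)
xbar = x.mean()

# Biased central moments (divide by n, not n - 1).
m2 = ((x - xbar) ** 2).mean()
m3 = ((x - xbar) ** 3).mean()

g1 = m3 / m2 ** 1.5                         # biased sample skewness
G1 = np.sqrt(n * (n - 1)) / (n - 2) * g1    # bias-corrected skewness

# The three values should agree up to floating-point rounding.
print(G1)
print(scipy.stats.skew(x, bias=False))
print(pd.Series(x).skew())
```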
Source: https://stackoverflow.com/questions/56758125/how-to-find-skewness-and-kurtosis-correctly-in-pandas