Question
I have a very simple dataframe:
import pandas as pd
df = pd.DataFrame([5,7,10,15,19,21,21,22,22,23,23,23,23,23,24,24,24,24,25], columns=['val'])
df.median() returns 23, which is right: of the 19 values in the list, 23 is the 10th (9 values before it and 9 values after it).
I tried to calculate the 1st and 3rd quartiles as:
df.quantile([.25, .75])
       val
0.25  20.0
0.75  23.5
I would have expected the 1st quartile to be 19, the middle of the 9 values below the median, but as you can see above, Python says it is 20. Similarly, for the 3rd quartile, the fifth number counting from the right is 24, but Python shows 23.5.
How does pandas calculate quartiles?
The original question is from the following link: https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/identifying-outliers-iqr-rule
Answer 1:
Python doesn't compute the quantile, pandas does. Take a look at the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html Under the hood it uses NumPy's percentile function: https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html#numpy.percentile
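To check that the two libraries agree, here is a minimal sketch that runs the question's data through both pandas and NumPy with their default settings. The variable names (`values`, `pandas_q`, `numpy_q`) are mine, not from the question.

```python
import numpy as np
import pandas as pd

values = [5, 7, 10, 15, 19, 21, 21, 22, 22, 23,
          23, 23, 23, 23, 24, 24, 24, 24, 25]
df = pd.DataFrame(values, columns=['val'])

# pandas takes quantiles as fractions in [0, 1]
pandas_q = df['val'].quantile([0.25, 0.75])

# NumPy takes percentiles in [0, 100]; its default is also linear interpolation
numpy_q = np.percentile(values, [25, 75])

print(pandas_q.tolist())  # [20.0, 23.5]
print(numpy_q.tolist())   # [20.0, 23.5]
```

Both calls give 20.0 and 23.5, the same numbers as in the question.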
Answer 2:
It uses linear interpolation by default. Here's how to use nearest instead:
df['val'].quantile([0.25, 0.75], interpolation='nearest')
Out:
0.25 19
0.75 24
More info from the official documentation on how the interpolation
parameter works:
This optional parameter specifies the interpolation method to use,
when the desired quantile lies between two data points `i` and `j`:
* linear: `i + (j - i) * fraction`, where `fraction` is the
fractional part of the index surrounded by `i` and `j`.
* lower: `i`.
* higher: `j`.
* nearest: `i` or `j` whichever is nearest.
* midpoint: (`i` + `j`) / 2.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html
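For what it's worth, the `linear` rule quoted above can be reproduced by hand. The sketch below assumes the fractional index is `(n - 1) * q` on the sorted data (0-based), which is consistent with the results in the question. `linear_quantile` is a hypothetical helper written for illustration, not a pandas function.

```python
import math
import pandas as pd

values = sorted([5, 7, 10, 15, 19, 21, 21, 22, 22, 23,
                 23, 23, 23, 23, 24, 24, 24, 24, 25])
s = pd.Series(values)

def linear_quantile(sorted_vals, q):
    """Hypothetical helper: quantile by linear interpolation on sorted data."""
    pos = (len(sorted_vals) - 1) * q      # fractional 0-based index (assumed rule)
    lo, hi = math.floor(pos), math.ceil(pos)
    fraction = pos - lo                   # fractional part of the index
    i, j = sorted_vals[lo], sorted_vals[hi]
    return i + (j - i) * fraction

# q = 0.25: pos = 18 * 0.25 = 4.5, so i = 19, j = 21, fraction = 0.5
print(linear_quantile(values, 0.25))  # 20.0
# q = 0.75: pos = 18 * 0.75 = 13.5, so i = 23, j = 24, fraction = 0.5
print(linear_quantile(values, 0.75))  # 23.5

# Print what each interpolation option gives for the same two quantiles
for method in ('linear', 'lower', 'higher', 'nearest', 'midpoint'):
    print(method, s.quantile([0.25, 0.75], interpolation=method).tolist())
```

With 19 values, both quartiles land exactly halfway between two data points (index 4.5 and 13.5), which is why the default gives 20 and 23.5 rather than values that appear in the list.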
Source: https://stackoverflow.com/questions/55009203/how-does-pandas-calculate-quartiles