Pandas describe vs scipy.stats percentileofscore with NaN?

被撕碎了的回忆 2021-01-29 07:54

I'm seeing something odd: the percentile markers from pd.describe disagree with scipy.stats percentileofscore, and I think NaNs are the cause.

My df is:

2 Answers
  • 2021-01-29 07:56

    The answer is very simple.

    There is no universally accepted formula for computing percentiles, in particular when your data contains ties or when it cannot be split evenly into equal-size buckets.

    For instance, have a look at the R documentation for quantile, which describes nine different types of formulas: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html

    In the end, it comes down to understanding which formula each library uses and whether the differences are big enough to be a problem in your case.
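
    As a small illustration of how much the choice of formula can matter when there are ties, here is a sketch using the kind argument of scipy.stats.percentileofscore. The sample data is made up for the example and is not the asker's data.

```python
from scipy.stats import percentileofscore

# Made-up sample with a tie at the value 3.
data = [1, 2, 3, 3, 4]

# The same score gets a different percentile under each convention.
results = {
    kind: percentileofscore(data, 3, kind=kind)
    for kind in ("rank", "weak", "strict", "mean")
}

for kind, value in results.items():
    print(f"{kind:>6}: {value}")
# rank: 70.0, weak: 80.0, strict: 40.0, mean: 60.0
```

    All four numbers are defensible answers to "what percentile is 3?", which is why two libraries can disagree without either being wrong.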

  • 2021-01-29 08:19

    scipy.stats.percentileofscore does not ignore nan, nor does it check for the value and handle it in some special way. It is just another floating point value in your data. This means the behavior of percentileofscore with data containing nan is undefined, because of the behavior of nan in comparisons:

    In [44]: np.nan > 0
    Out[44]: False
    
    In [45]: np.nan < 0
    Out[45]: False
    
    In [46]: np.nan == 0
    Out[46]: False
    
    In [47]: np.nan == np.nan
    Out[47]: False
    

    Those results are all correct--that is how nan is supposed to behave. But that means, in order to know how percentileofscore handles nan, you have to know how the code does comparisons. And that is an implementation detail that you shouldn't have to know, and that you can't rely on to be the same in future versions of scipy.

    If you investigate the behavior of percentileofscore, you'll find that it behaves as if nan were infinite. For example, if you replace nan with a value larger than any other value in the input, you'll get the same results:

    In [53]: percentileofscore([10, 20, 25, 30, np.nan, np.nan], 18)
    Out[53]: 16.666666666666664
    
    In [54]: percentileofscore([10, 20, 25, 30, 999, 999], 18)
    Out[54]: 16.666666666666664
    

    Unfortunately, you can't rely on this behavior. If the implementation changes in the future, nan might end up behaving like negative infinity, or have some other unspecified behavior.
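
    To make the dependence on comparisons concrete, here is a rough sketch with two hypothetical helpers (they are not scipy's code). Both compute a "percent of values at or below the score" and agree on clean data, but one counts with <= and the other with >, so they disagree once nan is present.

```python
import numpy as np

def pct_counting_le(values, score):
    """Hypothetical helper: count values <= score directly."""
    a = np.asarray(values, dtype=float)
    # nan <= score is False, so nan is never counted here.
    return np.count_nonzero(a <= score) / len(a) * 100.0

def pct_counting_gt(values, score):
    """Hypothetical helper: count values > score and subtract from the total."""
    a = np.asarray(values, dtype=float)
    # nan > score is also False, so nan ends up counted as "at or below".
    return (len(a) - np.count_nonzero(a > score)) / len(a) * 100.0

with_nan = [10, 20, 25, 30, np.nan, np.nan]
with_big = [10, 20, 25, 30, 999, 999]

le_nan = pct_counting_le(with_nan, 18)  # nan acts like +infinity
le_big = pct_counting_le(with_big, 18)
gt_nan = pct_counting_gt(with_nan, 18)  # nan acts like -infinity
gt_big = pct_counting_gt(with_big, 18)

print(le_nan, le_big)  # equal: one value out of six is <= 18
print(gt_nan, gt_big)  # differ: 50.0 with nan, about 16.67 with 999
```

    Which of these two styles a library happens to use is exactly the kind of implementation detail that can change between versions.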

    The solution to this "problem" is simple: don't give percentileofscore any nan values. You'll have to clean up your data first. Note that this can be as simple as:

    result = percentileofscore(a[~np.isnan(a)], score)
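
    If the data lives in a pandas Series, a minimal sketch of the same cleanup could look like the following. The values are made up (the question's DataFrame is not shown), and the point is only that describe skips NaN while percentileofscore needs them dropped explicitly.

```python
import numpy as np
import pandas as pd
from scipy.stats import percentileofscore

# Made-up column with two missing values.
s = pd.Series([10, 20, 25, 30, np.nan, np.nan])

# describe() ignores NaN: count is 4, and quantiles use only the 4 real values.
summary = s.describe()
count = summary["count"]
median = summary["50%"]

# Drop NaN before calling percentileofscore so both tools see the same 4 values.
clean = s.dropna().to_numpy()
pct = percentileofscore(clean, 18)

print(count, median, pct)
```

    With the NaN values removed, 18 sits above one of four values, so the result is 25 rather than the 16.67 shown earlier for the uncleaned input.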
    