Map each list value to its corresponding percentile

夕颜 2020-11-29 00:11

I'd like to create a function that takes a (sorted) list as its argument and outputs a list containing each element's corresponding percentile.

For example,

9 answers
  • 2020-11-29 00:18

    In terms of complexity, I think reptilicus's answer is not optimal. It takes O(n^2) time.

    Here is a solution that takes O(n log n) time.

    def list_to_percentiles(numbers):
        # Pair each value with its original index, then sort the pairs by value.
        pairs = sorted(zip(numbers, range(len(numbers))), key=lambda p: p[0])
        result = [0.0] * len(numbers)
        for rank in range(len(numbers)):
            original_index = pairs[rank][1]
            # Scale the rank to 0..100; assumes the list has at least two elements.
            result[original_index] = rank * 100.0 / (len(numbers) - 1)
        return result
    

    I'm not sure, but I think this is the optimal time complexity you can get. The rough reason is that knowing all of the percentiles is essentially equivalent to knowing the sorted order of the list, and a comparison-based sort can't do better than O(n log n).

    EDIT: Depending on your definition of "percentile" this may not always give the correct result. See BrenBarn's answer for more explanation and for a better solution which makes use of scipy/numpy.
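
    To make the caveat in the EDIT concrete, here is a small sketch (a Python 3 restatement of the same approach, not a different algorithm). It shows that tied values receive different percentiles, because every position in the sorted order gets its own rank.

```python
def list_to_percentiles(numbers):
    # Sort (value, original_index) pairs by value; sorted() is stable,
    # so tied values keep their original relative order.
    pairs = sorted(zip(numbers, range(len(numbers))), key=lambda p: p[0])
    result = [0.0] * len(numbers)
    for rank, (_, original_index) in enumerate(pairs):
        # Assumes at least two elements; a single element would divide by zero.
        result[original_index] = rank * 100.0 / (len(numbers) - 1)
    return result

print(list_to_percentiles([3, 1, 2]))         # [100.0, 0.0, 50.0]
print(list_to_percentiles([1, 1, 2, 2, 17]))  # [0.0, 25.0, 50.0, 75.0, 100.0]
```

    In the second call the two 1s get 0 and 25, and the two 2s get 50 and 75, which is usually not what a tie-aware percentile definition would return.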

  • 2020-11-29 00:18

    This version also lets you pass the exact percentile values used for ranking:

    import numpy as np

    def what_pctl_number_of(x, a, pctls=np.arange(1, 101)):
        return np.argmax(np.sign(np.append(np.percentile(x, pctls), np.inf) - a))
    

    So you can find out which of the provided percentile bands a value falls into:

    _x = np.random.randn(100, 1)
    what_pctl_number_of(_x, 1.6, [25, 50, 75, 100])
    

    Output:

    3
    

    so the value falls in the 75 to 100 range (the exact result depends on the random data).
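
    For sorted cut points, the `argmax`/`sign` expression returns the index of the first cut point strictly greater than the value, which is what a binary search gives. A minimal sketch using the standard-library `bisect` module follows; the helper name `band_index` and the cut-point values are made up for illustration and are not part of the answer.

```python
from bisect import bisect_right

def band_index(cut_points, value):
    # Index of the first cut point strictly greater than `value`;
    # returns len(cut_points) if the value is at or above all of them.
    # Assumes cut_points is sorted in ascending order.
    return bisect_right(cut_points, value)

# Hypothetical values standing in for the 25th, 50th, 75th and 100th percentiles.
cut_points = [-0.7, 0.0, 0.7, 2.5]
print(band_index(cut_points, 1.6))   # 3: between the 75th and 100th cut points
print(band_index(cut_points, -1.0))  # 0: below the 25th cut point
```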

  • 2020-11-29 00:19

    For a pure-Python function that calculates the percentile score of a given item relative to a population (a list of scores), I pulled this from the scipy source code and removed all references to numpy:

    def percentileofscore(a, score, kind='rank'):
        n = len(a)
        if n == 0:
            return 100.0
        left = len([item for item in a if item < score])    # values strictly below score
        right = len([item for item in a if item <= score])  # values at or below score
        if kind == 'rank':
            pct = (right + left + (1 if right > left else 0)) * 50.0 / n
            return pct
        elif kind == 'strict':
            # Multiply by a float first so Python 2 does not truncate the division.
            return left * 100.0 / n
        elif kind == 'weak':
            return right * 100.0 / n
        elif kind == 'mean':
            pct = (left + right) * 50.0 / n
            return pct
        else:
            raise ValueError("kind can only be 'rank', 'strict', 'weak' or 'mean'")
    

    source: https://github.com/scipy/scipy/blob/v1.2.1/scipy/stats/stats.py#L1744-L1835

    Calculating percentiles is trickier than one would think, but this function is far lighter than pulling in the full scipy/numpy/scikit stack, so it suits lightweight deployments. The original code does its counting with numpy's nonzero counting rather than list comprehensions, but otherwise the math is the same. The optional kind parameter controls how values tied with the given score are counted.

    For this use case, one can call this function for each item in a list using the map() function.
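
    As a sketch of that `map()` call, assuming the pure-Python function from this answer (repeated here in compact form so the snippet runs on its own; the sample list `scores` is illustrative):

```python
from functools import partial

def percentileofscore(a, score, kind='rank'):
    n = len(a)
    if n == 0:
        return 100.0
    left = sum(1 for item in a if item < score)    # strictly below score
    right = sum(1 for item in a if item <= score)  # at or below score
    if kind == 'rank':
        return (right + left + (1 if right > left else 0)) * 50.0 / n
    elif kind == 'strict':
        return left * 100.0 / n
    elif kind == 'weak':
        return right * 100.0 / n
    elif kind == 'mean':
        return (left + right) * 50.0 / n
    raise ValueError("kind can only be 'rank', 'strict', 'weak' or 'mean'")

scores = [1, 2, 3, 4, 17]
# partial() fixes the population, so map() only has to supply each score.
percentiles = list(map(partial(percentileofscore, scores), scores))
print(percentiles)  # [20.0, 40.0, 60.0, 80.0, 100.0]
```

    Each call scans the whole list, so mapping over n items costs O(n^2) in total; that is fine for short lists but slow for large ones.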

  • 2020-11-29 00:20

    For me the best solution is to use QuantileTransformer in sklearn.preprocessing.

    import numpy as np
    from sklearn.preprocessing import QuantileTransformer

    # Reshape to a single column, transform, then flatten back to a plain list.
    fn = lambda input_list: QuantileTransformer(n_quantiles=100).fit_transform(np.array(input_list).reshape([-1, 1])).ravel().tolist()
    input_raw = [1, 2, 3, 4, 17]
    output_perc = fn(input_raw)

    print("Input=", input_raw)
    print("Output=", np.round(output_perc, 2))
    

    Here is the output

    Input= [1, 2, 3, 4, 17]
    Output= [ 0.    0.25  0.5   0.75  1.  ]
    

    Note: this approach has two salient features:

    1. The raw input data does not need to be sorted.
    2. The raw input data does not need to be a single column (QuantileTransformer itself accepts multiple columns).
  • 2020-11-29 00:27

    I think your example input/output does not correspond to typical ways of calculating percentile. If you calculate the percentile as "proportion of data points strictly less than this value", then the top value should be 0.8 (since 4 of 5 values are less than the largest one). If you calculate it as "proportion of data points less than or equal to this value", then the bottom value should be 0.2 (since 1 of 5 values equals the smallest one). Thus the percentiles would be [0, 0.2, 0.4, 0.6, 0.8] or [0.2, 0.4, 0.6, 0.8, 1]. Your definition seems to be "the number of data points strictly less than this value, considered as a proportion of the number of data points not equal to this value", but in my experience this is not a common definition (see for instance Wikipedia).

    With the typical percentile definitions, the percentile of a data point is equal to its rank divided by the number of data points. (See for instance this question on Stats SE asking how to do the same thing in R.) Differences in how to compute the percentile amount to differences in how to compute the rank (for instance, how to rank tied values). The scipy.stats.percentileofscore function provides four ways of computing percentiles:

    >>> from scipy import stats
    >>> x = [1, 1, 2, 2, 17]
    >>> [stats.percentileofscore(x, a, 'rank') for a in x]
    [30.0, 30.0, 70.0, 70.0, 100.0]
    >>> [stats.percentileofscore(x, a, 'weak') for a in x]
    [40.0, 40.0, 80.0, 80.0, 100.0]
    >>> [stats.percentileofscore(x, a, 'strict') for a in x]
    [0.0, 0.0, 40.0, 40.0, 80.0]
    >>> [stats.percentileofscore(x, a, 'mean') for a in x]
    [20.0, 20.0, 60.0, 60.0, 90.0]
    

    (I used a dataset containing ties to illustrate what happens in such cases.)

    The "rank" method assigns tied groups a rank equal to the average of the ranks they would cover (e.g., a three-way tie for 2nd place gets a rank of 3 because it "takes up" ranks 2, 3 and 4). The "weak" method assigns a percentile based on the proportion of data points less than or equal to a given point; "strict" is the same but counts the proportion of points strictly less than the given point. The "mean" method is the average of the latter two.

    As Kevin H. Lin noted, calling percentileofscore in a loop is inefficient since it has to recompute the ranks on every pass. However, these percentile calculations can be easily replicated using different ranking methods provided by scipy.stats.rankdata, letting you calculate all the percentiles at once:

    >>> from scipy import stats
    >>> stats.rankdata(x, "average")/len(x)
    array([ 0.3,  0.3,  0.7,  0.7,  1. ])
    >>> stats.rankdata(x, 'max')/len(x)
    array([ 0.4,  0.4,  0.8,  0.8,  1. ])
    >>> (stats.rankdata(x, 'min')-1)/len(x)
    array([ 0. ,  0. ,  0.4,  0.4,  0.8])
    

    In the last case the ranks are adjusted down by one to make them start from 0 instead of 1. (I've omitted "mean", but it could easily be obtained by averaging the results of the latter two methods.)
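
    For readers without scipy, the same four definitions can be sketched in pure Python by sorting once and using binary search to count the values below and at-or-below each point. This is only the same arithmetic, not how `rankdata` is implemented, and `all_percentiles` is a hypothetical helper name.

```python
from bisect import bisect_left, bisect_right

def all_percentiles(values, kind='rank'):
    ordered = sorted(values)  # one O(n log n) sort, reused for every lookup
    n = len(ordered)
    result = []
    for v in values:
        left = bisect_left(ordered, v)    # count of values strictly less than v
        right = bisect_right(ordered, v)  # count of values less than or equal to v
        if kind == 'rank':
            # Average 1-based rank of the tied group is (left + right + 1) / 2.
            pct = (left + right + 1) * 50.0 / n
        elif kind == 'weak':
            pct = right * 100.0 / n
        elif kind == 'strict':
            pct = left * 100.0 / n
        elif kind == 'mean':
            pct = (left + right) * 50.0 / n
        else:
            raise ValueError("kind can only be 'rank', 'strict', 'weak' or 'mean'")
        result.append(pct)
    return result

x = [1, 1, 2, 2, 17]
print(all_percentiles(x, 'rank'))    # [30.0, 30.0, 70.0, 70.0, 100.0]
print(all_percentiles(x, 'mean'))    # [20.0, 20.0, 60.0, 60.0, 90.0]
```

    The results are on a 0 to 100 scale, matching the `percentileofscore` outputs for the same data rather than the 0 to 1 scale of the `rankdata` examples.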

    I did some timings. With small data such as that in your example, using rankdata is somewhat slower than Kevin H. Lin's solution (presumably due to the overhead scipy incurs in converting things to numpy arrays under the hood) but faster than calling percentileofscore in a loop as in reptilicus's answer:

    In [11]: %timeit [stats.percentileofscore(x, i) for i in x]
    1000 loops, best of 3: 414 µs per loop
    
    In [12]: %timeit list_to_percentiles(x)
    100000 loops, best of 3: 11.1 µs per loop
    
    In [13]: %timeit stats.rankdata(x, "average")/len(x)
    10000 loops, best of 3: 39.3 µs per loop
    

    With a large dataset, however, the performance advantage of numpy takes effect and using rankdata is 10 times faster than Kevin's list_to_percentiles:

    In [18]: x = np.random.randint(0, 10000, 1000)
    
    In [19]: %timeit [stats.percentileofscore(x, i) for i in x]
    1 loops, best of 3: 437 ms per loop
    
    In [20]: %timeit list_to_percentiles(x)
    100 loops, best of 3: 1.08 ms per loop
    
    In [21]: %timeit stats.rankdata(x, "average")/len(x)
    10000 loops, best of 3: 102 µs per loop
    

    This advantage will only become more pronounced on larger and larger datasets.

  • 2020-11-29 00:31

    I think you want scipy.stats.percentileofscore

    Example:

    from scipy.stats import percentileofscore
    percentileofscore([1, 2, 3, 4], 3)
    # 75.0
    data = [1, 2, 3, 4]  # example data
    percentiles = [percentileofscore(data, i) for i in data]
    