percentiles from counts of values

误落风尘 2021-01-13 08:41

I want to calculate percentiles from an ensemble of multiple large vectors in Python. Instead of trying to concatenate the vectors and then putting the resulting huge vector

2 answers
  • 2021-01-13 09:05

    The same question has been bothering me for a long time and I decided to make an effort. The idea was to reuse something from scipy.stats, so that we would have cdf and ppf out of the box.

    There is a class rv_discrete meant for subclassing. Browsing the sources for something similar among its subclasses, I found rv_sample with an interesting description: "A 'sample' discrete distribution defined by the support and values." The class is not exposed in the public API, but it is what gets used when you pass values directly to rv_discrete.

    Thus, here is a possible solution:

    import numpy as np
    import scipy.stats
    
    # some mapping from numeric values to the frequencies
    freqs = np.array([
        [1, 3],
        [2, 10],
        [3, 13],
        [4, 12],
        [5, 9],
        [6, 4],
    ])
    
    def distrib_from_freqs(arr: np.ndarray) -> scipy.stats.rv_discrete:
        pmf = arr[:, 1] / arr[:, 1].sum()
        distrib = scipy.stats.rv_discrete(values=(arr[:, 0], pmf))
        return distrib
    
    distrib = distrib_from_freqs(freqs)
    
    print(distrib.pmf(freqs[:, 0]))
    print(distrib.cdf(freqs[:, 0]))
    print(distrib.ppf(distrib.cdf(freqs[:, 0])))  # percentiles
    
    # [0.05882353 0.19607843 0.25490196 0.23529412 0.17647059 0.07843137]
    # [0.05882353 0.25490196 0.50980392 0.74509804 0.92156863 1.        ]
    # [1. 2. 3. 4. 5. 6.]
    
    # max, median, 1st quartile, 3rd quartile
    print(distrib.ppf([1.0, 0.5, 0.25, 0.75]))
    # [6. 3. 2. 5.]
    
    # ppf is defined for q in (0, 1];
    #   q = 0 returns the value just below the minimum of the support (here 1 - 1 = 0):
    print(distrib.ppf(0))
    # 0.0
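
The question is about an ensemble of several large vectors, so the frequency table has to be built before it can be passed to rv_discrete. A minimal sketch of that step, assuming the values repeat often enough that counting them is cheaper than concatenating; `merge_counts` and the small sample `vectors` are hypothetical names used only for illustration:

```python
import numpy as np
import scipy.stats

def merge_counts(vectors):
    """Accumulate {value: count} across vectors without concatenating them."""
    totals = {}
    for vec in vectors:
        values, counts = np.unique(vec, return_counts=True)
        for v, c in zip(values.tolist(), counts.tolist()):
            totals[v] = totals.get(v, 0) + c
    return totals

# Hypothetical small inputs standing in for the large vectors.
vectors = [np.array([1, 2, 2, 3]), np.array([2, 3, 3, 4])]
totals = merge_counts(vectors)

# Turn the merged counts into a support and a pmf for rv_discrete.
xk = np.array(sorted(totals))
pk = np.array([totals[x] for x in xk.tolist()], dtype=float)
pk /= pk.sum()

merged = scipy.stats.rv_discrete(values=(xk, pk))
median = merged.ppf(0.5)
```

Only one vector is held in memory at a time, and the dictionary grows with the number of distinct values rather than with the total number of samples.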
    
  • 2021-01-13 09:19

    This uses collections.Counter for the first problem (calculating and combining frequency tables), following Julien Palard's suggestion, and my own implementation for the second problem (calculating percentiles from frequency tables):

    from collections import Counter
    
    def calc_percentiles(cnts_dict, percentiles_to_calc=range(101)):
        """Returns [(percentile, value)] with nearest rank percentiles.
        Percentile 0: <min_value>, 100: <max_value>.
        cnts_dict: { <value>: <count> }
        percentiles_to_calc: iterable of percentiles to calculate; each p must satisfy 0 <= p <= 100
        """
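        # materialize once so that generators are not consumed by the assert
        percentiles_to_calc = sorted(percentiles_to_calc)
        assert all(0 <= p <= 100 for p in percentiles_to_calc)
        percentiles = []
        num = sum(cnts_dict.values())
        cnts = sorted(cnts_dict.items())
        curr_cnts_pos = 0  # current position in cnts
        curr_pos = cnts[0][1]  # sum of counts up to and including curr_cnts_pos
        for p in percentiles_to_calc:
            if p < 100:
                percentile_pos = p / 100.0 * num
                # advance to the first value whose cumulative count exceeds percentile_pos;
                # the bound is len(cnts) - 1 so the index below can never run past the end
                while curr_pos <= percentile_pos and curr_cnts_pos < len(cnts) - 1:
                    curr_cnts_pos += 1
                    curr_pos += cnts[curr_cnts_pos][1]
                percentiles.append((p, cnts[curr_cnts_pos][0]))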
            else:
                percentiles.append((p, cnts[-1][0]))  # we could add a small value
        return percentiles
    
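    # segment_iterator: any iterable that yields the vectors (segments) one at a time
    cnts_dict = Counter()
    for segment in segment_iterator:
        cnts_dict.update(segment)  # adds the counts of this segment in place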
    
    percentiles = calc_percentiles(cnts_dict)
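
For large frequency tables the same rule can be written with a cumulative sum and a binary search. This is a sketch rather than a drop-in replacement: `percentiles_from_counts` and the sample `segments` are hypothetical names, and the rule is assumed to be "the first value whose cumulative count exceeds p / 100 * N", as in the loop above.

```python
from collections import Counter
import numpy as np

def percentiles_from_counts(cnts_dict, percentiles):
    """Return the value for each percentile, given {value: count}."""
    items = sorted(cnts_dict.items())
    values = np.array([v for v, _ in items])
    cum = np.cumsum([c for _, c in items])  # cumulative counts, cum[-1] == N
    targets = np.asarray(percentiles, dtype=float) / 100.0 * cum[-1]
    # side="right" gives the first index whose cumulative count is > target
    idx = np.searchsorted(cum, targets, side="right")
    # p == 100 would point past the end, so clamp it to the maximum value
    idx = np.minimum(idx, len(values) - 1)
    return values[idx]

# Hypothetical small segments standing in for the large vectors.
segments = [[1, 2, 2, 3], [2, 3, 3, 4]]
cnts = Counter()
for segment in segments:
    cnts.update(segment)

result = percentiles_from_counts(cnts, [0, 25, 50, 75, 100])
```

Because the comparison is strict, an exact tie (a cumulative count equal to p / 100 * N) moves on to the next value, so the result can differ from a rule based on cdf >= q, such as the one used by rv_discrete.ppf.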
    