percentiles from counts of values

误落风尘 2021-01-13 08:41

I want to calculate percentiles from an ensemble of multiple large vectors in Python. Instead of trying to concatenate the vectors and then putting the resulting huge vector

2 answers
  •  隐瞒了意图╮
    2021-01-13 09:19

    Following Julien Palard's suggestion, I used collections.Counter for the first problem (calculating and combining frequency tables) and wrote my own implementation for the second (calculating percentiles from frequency tables):

    from collections import Counter

    def calc_percentiles(cnts_dict, percentiles_to_calc=range(101)):
        """Returns [(percentile, value)] with nearest rank percentiles.
        Percentile 0: minimum value, 100: maximum value.
        cnts_dict: {value: count}
        percentiles_to_calc: iterable for percentiles to calculate; 0 <= ~ <= 100
        """
        # Sort once up front so that generators are not consumed by the assert.
        percentiles_to_calc = sorted(percentiles_to_calc)
        assert all(0 <= p <= 100 for p in percentiles_to_calc)
        if not cnts_dict:
            raise ValueError("cnts_dict must not be empty")
        percentiles = []
        num = sum(cnts_dict.values())  # total number of observations
        cnts = sorted(cnts_dict.items())  # [(value, count)] ordered by value
        curr_cnts_pos = 0  # current position in cnts
        curr_pos = cnts[0][1]  # sum of counts up to and including curr_cnts_pos
        for p in percentiles_to_calc:
            if p < 100:
                percentile_pos = p / 100.0 * num
                # Advance to the first value whose cumulative count exceeds
                # percentile_pos; the bound keeps the index inside cnts.
                while curr_pos <= percentile_pos and curr_cnts_pos < len(cnts) - 1:
                    curr_cnts_pos += 1
                    curr_pos += cnts[curr_cnts_pos][1]
                percentiles.append((p, cnts[curr_cnts_pos][0]))
            else:
                percentiles.append((p, cnts[-1][0]))  # we could add a small value
        return percentiles
    
    cnts_dict = Counter()
    for segment in segment_iterator:  # segment_iterator: any iterable yielding the vectors
        cnts_dict.update(segment)  # add this segment's counts in place

    percentiles = calc_percentiles(cnts_dict)
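
A quick way to sanity-check this approach is to compare it against the sorted concatenation on a small input. The sketch below is not the answer's function: `percentile_from_counts` and the sample `segments` are hypothetical names introduced for illustration. It assumes the same rule as the loop above (the smallest value whose cumulative count exceeds p% of the total, with the maximum at p = 100), but looks the position up with `itertools.accumulate` and `bisect.bisect_right` instead of a running scan.

```python
from bisect import bisect_right
from collections import Counter
from itertools import accumulate


def percentile_from_counts(counts, p):
    """Return the smallest value whose cumulative count exceeds p% of the total.

    counts: mapping {value: count}; p: number with 0 <= p <= 100.
    p == 100 returns the maximum value.
    (Hypothetical helper for illustration, not part of the answer above.)
    """
    if not counts:
        raise ValueError("counts must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    values = sorted(counts)
    # Running totals of the counts, in the same order as `values`.
    cumulative = list(accumulate(counts[v] for v in values))
    if p == 100:
        return values[-1]
    target = p / 100.0 * cumulative[-1]
    # bisect_right gives the index of the first cumulative count > target.
    return values[bisect_right(cumulative, target)]


# Made-up segments standing in for the large vectors.
segments = [[1, 2, 2, 3], [3, 3, 4], [10, 1, 2]]

counts = Counter()
for segment in segments:
    counts.update(segment)  # merge this segment's frequency table in place

# Reference result: only feasible here because the sample data is tiny.
merged = sorted(x for segment in segments for x in segment)

median = percentile_from_counts(counts, 50)
print(median)  # 3, the same as merged[5]
```

The frequency table only grows with the number of distinct values, so this pays off when the vectors contain many repeats (for example integers or rounded measurements). With mostly unique floats the Counter ends up nearly as large as the concatenated vector.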
    
