I want to calculate percentiles from an ensemble of multiple large vectors in Python. Instead of trying to concatenate the vectors and then putting the resulting huge vector
The same question has been bothering me for a long time, so I decided to make an effort. The idea was to reuse something from scipy.stats, so that we would have cdf and ppf out of the box.
There is a class rv_discrete meant for subclassing. Browsing the sources for something similar among its subclasses, I found rv_sample with an interesting description: "A 'sample' discrete distribution defined by the support and values." The class is not exposed in the API, but it is used when you pass values directly to rv_discrete.
Thus, here is a possible solution:
import numpy as np
import scipy.stats
# some mapping from numeric values to the frequencies
freqs = np.array([
    [1, 3],
    [2, 10],
    [3, 13],
    [4, 12],
    [5, 9],
    [6, 4],
])
def distrib_from_freqs(arr: np.ndarray) -> scipy.stats.rv_discrete:
    pmf = arr[:, 1] / arr[:, 1].sum()
    distrib = scipy.stats.rv_discrete(values=(arr[:, 0], pmf))
    return distrib
distrib = distrib_from_freqs(freqs)
print(distrib.pmf(freqs[:, 0]))
print(distrib.cdf(freqs[:, 0]))
print(distrib.ppf(distrib.cdf(freqs[:, 0]))) # percentiles
# [0.05882353 0.19607843 0.25490196 0.23529412 0.17647059 0.07843137]
# [0.05882353 0.25490196 0.50980392 0.74509804 0.92156863 1. ]
# [1. 2. 3. 4. 5. 6.]
# max, median, 1st quartile, 3rd quartile
print(distrib.ppf([1.0, 0.5, 0.25, 0.75]))
# [6. 3. 2. 5.]
# ppf maps probabilities from (0, 1] onto the values;
# q = 0 yields a value just below the minimum:
print(distrib.ppf(0))
# 0.0
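The frequency table above is hard-coded, while the question is about several large vectors. A minimal sketch of how such a table might be accumulated chunk by chunk follows; `merge_freqs` and the sample `chunks` are hypothetical names and data introduced only for illustration, and the approach assumes the number of distinct values is small enough to fit in memory.

```python
import numpy as np
import scipy.stats


def merge_freqs(chunks):
    """Accumulate a sorted (value, count) table over an iterable of 1-D arrays."""
    totals = {}
    for chunk in chunks:
        # count each distinct value within this chunk only
        values, counts = np.unique(chunk, return_counts=True)
        for v, c in zip(values.tolist(), counts.tolist()):
            totals[v] = totals.get(v, 0) + c
    return np.array([[k, totals[k]] for k in sorted(totals)])


# hypothetical chunks standing in for the large vectors
chunks = [np.array([1, 2, 2, 3]), np.array([2, 3, 3, 4])]
freqs = merge_freqs(chunks)

# same construction as distrib_from_freqs above
pmf = freqs[:, 1] / freqs[:, 1].sum()
distrib = scipy.stats.rv_discrete(values=(freqs[:, 0], pmf))
third_quartile = distrib.ppf(0.75)
```

Only the per-chunk counts are kept between iterations, so the vectors never have to be concatenated.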
Using collections.Counter for solving the first problem (calculating and combining frequency tables), following Julien Palard's suggestion, and my own implementation for the second problem (calculating percentiles from frequency tables):
from collections import Counter
def calc_percentiles(cnts_dict, percentiles_to_calc=range(101)):
    """Returns [(percentile, value)] with nearest rank percentiles.
    Percentile 0: <min_value>, 100: <max_value>.
    cnts_dict: { <value>: <count> }
    percentiles_to_calc: iterable of percentiles to calculate; 0 <= p <= 100
    """
    # sorting up front also materializes one-shot iterables such as generators
    percentiles_to_calc = sorted(percentiles_to_calc)
    assert all(0 <= p <= 100 for p in percentiles_to_calc)
    percentiles = []
    num = sum(cnts_dict.values())
    cnts = sorted(cnts_dict.items())
    curr_cnts_pos = 0  # current position in cnts
    curr_pos = cnts[0][1]  # sum of freqs up to curr_cnts_pos
    for p in percentiles_to_calc:
        if p < 100:
            percentile_pos = p / 100.0 * num
            # stop at the last entry so the index below stays in range
            while curr_pos <= percentile_pos and curr_cnts_pos < len(cnts) - 1:
                curr_cnts_pos += 1
                curr_pos += cnts[curr_cnts_pos][1]
            percentiles.append((p, cnts[curr_cnts_pos][0]))
        else:
            percentiles.append((p, cnts[-1][0]))  # we could add a small value
    return percentiles
cnts_dict = Counter()
# segment_iterator: any iterable yielding the individual vectors (not defined in this snippet)
for segment in segment_iterator:
    cnts_dict += Counter(segment)
percentiles = calc_percentiles(cnts_dict)
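For comparison, the same lookup can be sketched with NumPy's cumulative sum and binary search instead of an explicit loop. This is not the implementation above but a swapped-in variant: it applies the ceil-based nearest-rank rule, so at exact rank boundaries it may return a different value than `calc_percentiles`. The function name `percentiles_from_counts` and the sample `segments` are hypothetical.

```python
from collections import Counter

import numpy as np


def percentiles_from_counts(cnts, qs):
    """Nearest-rank percentiles (ceil rule) from a {value: count} mapping."""
    values = np.array(sorted(cnts))
    counts = np.array([cnts[v] for v in values.tolist()])
    cum = np.cumsum(counts)  # cum[i] = number of samples <= values[i]
    total = cum[-1]
    # rank of each requested percentile, clamped so that q = 0 maps to the minimum
    ranks = np.maximum(np.ceil(np.asarray(qs) / 100.0 * total), 1)
    # first position whose cumulative count reaches the rank
    idx = np.searchsorted(cum, ranks, side="left")
    return values[idx]


# hypothetical segments standing in for the large vectors
segments = [[1, 2, 2, 3], [2, 3, 3, 4]]
cnts = Counter()
for segment in segments:
    cnts.update(segment)  # adds the counts in place

result = percentiles_from_counts(cnts, [0, 25, 50, 75, 100])
```

With many requested percentiles this replaces the Python-level loop by one vectorized search, at the cost of building three arrays the size of the frequency table.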