I want to calculate percentiles from an ensemble of multiple large vectors in Python. Instead of trying to concatenate the vectors and then putting the resulting huge vector
The same question has been bothering me for a long time and I decided to make an effort. The idea was to reuse something from scipy.stats, so that we would have cdf and ppf out of the box.
There is a class rv_discrete meant for subclassing. Browsing the sources for something similar among its subclasses, I found rv_sample with an interesting description: "A 'sample' discrete distribution defined by the support and values." The class is not exposed in the API, but it is used when you pass values directly to rv_discrete.
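As a quick illustration of the values route, here is a minimal sketch with a made-up three-point distribution (the numbers are purely illustrative). The printed class name is an internal detail and may differ between SciPy versions.

```python
import scipy.stats

# Passing values=(xk, pk) builds the distribution directly from
# the support points and their probabilities (illustrative numbers).
d = scipy.stats.rv_discrete(values=([1, 2, 3], [0.2, 0.5, 0.3]))

print(type(d).__name__)  # internal class name, not part of the public API
print(isinstance(d, scipy.stats.rv_discrete))
print(d.cdf(2), d.ppf(0.5))
```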
Thus, here is a possible solution:
import numpy as np
import scipy.stats
# some mapping from numeric values to the frequencies
freqs = np.array([
    [1, 3],
    [2, 10],
    [3, 13],
    [4, 12],
    [5, 9],
    [6, 4],
])
def distrib_from_freqs(arr: np.ndarray) -> scipy.stats.rv_discrete:
    # column 0 holds the values, column 1 their frequencies
    pmf = arr[:, 1] / arr[:, 1].sum()  # normalize frequencies to probabilities
    distrib = scipy.stats.rv_discrete(values=(arr[:, 0], pmf))
    return distrib
distrib = distrib_from_freqs(freqs)
print(distrib.pmf(freqs[:, 0]))
print(distrib.cdf(freqs[:, 0]))
print(distrib.ppf(distrib.cdf(freqs[:, 0]))) # percentiles
# [0.05882353 0.19607843 0.25490196 0.23529412 0.17647059 0.07843137]
# [0.05882353 0.25490196 0.50980392 0.74509804 0.92156863 1. ]
# [1. 2. 3. 4. 5. 6.]
# max, median, 1st quartile, 3rd quartile
print(distrib.ppf([1.0, 0.5, 0.25, 0.75]))
# [6. 3. 2. 5.]
# ppf maps probabilities from (0, 1] onto the values;
# for 0 it returns a value just below the minimum:
print(distrib.ppf(0))
# 0.0
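To tie this back to the original question, the frequency table can be accumulated one vector at a time, so the vectors never have to be concatenated. The following is only a sketch under some assumptions: the data are integer-valued (or already binned into a manageable number of distinct values), the helper name freqs_from_chunks is hypothetical, and the random chunks merely stand in for the real large vectors.

```python
from collections import Counter

import numpy as np
import scipy.stats


def freqs_from_chunks(chunks):
    """Accumulate value counts chunk by chunk, without concatenating."""
    totals = Counter()
    for chunk in chunks:
        vals, n = np.unique(chunk, return_counts=True)
        # Counter.update adds the per-chunk counts to the running totals
        totals.update(dict(zip(vals.tolist(), n.tolist())))
    values = np.array(sorted(totals))
    return values, np.array([totals[v] for v in values])


# Hypothetical stand-ins for the large vectors (integer-valued here).
rng = np.random.default_rng(0)
chunks = [rng.integers(1, 7, size=1001) for _ in range(5)]

values, counts = freqs_from_chunks(chunks)
distrib = scipy.stats.rv_discrete(values=(values, counts / counts.sum()))

# 1st quartile, median, 3rd quartile of the whole ensemble
quartiles = distrib.ppf([0.25, 0.5, 0.75])
print(quartiles)
```

Only the table of distinct values and their counts is kept in memory, so this pays off when the number of distinct values is much smaller than the total number of samples.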