Python: Get array indexes of quartiles

Question

I am using the following code to calculate the quartiles of a given data set:

#!/usr/bin/python

import numpy as np

series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]

p1 = 25
p2 = 50
p3 = 75

q1 = np.percentile(series,  p1)
q2 = np.percentile(series,  p2)
q3 = np.percentile(series,  p3)

print('percentile(' + str(p1) + '): ' + str(q1))
print('percentile(' + str(p2) + '): ' + str(q2))
print('percentile(' + str(p3) + '): ' + str(q3))

The percentile function returns the quartiles. However, I would also like to get the indexes that it used to mark the boundaries of the quartiles. Is there any way to do this?

Answer 1:

Since the data is sorted, you can use numpy.searchsorted, which returns the index at which a value would have to be inserted to keep the array sorted. The side argument selects whether that index falls before ('left', the default) or after ('right') any entries equal to the value.

>>> np.searchsorted(series,q1)
1
>>> np.searchsorted(series,q1,side='right')
11
>>> np.searchsorted(series,q2)
1
>>> np.searchsorted(series,q3)
11
>>> np.searchsorted(series,q3,side='right')
13
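The calls above can be wrapped in a loop that collects both sides for each quartile. This is only a sketch: the helper name quartile_index_ranges is made up for illustration, and it assumes the input is already sorted.

```python
import numpy as np

series = [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 6, 7, 8]  # assumed to be sorted


def quartile_index_ranges(sorted_data, percentiles=(25, 50, 75)):
    """Hypothetical helper: map each percentile to (value, left, right).

    left  = index of the first entry >= value
    right = index one past the last entry <= value
    """
    arr = np.asarray(sorted_data)
    result = {}
    for p in percentiles:
        value = np.percentile(arr, p)
        left = int(np.searchsorted(arr, value, side="left"))
        right = int(np.searchsorted(arr, value, side="right"))
        result[p] = (float(value), left, right)
    return result


ranges = quartile_index_ranges(series)
for p, (value, left, right) in ranges.items():
    print("percentile({}): value {}, indexes [{}, {})".format(p, value, left, right))
```

For this series, the 25th and 50th percentiles both map to (2.0, 1, 11) and the 75th to (5.0, 11, 13), matching the interactive session above.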

Answer 2:

Try this:

import numpy as np
import pandas as pd
series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]
thresholds = [25,50,75]
output = pd.DataFrame([np.percentile(series,x) for x in thresholds], index = thresholds, columns = ['quartiles'])
output

By making it a DataFrame, you can label each quartile with its percentile threshold through the index argument.
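Note that the DataFrame index here holds the thresholds, not positions in the array. If the array positions are what you are after, one option is to add them as columns. The sketch below assumes the data is sorted and that np.percentile is using its default linear interpolation, where the quartile sits at the fractional position (n - 1) * p / 100 and is interpolated between the two neighbouring entries. The column names lower_index and upper_index are made up for illustration.

```python
import math

import numpy as np
import pandas as pd

series = [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 6, 7, 8]  # assumed to be sorted
thresholds = [25, 50, 75]

rows = []
for p in thresholds:
    # Fractional position under the default linear interpolation
    pos = (len(series) - 1) * p / 100
    rows.append({
        "quartile": np.percentile(series, p),
        "lower_index": math.floor(pos),  # entry just below the position
        "upper_index": math.ceil(pos),   # entry just above the position
    })

output = pd.DataFrame(rows, index=thresholds)
print(output)
```

With 16 entries the positions are 3.75, 7.5 and 11.25, so the quartiles are interpolated between indexes 3 and 4, 7 and 8, and 11 and 12 respectively.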

Answer 3:

Assuming that the data is always sorted (thanks @juanpa.arrivillaga), you can use the rank method from the Pandas Series class. rank() takes several arguments. One of them is pct:

pct : boolean, default False

Computes percentage rank of data

There are different ways of calculating the percentage rank. These methods are controlled by the argument method:

method : {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}

You need the method "max":

max: highest rank in group

Let's look at the output of the rank() method with these parameters:

import numpy as np
import pandas as pd

series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]

S = pd.Series(series)
percentage_rank = S.rank(method="max", pct=True)
print(percentage_rank)

This gives you, for every entry in the Series, the fraction of entries that are less than or equal to it, which is effectively its percentile:

0     0.0625
1     0.6875
2     0.6875
3     0.6875
4     0.6875
5     0.6875
6     0.6875
7     0.6875
8     0.6875
9     0.6875
10    0.6875
11    0.8125
12    0.8125
13    0.8750
14    0.9375
15    1.0000
dtype: float64

In order to retrieve the index for the three percentiles, you look up the first element in the Series that has an equal or higher percentage rank than the percentile you're interested in. The index of that element is the index that you need.

index25 = S.index[percentage_rank >= 0.25][0]
index50 = S.index[percentage_rank >= 0.50][0]
index75 = S.index[percentage_rank >= 0.75][0]

print("25 percentile: index {}, value {}".format(index25, S[index25]))
print("50 percentile: index {}, value {}".format(index50, S[index50]))
print("75 percentile: index {}, value {}".format(index75, S[index75]))

This gives you the output:

25 percentile: index 1, value 2
50 percentile: index 1, value 2
75 percentile: index 11, value 5
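If the data is not sorted, the same lookup can still be mapped back to positions in the original array. The following is a rough sketch rather than part of the answer above: it sorts with np.argsort, finds the position in the sorted copy with np.searchsorted, and translates that back through the sort order. The shuffled array is made up for illustration and contains the same values as the series in the question.

```python
import numpy as np

# Same values as the series in the question, in shuffled order (illustrative data)
data = np.array([5, 2, 8, 2, 1, 2, 7, 2, 2, 6, 2, 2, 5, 2, 2, 2])

# order[i] is the original index of the i-th smallest value;
# a stable sort keeps equal values in their original relative order
order = np.argsort(data, kind="stable")
sorted_data = data[order]

original_indexes = {}
for p in (25, 50, 75):
    q = np.percentile(data, p)
    # Position of the first sorted entry that is >= the percentile value
    pos = int(np.searchsorted(sorted_data, q, side="left"))
    original_indexes[p] = int(order[pos])

for p, idx in original_indexes.items():
    print("{} percentile: original index {}, value {}".format(p, idx, data[idx]))
```

Because several entries share the same value, the index returned is the first matching one in the original order; other tie-breaking rules are equally valid.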

Source: https://stackoverflow.com/questions/42958697/python-get-array-indexes-of-quartiles

Tags

python

numpy

percentile

quartile