Question
I am using the following code to calculate the quartiles of a given data set:
#!/usr/bin/python
import numpy as np
series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]
p1 = 25
p2 = 50
p3 = 75
q1 = np.percentile(series, p1)
q2 = np.percentile(series, p2)
q3 = np.percentile(series, p3)
print('percentile(' + str(p1) + '): ' + str(q1))
print('percentile(' + str(p2) + '): ' + str(q2))
print('percentile(' + str(p3) + '): ' + str(q3))
The percentile function returns the quartiles. However, I would also like to get the indices it used to mark the boundaries of the quartiles. Is there any way to do this?
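Answer 1:
Since the data is sorted, you could just use numpy.searchsorted, which returns the indices at which the values would have to be inserted to maintain sorted order. The side argument controls whether you get the position before the first matching element ('left', the default) or after the last one ('right').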
>>> np.searchsorted(series,q1)
1
>>> np.searchsorted(series,q1,side='right')
11
>>> np.searchsorted(series,q2)
1
>>> np.searchsorted(series,q3)
11
>>> np.searchsorted(series,q3,side='right')
13
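If what you want is the positions that went into the calculation itself, they can be worked out directly. This is a minimal sketch, assuming the data is sorted and that np.percentile is using its default linear interpolation, where the p-th percentile sits at the fractional position (n - 1) * p / 100. The helper name percentile_positions is made up for this example and is not part of NumPy.

```python
import numpy as np

series = [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 6, 7, 8]

def percentile_positions(data, p):
    """Hypothetical helper: fractional position of percentile p in sorted
    data, plus the two neighbouring integer indices (assumes NumPy's
    default linear interpolation)."""
    pos = (len(data) - 1) * p / 100.0
    lower = int(np.floor(pos))
    upper = int(np.ceil(pos))
    return pos, lower, upper

for p in (25, 50, 75):
    pos, lower, upper = percentile_positions(series, p)
    # Interpolate by hand between the two neighbours and compare with NumPy.
    by_hand = series[lower] + (pos - lower) * (series[upper] - series[lower])
    print(p, pos, lower, upper, by_hand, np.percentile(series, p))
```

For the 75th percentile of this series the position is 11.25, so the value is interpolated between indices 11 and 12. Unlike searchsorted, this also gives sensible indices when the quartile falls between two different values.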
Answer 2:
Try this:
import numpy as np
import pandas as pd
series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]
thresholds = [25,50,75]
output = pd.DataFrame([np.percentile(series,x) for x in thresholds], index = thresholds, columns = ['quartiles'])
output
By making it a DataFrame, you can easily label each quartile with its percentile threshold through the index.
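Note that the DataFrame index here holds the percentile thresholds, not positions in the data. One possible way to get the positions into the same table is to combine this with numpy.searchsorted. This is only a sketch, assuming sorted data; the column names first_index and last_index are invented for the example.

```python
import numpy as np
import pandas as pd

series = [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 6, 7, 8]
thresholds = [25, 50, 75]

# np.percentile accepts a list of percentiles and returns one value per entry.
quartiles = np.percentile(series, thresholds)

output = pd.DataFrame(
    {
        "quartiles": quartiles,
        # Index of the first element equal to the quartile value.
        "first_index": np.searchsorted(series, quartiles, side="left"),
        # Index of the last element equal to the quartile value.
        "last_index": np.searchsorted(series, quartiles, side="right") - 1,
    },
    index=thresholds,
)
print(output)
```

If a quartile is interpolated to a value that does not occur in the data, last_index ends up smaller than first_index, which you can use as a signal that there is no exact match.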
Answer 3:
Assuming that the data is always sorted (thanks @juanpa.arrivillaga), you can use the rank method from the Pandas Series class. rank() takes several arguments. One of them is pct:
pct : boolean, default False
Computes percentage rank of data
There are different ways of calculating the percentage rank. These methods are controlled by the argument method:
method : {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}
You need the method "max":
max: highest rank in group
Let's look at the output of the rank() method with these parameters:
import numpy as np
import pandas as pd
series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]
S = pd.Series(series)
percentage_rank = S.rank(method="max", pct=True)
print(percentage_rank)
This basically gives you the percentile for every entry in the Series:
0 0.0625
1 0.6875
2 0.6875
3 0.6875
4 0.6875
5 0.6875
6 0.6875
7 0.6875
8 0.6875
9 0.6875
10 0.6875
11 0.8125
12 0.8125
13 0.8750
14 0.9375
15 1.0000
dtype: float64
In order to retrieve the index for the three percentiles, you look up the first element in the Series that has an equal or higher percentage rank than the percentile you're interested in. The index of that element is the index that you need.
index25 = S.index[percentage_rank >= 0.25][0]
index50 = S.index[percentage_rank >= 0.50][0]
index75 = S.index[percentage_rank >= 0.75][0]
print("25 percentile: index {}, value {}".format(index25, S[index25]))
print("50 percentile: index {}, value {}".format(index50, S[index50]))
print("75 percentile: index {}, value {}".format(index75, S[index75]))
This gives you the output:
25 percentile: index 1, value 2
50 percentile: index 1, value 2
75 percentile: index 11, value 5
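All of the above relies on the data being sorted. If it is not, one option is to sort with numpy.argsort and map the position back to the original array. This is a different technique from the rank() approach, shown only as a sketch. The unsorted list is the same values as the question's series in shuffled order, and original_index is a hypothetical helper name.

```python
import numpy as np

# Same values as the question's series, but in arbitrary order.
unsorted = [5, 2, 8, 2, 1, 2, 7, 2, 2, 6, 2, 2, 5, 2, 2, 2]

# order[i] is the position in `unsorted` of the i-th smallest value.
order = np.argsort(unsorted, kind="stable")
sorted_values = np.asarray(unsorted)[order]

def original_index(p):
    """Hypothetical helper: index in the unsorted data of the first
    element (in sorted order) that is >= the p-th percentile."""
    q = np.percentile(sorted_values, p)
    pos = np.searchsorted(sorted_values, q, side="left")
    return int(order[pos])

for p in (25, 50, 75):
    idx = original_index(p)
    print("{} percentile: index {}, value {}".format(p, idx, unsorted[idx]))
```

The stable sort keeps equal values in their original relative order, so among ties the earliest original index is the one reported.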
Source: https://stackoverflow.com/questions/42958697/python-get-array-indexes-of-quartiles