Question
I have two columns of data representing the same quantity; one column is from my training data, the other is from my validation data.
I know how to calculate the percentile rankings of the training data efficiently using:
pandas.DataFrame(training_data).rank(pct = True).values
My question is, how can I efficiently get a similar set of percentile rankings of the validation data column relative to the training data column? That is, for each value in the validation data column, how can I find what its percentile ranking would be relative to all the values in the training data column?
I've tried doing this:
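import numpy as np
import scipy.stats

def percentrank(input_data, comparison_data):
    rescaled_data = np.zeros(input_data.size)
    # One percentileofscore call per value: each call rescans comparison_data
    for idx, datum in enumerate(input_data):
        rescaled_data[idx] = scipy.stats.percentileofscore(comparison_data, datum)
    return rescaled_data / 100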
But I'm not sure if this is even correct, and on top of that it's incredibly slow because it is doing a lot of redundant calculations for each value in the for loop.
Any help would be greatly appreciated!
Answer 1:
Here's a solution. Sort the training data. Then use searchsorted on the validation data.
import pandas as pd
import numpy as np
# Generate Dummy Data
df_train = pd.DataFrame({'Values': 1000*np.random.rand(15712)})
#Sort Data
df_train = df_train.sort_values('Values')
# Calculate Rank and Rank_Pct for demo purposes only.
# They are not needed for the solution: the ranking of the
# validation data below does not depend on them.
df_train['Rank'] = df_train['Values'].rank()
df_train['Rank_Pct']= df_train.Values.rank(pct=True)
# Demonstrate how Rank Percentile is calculated
# This gives the same value as .rank(pct=True)
pct_increment = 1./len(df_train)
df_train['Rank_Pct_Manual'] = df_train.Rank*pct_increment
df_train.head()
        Values  Rank  Rank_Pct  Rank_Pct_Manual
2724  0.006174   1.0  0.000064         0.000064
3582  0.016264   2.0  0.000127         0.000127
5534  0.095691   3.0  0.000191         0.000191
944   0.141442   4.0  0.000255         0.000255
7566  0.161766   5.0  0.000318         0.000318
Now use searchsorted to get the Rank_Pct of the validation data:
# Generate Dummy Validation Data
df_validation = pd.DataFrame({'Values': 1000*np.random.rand(1000)})
# Note: searchsorted returns an array index (the insertion point).
# In a sorted array, the rank is the array index + 1.
df_validation['Rank_Pct'] = (1 + df_train.Values.searchsorted(df_validation.Values))*pct_increment
Here are the first few lines of the final df_validation dataframe:
print(df_validation.head())
       Values  Rank_Pct
0  307.378334  0.304290
1  744.247034  0.744208
2  669.223821  0.670825
3  149.797030  0.145621
4  317.742713  0.314218
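The sort-then-searchsorted idea above can also be written as a small NumPy-only helper. This is a minimal sketch, not the answer's own code: the function name rank_pct_relative and the toy arrays are made up for illustration, and it assumes one-dimensional numeric input without NaNs.

```python
import numpy as np

def rank_pct_relative(train_values, validation_values):
    """Percentile rank of each validation value relative to the training values.

    Hypothetical helper: mirrors the (1 + searchsorted index) * (1 / n) formula.
    """
    # Sort the training data once so every lookup is a binary search.
    sorted_train = np.sort(np.asarray(train_values, dtype=float))
    # Insertion index of each validation value in the sorted training data.
    positions = np.searchsorted(sorted_train, np.asarray(validation_values, dtype=float))
    # Rank is index + 1; divide by the training size to get a fraction.
    return (1 + positions) / len(sorted_train)

# Toy data, purely illustrative.
train = np.array([10.0, 30.0, 20.0, 50.0, 40.0])
validation = np.array([5.0, 25.0, 60.0])
result = rank_pct_relative(train, validation)
print(result)
```

One property of this formula worth knowing: a validation value larger than every training value maps to (n + 1)/n, slightly above 1, so clip the result if you need it to stay within [0, 1].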
Answer 2:
A small improvement to the nice solution above is to average the positions found by searching from the left and searching from the right:
df_validation['Rank_Pct'] = (0.5 + 0.5*df_train.Values.searchsorted(df_validation.Values, side='left') + 0.5*df_train.Values.searchsorted(df_validation.Values, side='right'))*pct_increment
This change matters when a value occurs multiple times. Consider searching for 2 in [1,2,2,2,4]: searching from the left gives index 1, while searching from the right gives index 4. The formula then yields a rank of 0.5 + 0.5*1 + 0.5*4 = 3, the average of the tied ranks 2, 3 and 4, which is the same percentile ranking as the pandas .rank(pct=True) routine gives.
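The tie-handling claim can be checked directly on the small example. This is a quick sketch; the variable names are illustrative only, and it relies on the default method='average' of pandas rank.

```python
import numpy as np
import pandas as pd

train = np.array([1, 2, 2, 2, 4])  # already sorted

# Insertion points for the value 2 from each side.
left = np.searchsorted(train, 2, side='left')
right = np.searchsorted(train, 2, side='right')

# Averaged rank, then percentile rank relative to the training data.
avg_rank = 0.5 + 0.5 * left + 0.5 * right
pct = avg_rank / len(train)

# Percentile ranks that pandas assigns within the training data itself.
pandas_pct = pd.Series(train).rank(pct=True)

print(left, right, avg_rank, pct)
print(pandas_pct.tolist())
```

The tied entries in pandas_pct should equal pct, which is the agreement the answer describes.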
Source: https://stackoverflow.com/questions/43145715/how-to-calculate-a-percentile-ranking-of-a-column-of-data-relative-to-another-co