Question
I am working with a dataframe where I have weighted each row by its probability. Now I want to select the row with the highest probability, and I am using pandas idxmax() to do so. However, when there are ties, it returns only the first of the tied rows. In my case, I want to get all the rows that tie.
Furthermore, I am doing this as part of a research project where I am processing millions of dataframes like the one below, so speed matters.
Example:
My data looks like this:
data = [['chr1',100,200,0.2],
['ch1',300,500,0.3],
['chr1', 300, 500, 0.3],
['chr1', 600, 800, 0.3]]
From this list, I create a pandas dataframe as follows:
weighted = pd.DataFrame.from_records(data,columns=['chrom','start','end','probability'])
Which looks like this:
  chrom  start  end  probability
0  chr1    100  200          0.2
1   ch1    300  500          0.3
2  chr1    300  500          0.3
3  chr1    600  800          0.3
I then select the row with the maximum probability using:
selected = weighted.loc[weighted['probability'].idxmax()]
Which of course returns:
chrom ch1
start 300
end 500
probability 0.3
Name: 1, dtype: object
Is there a (fast) way to get all the rows when there are ties?
Thanks!
Answer 1:
Well, this might be the solution you are looking for:
weighted.loc[weighted['probability']==weighted['probability'].max()].T
# 1 2 3
#chrom ch1 chr1 chr1
#start 300 300 600
#end 500 500 800
#probability 0.3 0.3 0.3
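For reference, here is a self-contained sketch of the same idea using the data from the question. The names is_max and ties are only illustrative and do not appear in the original post; the trailing .T is dropped so the result keeps one row per tie.

```python
import pandas as pd

data = [['chr1', 100, 200, 0.2],
        ['ch1', 300, 500, 0.3],
        ['chr1', 300, 500, 0.3],
        ['chr1', 600, 800, 0.3]]
weighted = pd.DataFrame.from_records(
    data, columns=['chrom', 'start', 'end', 'probability'])

# Boolean mask: True for every row whose probability equals the column maximum.
is_max = weighted['probability'] == weighted['probability'].max()

# .loc with a boolean mask keeps all matching rows, not just the first one.
ties = weighted.loc[is_max]
print(ties)
```

Because the comparison is an exact equality, it only groups rows whose probabilities are bit-for-bit identical, which is the case for the values in this example.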
Answer 2:
The bottleneck lies in computing the Boolean indexer. You can bypass the overhead associated with pd.Series objects by performing the comparison on the underlying NumPy array:
df2 = df[df['probability'].values == df['probability'].values.max()]
Performance benchmark against the pandas equivalent:
# tested on Pandas v0.19.2, Python 3.6.0
df = pd.concat([df]*100000, ignore_index=True)
%timeit df['probability'].eq(df['probability'].max()) # 3.78 ms per loop
%timeit df['probability'].values == df['probability'].values.max() # 416 µs per loop
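The NumPy-based selection could be wrapped in a small helper along these lines. This is only a sketch: rows_at_max is a hypothetical name, and it assumes a pandas version that provides Series.to_numpy() (0.24 or later); on older versions .values serves the same purpose here.

```python
import pandas as pd

def rows_at_max(df, column):
    """Return every row of df whose value in `column` equals the column maximum."""
    # Work on the raw NumPy array to avoid pd.Series overhead.
    values = df[column].to_numpy()
    # Note: unlike Series.max(), ndarray.max() propagates NaN, so a NaN in
    # the column would make the comparison match no rows at all.
    return df[values == values.max()]

df = pd.DataFrame({
    'chrom': ['chr1', 'ch1', 'chr1', 'chr1'],
    'start': [100, 300, 300, 600],
    'end': [200, 500, 500, 800],
    'probability': [0.2, 0.3, 0.3, 0.3],
})

result = rows_at_max(df, 'probability')
print(result)
```

If the probability column can contain missing values, np.nanmax(values) is one way to restore the NaN-skipping behaviour of the pandas version.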
Source: https://stackoverflow.com/questions/52588298/pandas-idxmax-return-all-rows-in-case-of-ties