I would like to select the top entries in a Pandas dataframe based on the entries of a specific column by using df_selected = df_targets.head(N).
Each
The method shown in my previous answer is now deprecated.
Instead, it is best to use pandas.Categorical, as shown here. So:
list_ordering = ["Likely Supporter","GOTV","Persuasion","Persuasion+GOTV"]
df["target"] = pd.Categorical(df["target"], categories=list_ordering)
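One caveat with the snippet above, sketched here with a made-up series (the value "Unknown" is hypothetical and not from the original data): any value that is not listed in categories is converted to NaN, so it is worth checking for that before sorting.

```python
import pandas as pd

list_ordering = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]

# "Unknown" is a hypothetical value that is not in list_ordering.
raw = pd.Series(["Persuasion", "Unknown", "GOTV"])
cat = pd.Categorical(raw, categories=list_ordering)

# Values outside the category list become missing.
n_missing = int(pd.isna(cat).sum())
print(n_missing)

# Sorting follows the category order; missing values go last by default.
sorted_values = pd.Series(cat).sort_values().tolist()
print(sorted_values)
```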
Thanks to jerzrael's input and references,
I like this sliced solution:
list_ordering = ["Likely Supporter","GOTV","Persuasion","Persuasion+GOTV"]
df["target"] = df["target"].astype("category", categories=list_ordering, ordered=True)
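Recent pandas versions no longer accept the categories and ordered keywords in astype, so the line above fails there. A minimal sketch of what I believe is the equivalent, using pd.CategoricalDtype; the sample frame is made up for illustration and only the column name "target" comes from the snippet above:

```python
import pandas as pd

# Sample data made up for illustration.
df = pd.DataFrame(
    {"target": ["GOTV", "Persuasion+GOTV", "Likely Supporter", "Persuasion"]}
)

list_ordering = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]

# Describe the ordered categories once, then cast the column to that dtype.
cat_dtype = pd.CategoricalDtype(categories=list_ordering, ordered=True)
df["target"] = df["target"].astype(cat_dtype)

# Sorting now follows list_ordering instead of alphabetical order.
sorted_targets = df.sort_values("target")["target"].tolist()
print(sorted_targets)
```

The dtype object can be reused for other columns or frames that share the same ordering.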
I think you need Categorical with the parameter ordered=True; sorting with sort_values then works very nicely.
The documentation of Categorical says:
Ordered Categoricals can be sorted according to the custom order of the categories and can have a min and max value.
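To illustrate the min and max part of that quote, here is a small sketch on a made-up series (the variable names are mine, not from the documentation). It also shows that an ordered categorical can be compared against one of its categories.

```python
import pandas as pd

order = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]
s = pd.Series(
    pd.Categorical(["GOTV", "Persuasion", "Likely Supporter"],
                   categories=order, ordered=True)
)

# min/max respect the custom order, not alphabetical order.
lowest = s.min()
highest = s.max()
print(lowest, highest)

# Comparisons against a category also follow the custom order.
below_persuasion = (s < "Persuasion").tolist()
print(below_persuasion)
```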
import pandas as pd
df = pd.DataFrame({'a': ['GOTV', 'Persuasion', 'Likely Supporter',
                         'GOTV', 'Persuasion', 'Persuasion+GOTV']})
df.a = pd.Categorical(df.a,
                      categories=["Likely Supporter","GOTV","Persuasion","Persuasion+GOTV"],
                      ordered=True)
print(df)
                  a
0              GOTV
1        Persuasion
2  Likely Supporter
3              GOTV
4        Persuasion
5   Persuasion+GOTV
print(df.a)
0                GOTV
1          Persuasion
2    Likely Supporter
3                GOTV
4          Persuasion
5     Persuasion+GOTV
Name: a, dtype: category
Categories (4, object): [Likely Supporter < GOTV < Persuasion < Persuasion+GOTV]
df.sort_values('a', inplace=True)
print(df)
                  a
2  Likely Supporter
0              GOTV
3              GOTV
1        Persuasion
4        Persuasion
5   Persuasion+GOTV
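To tie this back to the question, a possible sketch of selecting the top N rows after the sort. The names df_targets, df_selected and N come from the question; the value N = 3 and the choice of a stable sort (kind="mergesort", so tied rows keep their original order) are my assumptions.

```python
import pandas as pd

df_targets = pd.DataFrame({'a': ['GOTV', 'Persuasion', 'Likely Supporter',
                                 'GOTV', 'Persuasion', 'Persuasion+GOTV']})
df_targets['a'] = pd.Categorical(
    df_targets['a'],
    categories=["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"],
    ordered=True,
)

N = 3  # assumed value, pick whatever cutoff you need

# Sort by the custom category order, then keep the first N rows.
df_selected = df_targets.sort_values('a', kind='mergesort').head(N)
print(df_selected)
```

Returning a new frame instead of using inplace=True leaves df_targets untouched, which is convenient if the original row order is needed later.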