Question
Given a test dataset as follows:
id city name
0 1 bj NaN
1 2 bj jack
2 3 bj NaN
3 4 bj jim
4 5 sh NaN
5 6 sh NaN
6 7 sh steve
7 8 sh fiona
8 9 sh NaN
How can I group by city and drop the rows where name is NaN, while keeping exactly one NaN row per group? Many thanks.
The expected result looks like this:
id city name
0 1 bj NaN
1 2 bj jack
2 4 bj jim
3 5 sh NaN
4 7 sh steve
5 8 sh fiona
Here is a new dataset, read from an Excel file with df = pd.read_clipboard(na_filter=False). Please note that the string N/A should not be treated as NaN:
newcode build_name floor rent_id rent_name
0 1210010403 C栋 25 1765228 (株)有延商店上海事务所
1 1210010403 C栋 25 1765229 N/A
2 1210010403 C栋 25 1765229 N/A
3 1210010403 C栋 25 1765229 N/A
4 1210010403 C栋 25 1765230 上海皇瑾文化传媒有限公司
5 1210010403 C栋 25 1765229 N/A
6 1210010403 C栋 25 1765231 上海农邦实业有限公司
7 1210010403 C栋 25 1765232 N/A
8 1210010403 C栋 25 1765231 上海农NA邦实业有限公司
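For readers unfamiliar with na_filter, here is a minimal sketch of its effect. It assumes that pd.read_clipboard forwards its keyword arguments to pd.read_csv (as the pandas documentation states) and uses read_csv on an in-memory string as a stand-in for the clipboard. The two-row sample and the name "acme" are hypothetical.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the clipboard contents.
raw = "rent_id,rent_name\n1765228,acme\n1765229,N/A\n"

# Default parsing: 'N/A' is one of the built-in NA markers and becomes NaN.
default = pd.read_csv(io.StringIO(raw))

# With na_filter=False, no NA detection is done and 'N/A' stays a string.
literal = pd.read_csv(io.StringIO(raw), na_filter=False)

print(default["rent_name"].isna().tolist())  # [False, True]
print(literal["rent_name"].tolist())         # ['acme', 'N/A']
```

This is why the string comparisons with 'N/A' further down are needed instead of isna() or notna().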
Code:
df[df['rent_name'].ne('N/A') | ~df.duplicated(subset=['newcode', 'build_name', 'floor'])]
This gives the same result as:
df[~(df['rent_name'].eq('N/A') & df.duplicated(subset=['newcode', 'build_name', 'floor'], keep='first'))]
Out:
newcode build_name floor rent_id rent_name
0 1210010403 C栋 25 1765228 (株)有延商店上海事务所
4 1210010403 C栋 25 1765230 上海皇瑾文化传媒有限公司
6 1210010403 C栋 25 1765231 上海农邦实业有限公司
8 1210010403 C栋 25 1765231 上海农NA邦实业有限公司
As you can see, the one N/A row that should be kept is missing from the result, and I don't know why.
Desired output:
newcode build_name floor rent_id rent_name
0 1210010403 C栋 25 1765228 (株)有延商店上海事务所
1 1210010403 C栋 25 1765229 N/A
4 1210010403 C栋 25 1765230 上海皇瑾文化传媒有限公司
6 1210010403 C栋 25 1765231 上海农邦实业有限公司
8 1210010403 C栋 25 1765231 上海农NA邦实业有限公司
Answer 1:
Chain two conditions with |: keep a row if its name is not missing, or if it is the first occurrence of its (city, name) pair:
df = df[df['name'].notna() | ~df.duplicated(subset=['city', 'name'])]
print(df)
id city name
0 1 bj NaN
1 2 bj jack
3 4 bj jim
4 5 sh NaN
6 7 sh steve
7 8 sh fiona
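For a self-contained run, the question's first dataset can be rebuilt by hand, as in the sketch below. The reset_index(drop=True) call is an addition that is not part of the answer; it is only there to reproduce the 0..5 index shown in the question's expected result.

```python
import numpy as np
import pandas as pd

# The first test dataset from the question, rebuilt by hand.
df = pd.DataFrame({
    "id": range(1, 10),
    "city": ["bj"] * 4 + ["sh"] * 5,
    "name": [np.nan, "jack", np.nan, "jim",
             np.nan, np.nan, "steve", "fiona", np.nan],
})

# Keep a row if its name is present, or if it is the first occurrence
# of its (city, name) pair. duplicated() treats NaN values as equal to
# each other, so only the first NaN row of each city survives.
mask = df["name"].notna() | ~df.duplicated(subset=["city", "name"])

# Renumber the index to match the expected result in the question.
result = df[mask].reset_index(drop=True)
print(result)
```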
EDIT: To test for the string N/A instead of NaN, use Series.ne:
df = df[df['name'].ne('N/A') | ~df.duplicated(subset=['city', 'name'])]
print(df)
id city name
0 1 bj N/A
1 2 bj jack
3 4 bj jim
4 5 sh N/A
6 7 sh steve
7 8 sh fiona
To test for several marker values at once, use Series.isin with an inverted mask:
df = df[~df['name'].isin(['N/N','N/A']) | ~df.duplicated(subset=['city', 'name'])]
print(df)
id city name
0 1 bj N/A
1 2 bj jack
3 4 bj jim
4 5 sh N/A
6 7 sh steve
7 8 sh fiona
EDIT: The same mask also keeps one NaN row for a group that contains only NaN values (here gz):
df = df[df['name'].notna() | ~df.duplicated(subset=['city', 'name'])]
print(df)
id city name
0 1 bj NaN
1 2 bj jack
3 4 bj jim
4 5 sh NaN
6 7 sh steve
7 8 sh fiona
9 10 gz NaN
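The input that produced the gz row is not shown in the answer. As a rough illustration only, the sketch below uses hypothetical rows (the frame df_gz and its values are assumptions, not the answerer's data) to show that a group made up entirely of NaN names still keeps exactly one row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'gz' has only NaN names, 'sh' has one NaN and one real name.
df_gz = pd.DataFrame({
    "id": [8, 9, 10, 11],
    "city": ["sh", "sh", "gz", "gz"],
    "name": ["fiona", np.nan, np.nan, np.nan],
})

# Same mask as in the answer: named rows, plus the first (city, name) occurrence.
mask = df_gz["name"].notna() | ~df_gz.duplicated(subset=["city", "name"])
kept = df_gz[mask]

print(kept["id"].tolist())  # [8, 9, 10]
```

The second gz row (id 11) is dropped as a duplicate of the first, so the group is reduced to a single NaN row rather than disappearing.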
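EDIT1: When testing for duplicates, the column that holds the missing-value marker (here rent_name) must be included in subset. Otherwise every row after the first counts as a duplicate of row 0, which is not an N/A row, so no N/A row is ever marked as a first occurrence: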
df = df[df['rent_name'].ne('N/A') |
~df.duplicated(subset=['newcode', 'build_name', 'floor', 'rent_name'])]
print(df)
newcode build_name floor rent_id rent_name
0 1210010403 C栋 25 1765228 (株)有延商店上海事务所
1 1210010403 C栋 25 1765229 N/A
4 1210010403 C栋 25 1765230 上海皇瑾文化传媒有限公司
6 1210010403 C栋 25 1765231 上海农邦实业有限公司
8 1210010403 C栋 25 1765231 上海农NA邦实业有限公司
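To see why the subset matters, the sketch below rebuilds the second dataset by hand and compares the two masks side by side. The variable names keys, not_na, without_name and with_name are illustrative and do not come from the original posts.

```python
import pandas as pd

# The second dataset from the question; 'N/A' is a plain string here.
df = pd.DataFrame({
    "newcode": [1210010403] * 9,
    "build_name": ["C栋"] * 9,
    "floor": [25] * 9,
    "rent_id": [1765228, 1765229, 1765229, 1765229, 1765230,
                1765229, 1765231, 1765232, 1765231],
    "rent_name": ["(株)有延商店上海事务所", "N/A", "N/A", "N/A",
                  "上海皇瑾文化传媒有限公司", "N/A", "上海农邦实业有限公司",
                  "N/A", "上海农NA邦实业有限公司"],
})

keys = ["newcode", "build_name", "floor"]
not_na = df["rent_name"].ne("N/A")

# Without rent_name in the subset, only row 0 is a first occurrence,
# and row 0 is not an N/A row, so every N/A row is dropped.
without_name = df[not_na | ~df.duplicated(subset=keys)]

# With rent_name in the subset, row 1 is the first (keys, 'N/A')
# combination and is therefore kept.
with_name = df[not_na | ~df.duplicated(subset=keys + ["rent_name"])]

print(without_name.index.tolist())  # [0, 4, 6, 8]
print(with_name.index.tolist())     # [0, 1, 4, 6, 8]
```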
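Answer 2:
Build a boolean mask that selects the rows where name is NaN and the (city, name) pair has already appeared (keep='first' leaves the first occurrence unmarked), then invert it to drop those rows: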
df[~(df.name.isna() & df.duplicated(subset = ['city', 'name'], keep = 'first'))]
id city name
0 1 bj NaN
1 2 bj jack
3 4 bj jim
4 5 sh NaN
6 7 sh steve
7 8 sh fiona
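This mask is the De Morgan form of "name is present, or the row is a first occurrence", so both ways of writing it select the same rows. A minimal check on the question's first dataset, with illustrative variable names (dup, mask_or, mask_and):

```python
import numpy as np
import pandas as pd

# The first test dataset from the question, rebuilt by hand.
df = pd.DataFrame({
    "id": range(1, 10),
    "city": ["bj"] * 4 + ["sh"] * 5,
    "name": [np.nan, "jack", np.nan, "jim",
             np.nan, np.nan, "steve", "fiona", np.nan],
})

# True for every row whose (city, name) pair has been seen before.
dup = df.duplicated(subset=["city", "name"], keep="first")

mask_or = df["name"].notna() | ~dup    # "present OR first occurrence"
mask_and = ~(df["name"].isna() & dup)  # "NOT (missing AND repeated)"

# The two masks agree on every row.
same = bool((mask_or == mask_and).all())
print(same)  # True
print(df[mask_and])
```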
Source: https://stackoverflow.com/questions/65352814/groupby-and-drop-nan-rows-while-preserving-one-in-pandas