Groupby and drop NaN rows while preserving one in Pandas

Submitted by 余生长醉 on 2021-02-17 03:33:05

Question


Given a test dataset as follows:

   id city   name
0   1   bj    NaN
1   2   bj   jack
2   3   bj    NaN
3   4   bj    jim
4   5   sh    NaN
5   6   sh    NaN
6   7   sh  steve
7   8   sh  fiona
8   9   sh    NaN

How can I group by city and drop the rows where name is NaN, while keeping exactly one NaN row per group? Many thanks.

The expected result looks like this:

   id city   name
0   1   bj    NaN
1   2   bj   jack
2   4   bj    jim
3   5   sh    NaN
4   7   sh  steve
5   8   sh  fiona

A new dataset, read from an Excel file with df = pd.read_clipboard(na_filter=False). Note that N/A is a literal string here and should not be treated as NaN:

      newcode build_name  floor  rent_id      rent_name
0  1210010403         C栋     25  1765228   (株)有延商店上海事务所
1  1210010403         C栋     25  1765229            N/A
2  1210010403         C栋     25  1765229            N/A
3  1210010403         C栋     25  1765229            N/A
4  1210010403         C栋     25  1765230  上海皇瑾文化传媒有限公司 
5  1210010403         C栋     25  1765229            N/A
6  1210010403         C栋     25  1765231     上海农邦实业有限公司
7  1210010403         C栋     25  1765232            N/A
8  1210010403         C栋     25  1765231   上海农NA邦实业有限公司
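
To illustrate what na_filter=False changes, here is a minimal sketch using pd.read_csv, which takes the same na_filter parameter (read_clipboard forwards its keyword arguments to read_csv). The two-row CSV text and the variable names are made up for illustration and are not the asker's data.

```python
import io

import pandas as pd

# Hypothetical two-row sample; "N/A" is one of pandas' default NA markers.
raw = "rent_id,rent_name\n1765228,acme\n1765229,N/A\n"

# Default parsing converts the text "N/A" into a missing value (NaN).
default = pd.read_csv(io.StringIO(raw))

# With na_filter=False the text is kept as the literal string "N/A".
literal = pd.read_csv(io.StringIO(raw), na_filter=False)

print(default["rent_name"].isna().tolist())
print(literal["rent_name"].tolist())
```

With the literal strings, isna()/notna() will not find these rows, so the comparison has to be made against the string 'N/A' instead.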

Code: df[df['rent_name'].ne('N/A') | ~df.duplicated(subset=['newcode', 'build_name', 'floor'])], which gives the same result as df[~(df['rent_name'].eq('N/A') & df.duplicated(subset=['newcode', 'build_name', 'floor'], keep='first'))]

Out:

      newcode build_name  floor  rent_id      rent_name
0  1210010403         C栋     25  1765228   (株)有延商店上海事务所
4  1210010403         C栋     25  1765230  上海皇瑾文化传媒有限公司 
6  1210010403         C栋     25  1765231     上海农邦实业有限公司
8  1210010403         C栋     25  1765231   上海农NA邦实业有限公司

As you can see, the one N/A row that should be kept is missing from the result, and I don't know why.

Desired output:

      newcode build_name  floor  rent_id      rent_name
0  1210010403         C栋     25  1765228   (株)有延商店上海事务所
1  1210010403         C栋     25  1765229            N/A
4  1210010403         C栋     25  1765230  上海皇瑾文化传媒有限公司 
6  1210010403         C栋     25  1765231     上海农邦实业有限公司
8  1210010403         C栋     25  1765231   上海农NA邦实业有限公司

Answer 1:


Combine two conditions with | (OR): keep a row if its name is not missing, or if it is the first occurrence of its (city, name) pair:

df = df[df['name'].notna() | ~df.duplicated(subset=['city', 'name'])]
print(df)
   id city   name
0   1   bj    NaN
1   2   bj   jack
3   4   bj    jim
4   5   sh    NaN
6   7   sh  steve
7   8   sh  fiona
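
For anyone who wants to run this end to end, here is a self-contained sketch. The DataFrame is rebuilt by hand from the question's sample table, and the names keep and result are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample data; NaN marks a missing name.
df = pd.DataFrame({
    "id": range(1, 10),
    "city": ["bj"] * 4 + ["sh"] * 5,
    "name": [np.nan, "jack", np.nan, "jim",
             np.nan, np.nan, "steve", "fiona", np.nan],
})

# duplicated() treats NaN values as equal to each other, so the second and
# later NaN rows of a city are flagged as repeats of the first one.
is_repeat = df.duplicated(subset=["city", "name"])

# Keep rows with a real name, plus the first NaN row of each city.
keep = df["name"].notna() | ~is_repeat
result = df[keep]
print(result)
```

The filtered frame keeps the original index labels (0, 1, 3, 4, 6, 7). Chaining .reset_index(drop=True) renumbers them 0 to 5, as in the question's expected output.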

EDIT: If the missing values are stored as the string N/A, test for them with Series.ne:

df = df[df['name'].ne('N/A') | ~df.duplicated(subset=['city', 'name'])]
print(df)
   id city   name
0   1   bj    N/A
1   2   bj   jack
3   4   bj    jim
4   5   sh    N/A
6   7   sh  steve
7   8   sh  fiona

To test for several placeholder values at once, use Series.isin and invert the mask:

df = df[~df['name'].isin(['N/N','N/A']) | ~df.duplicated(subset=['city', 'name'])]
print(df)

   id city   name
0   1   bj    N/A
1   2   bj   jack
3   4   bj    jim
4   5   sh    N/A
6   7   sh  steve
7   8   sh  fiona
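
A small sketch of the isin variant on made-up data that contains both placeholders (the rows and the placeholders list are assumptions for illustration). Because name is part of the duplicated subset, 'N/A' and 'N/N' count as different values, so one row of each placeholder can survive per city.

```python
import pandas as pd

# Hypothetical data mixing two placeholder strings.
df = pd.DataFrame({
    "id": range(1, 8),
    "city": ["bj", "bj", "bj", "bj", "sh", "sh", "sh"],
    "name": ["N/A", "jack", "N/A", "N/N", "N/N", "steve", "N/N"],
})

placeholders = ["N/N", "N/A"]

# Keep real names, plus the first occurrence of each (city, placeholder) pair.
keep = ~df["name"].isin(placeholders) | ~df.duplicated(subset=["city", "name"])
result = df[keep]
print(result)
```

In this sample, bj keeps one N/A row and one N/N row, while sh keeps a single N/N row.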

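EDIT: The same mask also handles a city whose only row has a NaN name (here an extra row with id 10 in gz); that single row is kept: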

df = df[df['name'].notna() | ~df.duplicated(subset=['city', 'name'])]
print(df)
   id city   name
0   1   bj    NaN
1   2   bj   jack
3   4   bj    jim
4   5   sh    NaN
6   7   sh  steve
7   8   sh  fiona
9  10   gz    NaN
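
A runnable sketch of this case follows. The input is reconstructed from the output above, assuming the only change is an appended row (id 10, city gz, name NaN). The groupby at the end is just a sanity check that each city ends up with exactly one NaN row.

```python
import numpy as np
import pandas as pd

# Sample data plus an assumed extra row for a city with no real names.
df = pd.DataFrame({
    "id": range(1, 11),
    "city": ["bj"] * 4 + ["sh"] * 5 + ["gz"],
    "name": [np.nan, "jack", np.nan, "jim",
             np.nan, np.nan, "steve", "fiona", np.nan, np.nan],
})

result = df[df["name"].notna() | ~df.duplicated(subset=["city", "name"])]

# Count the NaN names that remain in each city.
nan_per_city = result["name"].isna().astype(int).groupby(result["city"]).sum()
print(result)
print(nan_per_city.to_dict())
```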

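EDIT1: The duplicate test must also include the column that holds the missing-value marker, here rent_name. Without it, every N/A row comes after the first row that shares its newcode, build_name and floor, so it counts as a duplicate and is dropped. With rent_name in the subset, the first N/A row is no longer a duplicate: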

df = df[df['rent_name'].ne('N/A') |
        ~df.duplicated(subset=['newcode', 'build_name', 'floor', 'rent_name'])]
print(df)

      newcode build_name  floor  rent_id     rent_name
0  1210010403         C栋     25  1765228  (株)有延商店上海事务所
1  1210010403         C栋     25  1765229           N/A
4  1210010403         C栋     25  1765230  上海皇瑾文化传媒有限公司
6  1210010403         C栋     25  1765231    上海农邦实业有限公司
8  1210010403         C栋     25  1765231  上海农NA邦实业有限公司
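The effect of the subset can be checked directly. The sketch below rebuilds the question's second table by hand (the original was read from the clipboard) and compares the two subsets. The names group_cols, narrow and wide are illustrative.

```python
import pandas as pd

# Hand-built copy of the question's second table; "N/A" is a plain string.
df = pd.DataFrame({
    "newcode": [1210010403] * 9,
    "build_name": ["C栋"] * 9,
    "floor": [25] * 9,
    "rent_id": [1765228, 1765229, 1765229, 1765229, 1765230,
                1765229, 1765231, 1765232, 1765231],
    "rent_name": ["(株)有延商店上海事务所", "N/A", "N/A", "N/A",
                  "上海皇瑾文化传媒有限公司", "N/A", "上海农邦实业有限公司",
                  "N/A", "上海农NA邦实业有限公司"],
})

group_cols = ["newcode", "build_name", "floor"]
not_na = df["rent_name"].ne("N/A")

# Without rent_name: row 0 is the group's first row, so every later row,
# including the first N/A row, is flagged as a duplicate.
narrow = df[not_na | ~df.duplicated(subset=group_cols)]

# With rent_name: the first N/A row is a first occurrence and survives.
wide = df[not_na | ~df.duplicated(subset=group_cols + ["rent_name"])]

print(narrow.index.tolist())
print(wide.index.tolist())
```

The narrow subset only works when the first row of a group happens to be an N/A row. Including rent_name removes that dependence on row order.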



Answer 2:


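Build a boolean mask of rows whose name is NaN and that repeat an earlier (city, name) pair (keep='first' leaves the first occurrence unflagged), then invert it to drop those rows: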

df[~(df.name.isna() & df.duplicated(subset = ['city', 'name'], keep = 'first'))]

    id city   name
0   1   bj    NaN
1   2   bj   jack
3   4   bj    jim
4   5   sh    NaN
6   7   sh  steve
7   8   sh  fiona
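
By De Morgan's law, this mask is the same one as the "not missing OR first occurrence" form: ~(A & B) equals ~A | ~B. A quick sketch to confirm it on the question's sample data (rebuilt by hand; variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 10),
    "city": ["bj"] * 4 + ["sh"] * 5,
    "name": [np.nan, "jack", np.nan, "jim",
             np.nan, np.nan, "steve", "fiona", np.nan],
})

is_repeat = df.duplicated(subset=["city", "name"], keep="first")

# "Drop rows that are NaN and a repeat" ...
mask_not_and = ~(df["name"].isna() & is_repeat)
# ... versus "keep rows that are not NaN or not a repeat".
mask_or = df["name"].notna() | ~is_repeat

print(mask_not_and.equals(mask_or))
```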


Source: https://stackoverflow.com/questions/65352814/groupby-and-drop-nan-rows-while-preserving-one-in-pandas
