How to conditionally remove duplicates from a pandas dataframe

后端 未结 3 2090
梦如初夏
梦如初夏 2021-01-02 02:41

Consider the following dataframe

import pandas as pd
df = pd.DataFrame({\'A\' : [1, 2, 3, 3, 4, 4, 5, 6, 7],
                   \'B\' : [\'a\',\'b\',\'c\',\'         


        
相关标签:
3条回答
  • 2021-01-02 02:50

    Here's an alternative:

    df[~((df[['A', 'B']].duplicated(keep=False)) & (df.isnull().any(axis=1)))]
    #    A  B Col_1  Col_2
    # 0  1  a   NaN      2
    # 1  2  b     A      2
    # 2  3  c     A      3
    # 4  4  d     B      3
    # 6  5  e     B      4
    # 7  6  f   NaN      4
    # 8  7  g   NaN      5
    

    This uses the bitwise "not" operator ~ to negate rows that meet the joint condition of being a duplicate row (the argument keep=False causes the method to evaluate to True for all non-unique rows) and containing at least one null value. So where the expression df[['A', 'B']].duplicated(keep=False) returns this Series:

    # 0    False
    # 1    False
    # 2     True
    # 3     True
    # 4     True
    # 5     True
    # 6    False
    # 7    False
    # 8    False
    

    ...and the expression df.isnull().any(axis=1) returns this Series:

    # 0     True
    # 1    False
    # 2    False
    # 3     True
    # 4    False
    # 5     True
    # 6    False
    # 7     True
    # 8     True
    

    ... we wrap both in parentheses (required by Pandas syntax whenever using multiple expressions in indexing operations), and then wrap them in parentheses again so that we can negate the entire expression (i.e. ~( ... )), like so:

    ~((df[['A','B']].duplicated(keep=False)) & (df.isnull().any(axis=1))) & (df['Col_2'] != 5)
    
    # 0     True
    # 1     True
    # 2     True
    # 3    False
    # 4     True
    # 5    False
    # 6     True
    # 7     True
    # 8    False
    

    You can build more complex conditions with further use of the logical operators & and | (the "or" operator). As with SQL, group your conditions as necessary with additional parentheses; for instance, filter based on the logic "both condition X AND condition Y are true, or condition Z is true" with df[ ( (X) & (Y) ) | (Z) ].

    0 讨论(0)
  • 2021-01-02 03:01

    If the goal is to only drop the NaN duplicates, a slightly more involved solution is needed.

    First, sort on A, B, and Col_1, so NaNs are moved to the bottom for each group. Then call df.drop_duplicates with keep=first:

    out = df.sort_values(['A', 'B', 'Col_1']).drop_duplicates(['A', 'B'], keep='first')
    print(out)
    
       A  B Col_1  Col_2
    0  1  a   NaN      2
    1  2  b     A      2
    2  3  c     A      3
    4  4  d     B      3
    6  5  e     B      4
    7  6  f   NaN      4
    8  7  g   NaN      5
    
    0 讨论(0)
  • 2021-01-02 03:06

    Or you can just using first(), by using the first , will give back the first notnull value, so the order of original input does not really matter.

    df.groupby(['A','B']).first()
    
    Out[180]: 
        Col_1  Col_2
    A B             
    1 a   NaN      2
    2 b     A      2
    3 c     A      3
    4 d     B      3
    5 e     B      4
    6 f   NaN      4
    7 g   NaN      5
    
    0 讨论(0)
提交回复
热议问题