Python - Delete duplicates in a dataframe based on combinations of two columns?

名媛妹妹 2020-11-29 10:53

I have a dataframe with 3 columns in Python:

Name1 Name2 Value
Juan  Ale   1
Ale   Juan  1

and would like to eliminate the duplicates based on the combination of the two name columns, regardless of their order.

3 Answers
  • 2020-11-29 11:38

    You can convert each row to a frozenset and use pd.Series.duplicated on the result.

    res = df[~df[['Name1', 'Name2']].apply(frozenset, axis=1).duplicated()]
    
    print(res)
    
      Name1 Name2  Value
    0  Juan   Ale      1
    

    frozenset is necessary instead of set because duplicated uses hashing to check for duplicates, and a plain set is not hashable.

    This approach scales better with the number of columns than with the number of rows. For a large number of rows, use @Wen's sort-based algorithm.
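
    As a minimal, self-contained sketch of the same idea (the two-row frame is rebuilt from the question; the variable name keys is only illustrative), the intermediate Series of frozensets can be inspected before it is passed to duplicated:

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame(
    {"Name1": ["Juan", "Ale"], "Name2": ["Ale", "Juan"], "Value": [1, 1]}
)

# One order-insensitive, hashable key per row.
keys = df[["Name1", "Name2"]].apply(frozenset, axis=1)

# Both rows produce the same frozenset, so the second one is flagged.
is_dup = keys.duplicated()

# Keep only the first occurrence of each unordered pair.
res = df[~is_dup]
print(res)
```

    Because a frozenset ignores element order, ("Juan", "Ale") and ("Ale", "Juan") map to the same key, which is what lets duplicated treat the two rows as equal.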

  • 2020-11-29 11:45

    I know I'm kind of late to this question, but here is my contribution anyway :)

    You can also combine get_dummies with add to build an order-insensitive encoding of each row that duplicated can compare:

    df[~(pd.get_dummies(df.a).add(pd.get_dummies(df.b), fill_value=0)).duplicated()]
    

    The timings are not as good as @Wen's answer, but it is still far faster than apply + frozenset:

    df=pd.concat([df]*1000000)
    %timeit df[~(pd.get_dummies(df.a).add(pd.get_dummies(df.b), fill_value=0)).duplicated()]
    1.8 s ± 85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %timeit df[pd.DataFrame(np.sort(df[['a','b']].values,1)).duplicated()]
    1.26 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %timeit df[~df[['a', 'b']].apply(frozenset, axis=1).duplicated()]
    1min 9s ± 684 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
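
    To make the get_dummies approach above easier to follow, here is a small sketch on made-up data (the third row and the name "Bob" are hypothetical additions, not from the question). It assumes dtype=int so that the indicator columns add up as counts rather than booleans:

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 hold the same pair in swapped order.
df = pd.DataFrame(
    {"a": ["Juan", "Ale", "Juan"], "b": ["Ale", "Juan", "Bob"], "Value": [1, 1, 2]}
)

# One indicator column per distinct name in each of the two columns.
left = pd.get_dummies(df["a"], dtype=int)
right = pd.get_dummies(df["b"], dtype=int)

# Summing gives, per row, how often each name occurs across a and b.
# fill_value=0 covers names that appear in only one of the two columns.
counts = left.add(right, fill_value=0)

# Rows with identical name counts are the same unordered pair.
res = df[~counts.duplicated()]
print(res)
```

    The count table has one column per distinct name, so memory use grows with the number of distinct names as well as with the number of rows.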
    
  • 2020-11-29 11:50

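    By using np.sort with duplicated. Note that the mask as written selects the duplicated rows; negate it with ~ to keep only the first occurrence of each pair.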

    df[pd.DataFrame(np.sort(df[['Name1','Name2']].values,1)).duplicated()]
    Out[614]: 
      Name1 Name2  Value
    1   Ale  Juan      1
    

    Performance

    df=pd.concat([df]*100000)
    
    %timeit df[pd.DataFrame(np.sort(df[['Name1','Name2']].values,1)).duplicated()]
    10 loops, best of 3: 69.3 ms per loop
    %timeit df[~df[['Name1', 'Name2']].apply(frozenset, axis=1).duplicated()]
    1 loop, best of 3: 3.72 s per loop
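
    For reference, a sketch of the sort-based approach written so that it drops the duplicates rather than selecting them. The variable names are illustrative. It assumes the mask is passed as a NumPy array, which makes the selection positional; that matters once pd.concat has produced repeated index labels, as in the benchmark above:

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question, repeated so index labels recur.
df = pd.DataFrame(
    {"Name1": ["Juan", "Ale"], "Name2": ["Ale", "Juan"], "Value": [1, 1]}
)
df = pd.concat([df] * 3)

# Sort the two names within each row so swapped pairs become identical.
sorted_pairs = np.sort(df[["Name1", "Name2"]].to_numpy(), axis=1)

# Flag every row whose sorted pair has already been seen.
is_dup = pd.DataFrame(sorted_pairs).duplicated().to_numpy()

# Boolean NumPy mask: rows are selected by position, not by index label.
res = df[~is_dup]
print(res)
```

    Sorting within each row is vectorized in NumPy, which is consistent with the timings above being much lower than for the row-wise apply.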
    