Set Union in pandas

心在旅途 2020-12-19 06:59

I have two columns in my dataframe that store sets.

I want to compute the row-wise set union of the two columns using a fast, vectorized operation:

df['union']


        
2 Answers
  • 2020-12-19 07:45

    This is the best I could come up with:

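    # method 1: row-wise union (set1 and set2 are the two set columns)
    df.apply(lambda x: x.set1.union(x.set2), axis=1)
    
    # method 2: turn every set into a list, concatenate the lists along
    # each row, then convert the result back to a set
    # (DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
    df.applymap(list).sum(1).apply(set)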
    

    Wow!

    I expected method 2 to be quicker. Not so!

    Example

    import pandas as pd

    # three identical rows, with one set in column A and one in column B
    df = pd.DataFrame([[{1, 2, 3}, {3, 4, 5}] for _ in range(3)],
                      columns=list('AB'))
    df
    

    # the example columns are named A and B
    df.apply(lambda x: x.A.union(x.B), axis=1)
    
    0    {1, 2, 3, 4, 5}
    1    {1, 2, 3, 4, 5}
    2    {1, 2, 3, 4, 5}
    
  • 2020-12-19 08:02

    For operations like this, plain Python may be more efficient than a row-wise apply:

    %timeit pd.Series([set1.union(set2) for set1, set2 in zip(df['A'], df['B'])])
    10 loops, best of 3: 43.3 ms per loop
    
    %timeit df.apply(lambda x: x.A.union(x.B), axis=1)
    1 loop, best of 3: 2.6 s per loop
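
    To store the comprehension result as a column, as the question asks, something like the following minimal sketch may work. The data and the non-default index are made up for illustration. One detail worth noting: a bare pd.Series(...) gets a fresh 0..n-1 index, so it may not align with df on assignment unless df.index is passed along.

```python
import pandas as pd

# Hypothetical data with a non-default index, to show the alignment detail.
df = pd.DataFrame(
    {"A": [{1, 2}, {3}, {4, 5}], "B": [{2, 3}, {3}, {6}]},
    index=["x", "y", "z"],
)

# Element-wise union in plain Python. A list is assigned by position,
# so the index of df does not matter here.
df["union"] = [a | b for a, b in zip(df["A"], df["B"])]

# If a Series is needed instead, reuse df.index so it aligns with df.
union = pd.Series(
    [a.union(b) for a, b in zip(df["A"], df["B"])], index=df.index
)
```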
    

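    If sets supported +, the union would probably take about half the time, as the - operator does for set difference in the comparison below (subclassing set just to add + may not be worth it):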

    %timeit df['A'] - df['B']
    10 loops, best of 3: 22.1 ms per loop
    
    %timeit pd.Series([set1.difference(set2) for set1, set2 in zip(df['A'], df['B'])])
    10 loops, best of 3: 35.7 ms per loop
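
    The remark about + and inheritance can be made concrete with a small sketch. AddableSet is a hypothetical subclass invented here, not part of pandas or the standard library, and the example assumes pandas applies + element by element on object columns, the same way the - timing above does. Whether the speedup justifies wrapping every value in a custom class is a separate question.

```python
import pandas as pd


class AddableSet(set):
    """Hypothetical set subclass in which + means union."""

    def __add__(self, other):
        return AddableSet(self | other)


df = pd.DataFrame({
    "A": [AddableSet({1, 2}), AddableSet({3})],
    "B": [AddableSet({2, 3}), AddableSet({4})],
})

# On object columns, + is applied to each pair of elements,
# which here ends up calling AddableSet.__add__.
result = df["A"] + df["B"]
```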
    

    DataFrame for timings:

    import pandas as pd
    import numpy as np
    l1 = [set(np.random.choice(list('abcdefg'), np.random.randint(1, 5))) for _ in range(100000)]
    l2 = [set(np.random.choice(list('abcdefg'), np.random.randint(1, 5))) for _ in range(100000)]
    
    df = pd.DataFrame({'A': l1, 'B': l2})
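
    The %timeit lines above need IPython. Outside IPython, a comparable measurement could be sketched with the standard timeit module. The row count here is an assumption, much smaller than the 100,000 rows above so that the script finishes quickly, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the data is reproducible
letters = list("abcdefg")
n = 2_000  # assumption: far fewer rows than the 100,000 used above


def random_sets(size):
    # each set is built from 1 to 4 random draws of the letters a..g
    return [set(rng.choice(letters, rng.integers(1, 5))) for _ in range(size)]


df = pd.DataFrame({"A": random_sets(n), "B": random_sets(n)})


def with_comprehension():
    return pd.Series([a.union(b) for a, b in zip(df["A"], df["B"])])


def with_apply():
    return df.apply(lambda row: row["A"].union(row["B"]), axis=1)


# total time for 3 calls of each variant
t_comp = timeit.timeit(with_comprehension, number=3)
t_apply = timeit.timeit(with_apply, number=3)
print(f"comprehension: {t_comp:.4f} s, apply: {t_apply:.4f} s")
```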
    