Vectorized way to count occurrences of string in either of two columns

一整个雨季 2021-01-05 03:57

I have a problem that is similar to this question, but just different enough that it can't be solved with the same solution...

I've got two dataframes,

4 answers
  • 2021-01-05 04:04

    The "either" part complicates things, but should still be doable.


    Option 1
    Since other users decided to turn this into a speed-race, here's mine:

    from collections import Counter
    from itertools import chain
    
    c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
    df2['count'] = df2['ID'].map(c)  # c is already a Counter; missing names map to 0
    df2
    
             ID  count
    0      jack      3
    1      jill      5
    2      jane      8
    3       joe      9
    4       ben      7
    5  beatrice      6
    

    176 µs ± 7.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
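
    To make the role of set() in Option 1 concrete, here is a minimal sketch on a tiny made-up df1 and df2 (the sample rows are an assumption for illustration, not the asker's data). A name that appears in both columns of the same row is counted once for that row.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical sample data, chosen so that row 1 repeats "jill" in both columns.
df1 = pd.DataFrame({
    "ID_a": ["jack", "jill", "jane", "jack"],
    "ID_b": ["jill", "jill", "joe", "ben"],
})
df2 = pd.DataFrame({"ID": ["jack", "jill", "jane", "joe", "ben", "beatrice"]})

# set(row) collapses a name repeated within one row, so each row
# contributes at most one count per name.
counts = Counter(chain.from_iterable(set(row) for row in df1.values.tolist()))

# Counter defines __missing__, so Series.map falls back to 0 (not NaN)
# for names that never occur in df1, such as "beatrice" here.
df2["count"] = df2["ID"].map(counts)
print(df2)
```

    With these rows, "jill" gets 2 rather than 3, because the duplicate inside row 1 is collapsed before counting.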
    

    Option 2
    (Original answer) stack based

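    # Originally written with .count(level=1); the level= argument was removed in pandas 2.0.
    c = df1.stack().groupby(level=0).value_counts().groupby(level=1).count()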
    

    Or,

    c = df1.stack().reset_index(level=0).drop_duplicates()[0].value_counts()
    

    Or,

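    v = df1.stack()
    # groupby(level=1).count() replaces the .count(level=1) call removed in pandas 2.0
    c = v.groupby([v.index.get_level_values(0), v]).count().groupby(level=1).count()
    # c = v.groupby([v.index.get_level_values(0), v]).nunique().groupby(level=1).count()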
    

    And,

    df2['count'] = df2.ID.map(c)
    df2
    
             ID  count
    0      jack      3
    1      jill      5
    2      jane      8
    3       joe      9
    4       ben      7
    5  beatrice      6
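
    The stack-based variants all follow the same idea: turn df1 into one long list of (row, name) pairs, drop repeated pairs, then count names. A step-by-step sketch, again on made-up rows, with the index and column names "row", "col" and "name" introduced only for readability:

```python
import pandas as pd

# Hypothetical sample data; row 1 has "jill" in both columns.
df1 = pd.DataFrame({
    "ID_a": ["jack", "jill", "jane", "jack"],
    "ID_b": ["jill", "jill", "joe", "ben"],
})

# stack() yields a Series indexed by (row label, column name).
stacked = df1.stack()

# Name the pieces, flatten to a frame, and keep one entry per (row, name).
pairs = (
    stacked.rename("name")
    .rename_axis(["row", "col"])
    .reset_index()
    .drop_duplicates(subset=["row", "name"])
)

# Each remaining entry is one row in which the name occurs at least once.
counts = pairs["name"].value_counts()
print(counts)
```

    The result can be mapped onto df2.ID exactly as in the snippet above.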
    

    Option 3
    repeat-based Reshape and counting

    v = pd.DataFrame({
            'i' : df1.values.reshape(-1, ), 
            'j' : df1.index.repeat(2)
        })
    c = v.loc[~v.duplicated(), 'i'].value_counts()
    
    df2['count'] = df2.ID.map(c)
    df2
    
             ID  count
    0      jack      3
    1      jill      5
    2      jane      8
    3       joe      9
    4       ben      7
    5  beatrice      6
    

    Option 4
    concat + mask

    v = pd.concat(
        [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
    ).value_counts()
    
    df2['count'] = df2.ID.map(v)
    df2
    
             ID  count
    0      jack      3
    1      jill      5
    2      jane      8
    3       joe      9
    4       ben      7
    5  beatrice      6
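
    For Option 4, the mask() call is what prevents double counting: wherever ID_b repeats ID_a in the same row it is replaced by NaN, and value_counts() ignores NaN by default. A small sketch on made-up rows; the fillna(0) step is an addition for names that never occur in df1, which the original did not need:

```python
import pandas as pd

# Hypothetical sample data; row 1 has "jill" in both columns.
df1 = pd.DataFrame({
    "ID_a": ["jack", "jill", "jane", "jack"],
    "ID_b": ["jill", "jill", "joe", "ben"],
})
df2 = pd.DataFrame({"ID": ["jack", "jill", "jane", "joe", "ben", "beatrice"]})

# NaN out ID_b where it duplicates ID_a within the same row.
id_b_unique = df1["ID_b"].mask(df1["ID_a"] == df1["ID_b"])

# value_counts() skips NaN, so the masked duplicates are not counted.
counts = pd.concat([df1["ID_a"], id_b_unique], axis=0).value_counts()

# Mapping with a Series gives NaN for unseen names, hence the fillna (an addition).
df2["count"] = df2["ID"].map(counts).fillna(0).astype(int)
print(df2)
```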
    
  • 2021-01-05 04:21

    By using get_dummies

    # Originally .sum(level=0, axis=1); the level= argument was removed in pandas 2.0.
    pd.get_dummies(df1, prefix='', prefix_sep='').T.groupby(level=0).sum().gt(0).sum(axis=1).loc[df2.ID]
    Out[614]: 
    jack        3
    jill        5
    jane        8
    joe         9
    ben         7
    beatrice    6
    dtype: int64
    

    I think this should be fast ...

    from itertools import chain
    from collections import Counter
    
    pd.Series(Counter(chain.from_iterable(map(set, df1.values)))).loc[df2.ID]
    
  • 2021-01-05 04:25

    Here's a solution where you effectively do the nested "in" loop by expanding dimensionality of ID from df2 to take advantage of NumPy broadcasting:

    >>> import numpy as np
    >>> def count_names(df1, df2):
    ...     names1, names2 = df1.values.T
    ...     v2 = df2.ID.values[:, None]      # column vector, shape (len(df2), 1)
    ...     mask1 = v2 == names1             # broadcasts to (len(df2), len(df1))
    ...     mask2 = v2 == names2
    ...     df2['count'] = np.logical_or(mask1, mask2).sum(axis=1)
    ...     return df2
    
    
    >>> %timeit -r 5 -n 1000 count_names(df1, df2)
    144 µs ± 10.4 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)
    
    >>> %timeit -r 5 -n 1000 jp(df1, df2)
    224 µs ± 15.5 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)
    
    >>> %timeit -r 5 -n 1000 cs(df1, df2)
    238 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    >>> %timeit -r 5 -n 1000 wen(df1, df2)
    921 µs ± 15.3 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)
    

    The shape of the masks will be (len(df2), len(df1)): one row per ID in df2 and one column per row of df1, which is why the sum is taken along axis=1.
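
    The broadcasting step can be checked in isolation with plain arrays. In this sketch the three arrays are made-up stand-ins for df1.ID_a, df1.ID_b and df2.ID; the mask has one row per ID and one column per row of df1.

```python
import numpy as np

# Hypothetical stand-ins for df1.ID_a, df1.ID_b and df2.ID (object dtype, as pandas stores strings).
names1 = np.array(["jack", "jill", "jane", "jack"], dtype=object)
names2 = np.array(["jill", "jill", "joe", "ben"], dtype=object)
ids = np.array(["jack", "jill", "beatrice"], dtype=object)

# Adding a trailing axis turns ids into a column vector of shape (3, 1).
v2 = ids[:, None]

# Comparing (3, 1) against (4,) broadcasts to a (3, 4) boolean matrix.
mask = (v2 == names1) | (v2 == names2)

# Summing across columns counts, per ID, the rows where it appears in either column.
counts = mask.sum(axis=1)
print(mask.shape, counts)
```

    Memory grows with len(df2) * len(df1), so this trades space for speed when both frames are large.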

  • 2021-01-05 04:28

    Below are a couple of approaches based on NumPy arrays, followed by benchmarks of all the solutions posted so far.

    Important: Take these results with a grain of salt. Performance depends on your data, environment and hardware. When choosing a solution, also weigh readability and adaptability.

    Categorical data: The superb performance with categorical data in jp2 (i.e. factorising strings to integers via an internal dictionary-like structure) is data-dependent, but if it works it should be applicable across all the below algorithms with good performance and memory benefits.
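
    As a rough illustration of what the categorical conversion does, the sketch below (on a made-up Series) shows the integer codes pandas stores in place of the strings. Comparisons against a scalar then operate on those codes.

```python
import pandas as pd

# Hypothetical sample values.
s = pd.Series(["jack", "jill", "jack", "ben"])
cat = s.astype("category")

# The distinct strings are stored once, sorted, as the categories.
print(list(cat.cat.categories))   # ['ben', 'jack', 'jill']

# Each element is stored as a small integer index into the categories.
print(list(cat.cat.codes))        # [1, 2, 1, 0]

# Equality against a scalar still returns a boolean Series.
n_jack = int((cat == "jack").sum())
print(n_jack)                     # 2
```

    The benefit depends on how few distinct values there are relative to the number of rows, which is what makes the jp2 result data-dependent.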

    import pandas as pd
    import numpy as np
    from itertools import chain
    from collections import Counter
    
    # Tested on python 3.6.2 / pandas 0.20.3 / numpy 1.13.1
    
    %timeit original(df1, df2)   # 48.4 ms per loop
    %timeit jp1(df1, df2)        # 5.82 ms per loop
    %timeit jp2(df1, df2)        # 2.20 ms per loop
    %timeit brad(df1, df2)       # 7.83 ms per loop
    %timeit cs1(df1, df2)        # 12.5 ms per loop
    %timeit cs2(df1, df2)        # 17.4 ms per loop
    %timeit cs3(df1, df2)        # 15.7 ms per loop
    %timeit cs4(df1, df2)        # 10.7 ms per loop
    %timeit wen1(df1, df2)       # 19.7 ms per loop
    %timeit wen2(df1, df2)       # 32.8 ms per loop
    
    def original(df1, df2):
        for idx,row in df2.iterrows():
            df2.loc[idx, 'count'] = len(df1[(df1.ID_a == row.ID) | (df1.ID_b == row.ID)])
        return df2
    
    def jp1(df1, df2):
        for idx, item in enumerate(df2['ID']):
            df2.iat[idx, 1] = np.sum((df1.ID_a.values == item) | (df1.ID_b.values == item))
        return df2
    
    def jp2(df1, df2):
        df2['ID'] = df2['ID'].astype('category')
        df1['ID_a'] = df1['ID_a'].astype('category')
        df1['ID_b'] = df1['ID_b'].astype('category')
        for idx, item in enumerate(df2['ID']):
            df2.iat[idx, 1] = np.sum((df1.ID_a.values == item) | (df1.ID_b.values == item))
        return df2
    
    def brad(df1, df2):
        names1, names2 = df1.values.T
        v2 = df2.ID.values
        mask1 = v2 == names1[:, None]
        mask2 = v2 == names2[:, None]
        df2['count'] = np.logical_or(mask1, mask2).sum(axis=0)
        return df2
    
    def cs1(df1, df2):
        c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
        df2['count'] = df2['ID'].map(c)
        return df2
    
    def cs2(df1, df2):
        # groupby(level=1).count() replaces .count(level=1), removed in pandas 2.0
        v = df1.stack().groupby(level=0).value_counts().groupby(level=1).count()
        df2['count'] = df2.ID.map(v)
        return df2
    
    def cs3(df1, df2):
        v = pd.DataFrame({
                'i' : df1.values.reshape(-1, ), 
                'j' : df1.index.repeat(2)
            })
        c = v.loc[~v.duplicated(), 'i'].value_counts()
    
        df2['count'] = df2.ID.map(c)
        return df2
    
    def cs4(df1, df2):
        v = pd.concat(
            [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
        ).value_counts()
    
        df2['count'] = df2.ID.map(v)
        return df2
    
    def wen1(df1, df2):
        # .T.groupby(level=0).sum() replaces .sum(level=0, axis=1), removed in pandas 2.0
        return pd.get_dummies(df1, prefix='', prefix_sep='').T.groupby(level=0).sum().gt(0).sum(axis=1).loc[df2.ID]
    
    def wen2(df1, df2):
        return pd.Series(Counter(chain.from_iterable(map(set, df1.values)))).loc[df2.ID]
    

    Setup

    import pandas as pd
    import numpy as np
    
    np.random.seed(42)
    
    names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
    
    df1 = pd.DataFrame({'ID_a':np.random.choice(names, 10000), 'ID_b':np.random.choice(names, 10000)})    
    
    df2 = pd.DataFrame({'ID':names})
    
    df2['count'] = 0
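
    Before comparing timings it is worth confirming that the approaches agree. A minimal sanity check, assuming the same seed and names as the setup but a smaller df1 of 1000 rows (the size is an arbitrary choice for this sketch):

```python
from collections import Counter
from itertools import chain

import numpy as np
import pandas as pd

np.random.seed(42)
names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']

# Smaller than the benchmark frame; 1000 rows is an assumption for this check only.
df1 = pd.DataFrame({'ID_a': np.random.choice(names, 1000),
                    'ID_b': np.random.choice(names, 1000)})

# Reference answer: one boolean mask per name, as in the loop-based versions.
expected = [int(((df1.ID_a == n) | (df1.ID_b == n)).sum()) for n in names]

# Counter-based answer: each row contributes each distinct name once.
c = Counter(chain.from_iterable(set(row) for row in df1.values.tolist()))
got = [c[n] for n in names]

assert got == expected
print(dict(zip(names, got)))
```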
    