Vectorized way to count occurrences of string in either of two columns

Asked by 一整个雨季 on 2021-01-05 03:57

I have a problem that is similar to this question, but just different enough that it can't be solved with the same solution...

I've got two dataframes: df1 with two ID columns, ID_a and ID_b, and df2 with a single ID column. For each ID in df2, I want to count the rows of df1 in which that ID appears in either column.
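For concreteness, here is a minimal reconstruction of the setup (the question's own example data was truncated in this copy; the names and shape follow the benchmark setup in the answers below, so treat the exact frames as hypothetical):

    import pandas as pd

    # Count, for each name in df2.ID, the rows of df1 where it appears
    # in ID_a or ID_b. A row where the name fills both columns should
    # count once.
    df1 = pd.DataFrame({'ID_a': ['jack', 'jill', 'jane', 'jane'],
                        'ID_b': ['jill', 'jack', 'joe',  'jane']})
    df2 = pd.DataFrame({'ID': ['jack', 'jill', 'jane', 'joe']})

    # The straightforward (slow) row-wise approach being replaced:
    df2['count'] = [((df1.ID_a == name) | (df1.ID_b == name)).sum()
                    for name in df2.ID]
    print(df2)
    #      ID  count
    # 0  jack      2
    # 1  jill      2
    # 2  jane      2
    # 3   joe      1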

4 Answers
  • 2021-01-05 04:04

    The "either" part complicates things, but it's still doable. The subtlety, as the small sketch below shows, is that rows where the same name fills both columns must be counted once, not twice.
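
    A sketch of that subtlety (hypothetical one-row example, not from the original answer): naively concatenating the two columns and counting values treats such a row as two occurrences, while the "either column" condition matches it only once.

    import pandas as pd

    df1 = pd.DataFrame({'ID_a': ['jane'], 'ID_b': ['jane']})

    # Naive concat counts the row twice ...
    print(pd.concat([df1.ID_a, df1.ID_b]).value_counts())       # jane  2

    # ... but the row satisfies "jane in either column" only once.
    print(((df1.ID_a == 'jane') | (df1.ID_b == 'jane')).sum())  # 1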


    Option 1
    Since other users decided to turn this into a speed-race, here's mine:

    from collections import Counter
    from itertools import chain

    # set(x) de-duplicates within each row, so a name filling both
    # columns of the same row is counted once
    c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
    df2['count'] = df2['ID'].map(c)   # Counter is dict-like, so map works directly
    df2
    
             ID  count
    0      jack      3
    1      jill      5
    2      jane      8
    3       joe      9
    4       ben      7
    5  beatrice      6
    

    176 µs ± 7.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    Option 2
    (Original answer) stack-based

    c = df1.stack().groupby(level=0).value_counts().count(level=1)  # count(level=...) requires pandas < 2.0
    

    Or,

    c = df1.stack().reset_index(level=0).drop_duplicates()[0].value_counts()
    

    Or,

    v = df1.stack()
    c = v.groupby([v.index.get_level_values(0), v]).count().count(level=1)
    # c = v.groupby([v.index.get_level_values(0), v]).nunique().count(level=1)
    

    And,

    df2['count'] = df2.ID.map(c)
    df2
    
             ID  count
    0      jack      3
    1      jill      5
    2      jane      8
    3       joe      9
    4       ben      7
    5  beatrice      6
    
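    A portability note (not from the original answer): Series.count(level=...) was removed in pandas 2.0, so on newer versions the same stack-based idea needs an explicit groupby. A sketch, assuming pandas >= 2.0:

    # Equivalent on pandas >= 2.0, where count(level=...) no longer exists
    c = (df1.stack()
            .groupby(level=0).value_counts()   # distinct names per row
            .groupby(level=1).count())         # number of rows containing each name
    df2['count'] = df2.ID.map(c)
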

    Option 3
    Repeat-based reshape and counting

    # Flatten both ID columns row by row, tagging each name with its source row
    v = pd.DataFrame({
            'i' : df1.values.reshape(-1, ),
            'j' : df1.index.repeat(2)
        })
    # Drop duplicate (name, row) pairs so a name filling both columns counts once
    c = v.loc[~v.duplicated(), 'i'].value_counts()
    
    df2['count'] = df2.ID.map(c)
    df2
    
             ID  count
    0      jack      3
    1      jill      5
    2      jane      8
    3       joe      9
    4       ben      7
    5  beatrice      6
    

    Option 4
    concat + mask

    v = pd.concat(
        # mask ID_b where it equals ID_a, so such rows are counted once
        [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
    ).value_counts()
    
    df2['count'] = df2.ID.map(v)
    df2
    
             ID  count
    0      jack      3
    1      jill      5
    2      jane      8
    3       joe      9
    4       ben      7
    5  beatrice      6
    
  • 2021-01-05 04:21

    Using get_dummies:

    pd.get_dummies(df1, prefix='', prefix_sep='').sum(level=0,axis=1).gt(0).sum().loc[df2.ID]
    Out[614]: 
    jack        3
    jill        5
    jane        8
    joe         9
    ben         7
    beatrice    6
    dtype: int64
    
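    A portability note (not from the original answer): DataFrame.sum(level=..., axis=1) was removed in pandas 2.0. A sketch of the same idea, assuming pandas >= 2.0:

    # Equivalent on pandas >= 2.0, where sum(level=..., axis=1) is gone:
    # duplicate dummy columns (one per source column) are merged via the
    # transpose, then rows where a name appears at least once are counted
    d = pd.get_dummies(df1, prefix='', prefix_sep='')
    d.T.groupby(level=0).sum().T.gt(0).sum().loc[df2.ID]
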

    I think this should be fast ...

    from itertools import chain
    from collections import Counter
    
    pd.Series(Counter(list(chain(*list(map(set,df1.values)))))).loc[df2.ID]
    
  • 2021-01-05 04:25

    Here's a solution where you effectively do the nested "in" loop by expanding the dimensionality of df2's ID column to take advantage of NumPy broadcasting:

    >>> def count_names(df1, df2):
    ...     names1, names2 = df1.values.T      # the two ID columns as 1-D arrays
    ...     v2 = df2.ID.values[:, None]        # shape (len(df2), 1) for broadcasting
    ...     mask1 = v2 == names1               # (len(df2), len(df1)) boolean matrix
    ...     mask2 = v2 == names2
    ...     df2['count'] = np.logical_or(mask1, mask2).sum(axis=1)
    ...     return df2
    
    
    >>> %timeit -r 5 -n 1000 count_names(df1, df2)
    144 µs ± 10.4 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)
    
    >>> %timeit -r 5 -n 1000 jp(df1, df2)
    224 µs ± 15.5 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)
    
    >>> %timeit -r 5 -n 1000 cs(df1, df2)
    238 µs ± 2.37 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)
    
    >>> %timeit -r 5 -n 1000 wen(df1, df2)
    921 µs ± 15.3 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)
    

    The shape of the masks will be (len(df2), len(df1)): one row per name in df2, one column per row of df1.
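
    A tiny check of that claim (hypothetical two-row data, not from the original answer):

    import pandas as pd

    df1 = pd.DataFrame({'ID_a': ['jack', 'jill'], 'ID_b': ['jill', 'jane']})
    df2 = pd.DataFrame({'ID': ['jack', 'jill', 'jane']})

    names1, names2 = df1.values.T          # each of shape (len(df1),)
    v2 = df2.ID.values[:, None]            # shape (len(df2), 1)
    mask = (v2 == names1) | (v2 == names2)
    print(mask.shape)                      # (3, 2) == (len(df2), len(df1))
    print(mask.sum(axis=1))                # [1 2 1] -> the per-name counts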

  • 2021-01-05 04:28

    Below are a couple of approaches based on NumPy arrays, together with benchmarks of all the answers in this thread.

    Important: take these results with a grain of salt; performance depends on your data, environment and hardware. When choosing an approach, also weigh readability and adaptability.

    Categorical data: the superb performance of jp2 comes from converting the string columns to categoricals (i.e. factorising strings to integers via an internal dictionary-like structure). The gain is data-dependent, but where it applies, the same conversion should benefit all of the algorithms below, in both speed and memory.

    import pandas as pd
    import numpy as np
    from itertools import chain
    from collections import Counter
    
    # Tested on python 3.6.2 / pandas 0.20.3 / numpy 1.13.1
    
    %timeit original(df1, df2)   # 48.4 ms per loop
    %timeit jp1(df1, df2)        # 5.82 ms per loop
    %timeit jp2(df1, df2)        # 2.20 ms per loop
    %timeit brad(df1, df2)       # 7.83 ms per loop
    %timeit cs1(df1, df2)        # 12.5 ms per loop
    %timeit cs2(df1, df2)        # 17.4 ms per loop
    %timeit cs3(df1, df2)        # 15.7 ms per loop
    %timeit cs4(df1, df2)        # 10.7 ms per loop
    %timeit wen1(df1, df2)       # 19.7 ms per loop
    %timeit wen2(df1, df2)       # 32.8 ms per loop
    
    def original(df1, df2):
        for idx,row in df2.iterrows():
            df2.loc[idx, 'count'] = len(df1[(df1.ID_a == row.ID) | (df1.ID_b == row.ID)])
        return df2
    
    def jp1(df1, df2):
        for idx, item in enumerate(df2['ID']):
            df2.iat[idx, 1] = np.sum((df1.ID_a.values == item) | (df1.ID_b.values == item))
        return df2
    
    def jp2(df1, df2):
        df2['ID'] = df2['ID'].astype('category')
        df1['ID_a'] = df1['ID_a'].astype('category')
        df1['ID_b'] = df1['ID_b'].astype('category')
        for idx, item in enumerate(df2['ID']):
            df2.iat[idx, 1] = np.sum((df1.ID_a.values == item) | (df1.ID_b.values == item))
        return df2
    
    def brad(df1, df2):
        names1, names2 = df1.values.T
        v2 = df2.ID.values
        mask1 = v2 == names1[:, None]
        mask2 = v2 == names2[:, None]
        df2['count'] = np.logical_or(mask1, mask2).sum(axis=0)
        return df2
    
    def cs1(df1, df2):
        c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
        df2['count'] = df2['ID'].map(Counter(c))
        return df2
    
    def cs2(df1, df2):
        v = df1.stack().groupby(level=0).value_counts().count(level=1)
        df2['count'] = df2.ID.map(v)
        return df2
    
    def cs3(df1, df2):
        v = pd.DataFrame({
                'i' : df1.values.reshape(-1, ), 
                'j' : df1.index.repeat(2)
            })
        c = v.loc[~v.duplicated(), 'i'].value_counts()
    
        df2['count'] = df2.ID.map(c)
        return df2
    
    def cs4(df1, df2):
        v = pd.concat(
            [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
        ).value_counts()
    
        df2['count'] = df2.ID.map(v)
        return df2
    
    def wen1(df1, df2):
        return pd.get_dummies(df1, prefix='', prefix_sep='').sum(level=0,axis=1).gt(0).sum().loc[df2.ID]
    
    def wen2(df1, df2):
        return pd.Series(Counter(list(chain(*list(map(set,df1.values)))))).loc[df2.ID]
    

    Setup

    import pandas as pd
    import numpy as np
    
    np.random.seed(42)
    
    names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
    
    df1 = pd.DataFrame({'ID_a':np.random.choice(names, 10000), 'ID_b':np.random.choice(names, 10000)})    
    
    df2 = pd.DataFrame({'ID':names})
    
    df2['count'] = 0
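
    Not part of the original answer: a quick sanity check (a sketch under the setup above) that the vectorised variants agree with the original loop. jp2 is skipped because it converts df1's columns to categorical in place, and cs2/wen1 rely on the pre-2.0 count(level=...)/sum(level=...) API.

    expected = original(df1, df2.copy())['count'].to_numpy()

    for fn in (jp1, brad, cs1, cs3, cs4):
        assert (fn(df1, df2.copy())['count'].to_numpy() == expected).all(), fn.__name__

    # wen2 returns a Series indexed by name (in df2.ID order) rather than
    # a 'count' column, so compare its values directly
    assert (wen2(df1, df2).to_numpy() == expected).all()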
    