I have a problem that is similar to this question, but just different enough that it can't be solved with the same solution...
I've got two dataframes,
The "either" part complicates things, but should still be doable.
Option 1
Since other users decided to turn this into a speed-race, here's mine:
from collections import Counter
from itertools import chain
c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
df2['count'] = df2['ID'].map(Counter(c))
df2
         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6
176 µs ± 7.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
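As a rough illustration of why each row is passed through set(...) before counting, here is a minimal sketch on a small made-up df1 (the three rows and the extra name "joe" are invented for the example, not taken from the question). A row whose two columns hold the same name contributes only once to that name's count.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical toy data, not the data from the question.
df1 = pd.DataFrame({
    "ID_a": ["jack", "jill", "jane"],
    "ID_b": ["jill", "jill", "jack"],
})
df2 = pd.DataFrame({"ID": ["jack", "jill", "jane", "joe"]})

# set(row) collapses a row such as ["jill", "jill"] to a single entry,
# so each name is counted at most once per row.
c = Counter(chain.from_iterable(set(row) for row in df1.values.tolist()))

# Counter defines __missing__, so names that never occur in df1 map to 0
# instead of NaN.
df2["count"] = df2["ID"].map(c)
print(df2)
```

Without the set(...) step, the second row would add 2 to the count for "jill" rather than 1.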
Option 2
(Original answer) stack-based
c = df1.stack().groupby(level=0).value_counts().count(level=1)
Or,
c = df1.stack().reset_index(level=0).drop_duplicates()[0].value_counts()
Or,
v = df1.stack()
c = v.groupby([v.index.get_level_values(0), v]).count().count(level=1)
# c = v.groupby([v.index.get_level_values(0), v]).nunique().count(level=1)
And,
df2['count'] = df2.ID.map(c)
df2
         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6
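Note that Series.count(level=...) was deprecated in pandas 1.3 and removed in pandas 2.0, so the variants that end in .count(level=1) only run on older versions. One possible equivalent for newer pandas, sketched here on made-up data (the toy frames are assumptions for illustration), swaps that call for a second groupby:

```python
import pandas as pd

# Hypothetical toy data, not the data from the question.
df1 = pd.DataFrame({
    "ID_a": ["jack", "jill", "jane"],
    "ID_b": ["jill", "jill", "jack"],
})
df2 = pd.DataFrame({"ID": ["jack", "jill", "jane", "joe"]})

# One entry per (row, name) pair; a name repeated within a row appears once
# here, with a value of 2.
per_row = df1.stack().groupby(level=0).value_counts()

# Number of distinct rows in which each name appears.
c = per_row.groupby(level=1).size()

# Names missing from df1 come back as NaN from map(), so fill them with 0.
df2["count"] = df2["ID"].map(c).fillna(0).astype(int)
print(df2)
```

The drop_duplicates variant above does not use the level argument and should not need this change.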
Option 3
repeat-based reshape and counting
v = pd.DataFrame({
    'i' : df1.values.reshape(-1, ),
    'j' : df1.index.repeat(2)
})
c = v.loc[~v.duplicated(), 'i'].value_counts()
df2['count'] = df2.ID.map(c)
df2
         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6
Option 4
concat + mask
v = pd.concat(
    [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
).value_counts()
df2['count'] = df2.ID.map(v)
df2
         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6
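To make the role of mask in Option 4 more concrete, here is a small sketch on invented data (the frames below are assumptions for illustration). The idea is that ID_b is blanked out wherever it repeats ID_a, and value_counts then ignores those blanks.

```python
import pandas as pd

# Hypothetical toy data, not the data from the question.
df1 = pd.DataFrame({
    "ID_a": ["jack", "jill", "jane"],
    "ID_b": ["jill", "jill", "jack"],
})
df2 = pd.DataFrame({"ID": ["jack", "jill", "jane", "joe"]})

# mask() replaces values with NaN where the condition is True, i.e. where
# the row holds the same name in both columns.
b_unique = df1["ID_b"].mask(df1["ID_a"] == df1["ID_b"])

# value_counts() drops NaN by default, so such rows are counted only once.
counts = pd.concat([df1["ID_a"], b_unique], axis=0).value_counts()

# Names missing from df1 come back as NaN from map(), so fill them with 0.
df2["count"] = df2["ID"].map(counts).fillna(0).astype(int)
print(df2)
```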
By using get_dummies
pd.get_dummies(df1, prefix='', prefix_sep='').sum(level=0,axis=1).gt(0).sum().loc[df2.ID]
Out[614]:
jack        3
jill        5
jane        8
joe         9
ben         7
beatrice    6
dtype: int64
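DataFrame.sum(level=...) was removed in pandas 2.0, so the one-liner above needs an older pandas. A possible adaptation for newer versions, sketched on made-up data, transposes the dummy frame and groups on the index instead; dtype=int and reindex are choices made for this sketch, not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data, not the data from the question.
df1 = pd.DataFrame({
    "ID_a": ["jack", "jill", "jane"],
    "ID_b": ["jill", "jill", "jack"],
})
df2 = pd.DataFrame({"ID": ["jack", "jill", "jane", "joe"]})

# With an empty prefix, the dummy columns from ID_a and ID_b share labels.
dummies = pd.get_dummies(df1, prefix="", prefix_sep="", dtype=int)

# Transpose so the duplicate labels sit on the index, add them up per label,
# then transpose back to one row per df1 row and one column per name.
per_row = dummies.T.groupby(level=0).sum().T

# gt(0) flags rows where the name occurs in either column; reindex fills
# names that never occur with 0 instead of raising a KeyError.
counts = per_row.gt(0).sum().reindex(df2["ID"], fill_value=0)
print(counts)
```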
I think this should be fast ...
from itertools import chain
from collections import Counter
pd.Series(Counter(list(chain(*list(map(set,df1.values)))))).loc[df2.ID]
Here's a solution where you effectively do the nested "in" loop by expanding dimensionality of ID from df2 to take advantage of NumPy broadcasting:
>>> def count_names(df1, df2):
...     names1, names2 = df1.values.T
...     v2 = df2.ID.values[:, None]
...     mask1 = v2 == names1
...     mask2 = v2 == names2
...     df2['count'] = np.logical_or(mask1, mask2).sum(axis=1)
...     return df2
>>> %timeit -r 5 -n 1000 count_names(df1, df2)
144 µs ± 10.4 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)
>>> %timeit -r 5 -n 1000 jp(df1, df2)
224 µs ± 15.5 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)
>>> %timeit -r 5 -n 1000 cs(df1, df2)
238 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit -r 5 -n 1000 wen(df1, df2)
921 µs ± 15.3 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)
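In count_names, each mask has shape (len(df2), len(df1)): v2 is a column of length len(df2) broadcast against an array of length len(df1), and summing over axis=1 gives one count per row of df2.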
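The orientation of the masks depends on which array receives the extra axis. The sketch below uses small invented arrays in place of the dataframe columns (the values are assumptions for illustration) and shows both layouts.

```python
import numpy as np

# Hypothetical stand-ins for df1.ID_a, df1.ID_b and df2.ID. Object dtype
# mirrors what .values returns for string columns.
names1 = np.array(["jack", "jill", "jane"], dtype=object)
names2 = np.array(["jill", "jill", "jack"], dtype=object)
ids = np.array(["jack", "jill", "jane", "joe"], dtype=object)

# Shape (len(ids), 1) against (len(names1),) broadcasts to
# (len(ids), len(names1)); reduce over axis=1 for one count per ID.
mask = np.logical_or(ids[:, None] == names1, ids[:, None] == names2)
counts = mask.sum(axis=1)
print(mask.shape, counts)

# Putting the extra axis on the names instead transposes the mask to
# (len(names1), len(ids)), so the reduction moves to axis=0.
mask_t = np.logical_or(ids == names1[:, None], ids == names2[:, None])
counts_t = mask_t.sum(axis=0)
print(mask_t.shape, counts_t)
```

Either way the intermediate mask holds len(df1) * len(df2) booleans, which is worth keeping in mind when df2 has many distinct IDs.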
Below are a couple of ways based on numpy arrays. Benchmarking below.
Important: Take these results with a grain of salt. Remember, performance is dependent on your data, environment and hardware. In your choice, you should also consider readability / adaptability.
Categorical data: The superb performance with categorical data in jp2 (i.e. factorising strings to integers via an internal dictionary-like structure) is data-dependent, but if it works it should be applicable across all the below algorithms with good performance and memory benefits.
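For readers unfamiliar with categoricals, here is a minimal sketch of what the conversion does, using an invented Series (the values are assumptions for illustration). Each distinct string is stored once and the column itself holds small integer codes.

```python
import pandas as pd

# Hypothetical toy column, not the data from the question.
s = pd.Series(["jack", "jill", "jack", "jane"])
cat = s.astype("category")

# The distinct labels, stored once (sorted for string data).
print(cat.cat.categories.tolist())

# The integer code of each element; these are what gets compared internally.
print(cat.cat.codes.tolist())

# Comparisons against a label still read the same as for plain strings.
n_jack = int((cat == "jack").sum())
print(n_jack)
```

The benefit depends on how few distinct values there are relative to the column length, which is why the speed-up is described as data-dependent.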
import pandas as pd
import numpy as np
from itertools import chain
from collections import Counter
# Tested on python 3.6.2 / pandas 0.20.3 / numpy 1.13.1
%timeit original(df1, df2) # 48.4 ms per loop
%timeit jp1(df1, df2) # 5.82 ms per loop
%timeit jp2(df1, df2) # 2.20 ms per loop
%timeit brad(df1, df2) # 7.83 ms per loop
%timeit cs1(df1, df2) # 12.5 ms per loop
%timeit cs2(df1, df2) # 17.4 ms per loop
%timeit cs3(df1, df2) # 15.7 ms per loop
%timeit cs4(df1, df2) # 10.7 ms per loop
%timeit wen1(df1, df2) # 19.7 ms per loop
%timeit wen2(df1, df2) # 32.8 ms per loop
def original(df1, df2):
    # Baseline: loop over df2 row by row, filtering df1 each time.
    for idx, row in df2.iterrows():
        df2.loc[idx, 'count'] = len(df1[(df1.ID_a == row.ID) | (df1.ID_b == row.ID)])
    return df2
def jp1(df1, df2):
    # Compare against the underlying arrays; relies on 'count' being column 1 of df2.
    for idx, item in enumerate(df2['ID']):
        df2.iat[idx, 1] = np.sum((df1.ID_a.values == item) | (df1.ID_b.values == item))
    return df2
def jp2(df1, df2):
    # Same loop as jp1, after converting the ID columns to categoricals (modifies df1 and df2 in place).
    df2['ID'] = df2['ID'].astype('category')
    df1['ID_a'] = df1['ID_a'].astype('category')
    df1['ID_b'] = df1['ID_b'].astype('category')
    for idx, item in enumerate(df2['ID']):
        df2.iat[idx, 1] = np.sum((df1.ID_a.values == item) | (df1.ID_b.values == item))
    return df2
def brad(df1, df2):
    names1, names2 = df1.values.T
    v2 = df2.ID.values
    mask1 = v2 == names1[:, None]
    mask2 = v2 == names2[:, None]
    df2['count'] = np.logical_or(mask1, mask2).sum(axis=0)
    return df2
def cs1(df1, df2):
    c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
    df2['count'] = df2['ID'].map(Counter(c))
    return df2
def cs2(df1, df2):
    v = df1.stack().groupby(level=0).value_counts().count(level=1)
    df2['count'] = df2.ID.map(v)
    return df2
def cs3(df1, df2):
    v = pd.DataFrame({
        'i' : df1.values.reshape(-1, ),
        'j' : df1.index.repeat(2)
    })
    c = v.loc[~v.duplicated(), 'i'].value_counts()
    df2['count'] = df2.ID.map(c)
    return df2
def cs4(df1, df2):
    v = pd.concat(
        [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
    ).value_counts()
    df2['count'] = df2.ID.map(v)
    return df2
def wen1(df1, df2):
    return pd.get_dummies(df1, prefix='', prefix_sep='').sum(level=0,axis=1).gt(0).sum().loc[df2.ID]
def wen2(df1, df2):
    return pd.Series(Counter(list(chain(*list(map(set,df1.values)))))).loc[df2.ID]
Setup
import pandas as pd
import numpy as np
np.random.seed(42)
names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
df1 = pd.DataFrame({'ID_a':np.random.choice(names, 10000), 'ID_b':np.random.choice(names, 10000)})
df2 = pd.DataFrame({'ID':names})
df2['count'] = 0