Vectorized way to count occurrences of string in either of two columns

一整个雨季 2021-01-05 03:57

I have a problem that is similar to this question, but just different enough that it can't be solved with the same solution...

I've got two dataframes,

4 answers

有刺的猬 (original poster)

2021-01-05 04:25

Here's a solution that replaces the nested "in" loop with a single comparison: add a trailing axis to the ID column of df2 so that NumPy broadcasting checks every ID against every name in df1:

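>>> import numpy as np
>>> def count_names(df1, df2):
...     # df1 must have exactly two name columns; transposing gives one 1-D array per column
...     names1, names2 = df1.values.T
...     # reshape the IDs to (len(df2), 1) so they broadcast against the names
...     v2 = df2.ID.values[:, None]
...     mask1 = v2 == names1
...     mask2 = v2 == names2
...     # a row of df1 counts once if the ID appears in either column
...     df2['count'] = np.logical_or(mask1, mask2).sum(axis=1)
...     return df2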


>>> %timeit -r 5 -n 1000 count_names(df1, df2)
144 µs ± 10.4 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)

>>> %timeit -r 5 -n 1000 jp(df1, df2)
224 µs ± 15.5 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)

>>> %timeit -r 5 -n 1000 cs(df1, df2)
238 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit -r 5 -n 1000 wen(df1, df2)
921 µs ± 15.3 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)

Each mask has shape (len(df2), len(df1)): one row per ID in df2 and one column per row of df1, so summing over axis=1 yields one count per ID.
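To see the broadcasting at work on something small, here is a minimal sketch. The sample data and the column names name1 and name2 are made up for illustration (only the ID column name comes from the code above), and the two masks are combined with | instead of np.logical_or, which should be equivalent for boolean arrays:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: name1/name2 are assumed column names.
df1 = pd.DataFrame({"name1": ["a", "b", "c"], "name2": ["b", "d", "a"]})
df2 = pd.DataFrame({"ID": ["a", "b", "e", "d"]})

names1, names2 = df1.to_numpy().T        # two 1-D arrays of length len(df1)
ids = df2["ID"].to_numpy()[:, None]      # shape (len(df2), 1)

# Broadcasting (4, 1) against (3,) gives a (4, 3) boolean mask per column.
mask = (ids == names1) | (ids == names2)
counts = mask.sum(axis=1)                # one count per ID in df2

print(mask.shape)       # (4, 3)
print(counts.tolist())  # [2, 2, 0, 1]
```

Note that the mask holds len(df2) * len(df1) booleans, so this approach trades memory for speed as the frames grow.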
