This question is based on another question I asked, where I didn't cover the problem entirely: Pandas - check if a string column contains a pair of strings
This is my answer using comprehensions and zip.
Note that this checks for substrings in df1, so a string from df2 also matches when it occurs inside a longer word.
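import numpy as np

c = df1.consumption.values.tolist()
f = df2.food.values.tolist()
a = df2.creature.values.tolist()
# check[i, j] is True when both strings from row j of df2 occur in row i of df1
check = np.array([[fd in cs and cr in cs for fd, cr in zip(f, a)] for cs in c])
# True for each row of df1 that matches at least one row of df2
check.any(axis=1)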
array([ True, False, True, False, False, True, False, True, False], dtype=bool)
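For anyone who wants to run the comprehension approach end to end, here is a minimal sketch. The two frames are an assumption: they are rebuilt from the outputs shown in this thread, not copied from the original question, and the split of each pair into `creature` and `food` is a guess based on the words themselves.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the outputs in this thread (an assumption).
df1 = pd.DataFrame({"consumption": [
    "squirrel ate apple", "monkey likes apple", "monkey banana gets",
    "badger gets banana", "giraffe eats grass", "badger apple loves",
    "elephant is huge", "elephant eats banana tree", "squirrel digs in grass",
]})
df2 = pd.DataFrame({
    "creature": ["squirrel", "badger", "monkey", "elephant"],
    "food": ["apple", "apple", "banana", "banana"],
})

c = df1.consumption.tolist()
pairs = list(zip(df2.food.tolist(), df2.creature.tolist()))

# check[i, j] is True when both strings of df2 row j occur in df1 row i
check = np.array([[fd in cs and cr in cs for fd, cr in pairs] for cs in c])
matches = check.any(axis=1)
print(matches.tolist())
# [True, False, True, False, False, True, False, True, False]

# Substring caveat: `in` also matches inside longer words.
print("apple" in "squirrel ate pineapple")  # True
```

If partial-word matches are not wanted, splitting each sentence into words first (as the approaches further down do) avoids them.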
This is a pandas version of what @MaxU did. Respect what he did... it is awesome!
X = df1.consumption.str.get_dummies(' ')
Y = (df2.creature + ' ' + df2.food).str.get_dummies(' ') \
    .reindex(columns=X.columns, fill_value=0)
# This is where you can see which rows from `df2` (columns)
# matched with which rows from `df1` (rows)
XY = X.dot(Y.T)
print(XY)
   0  1  2  3
0  2  1  0  0
1  1  1  1  0
2  0  0  2  1
3  0  1  1  1
4  0  0  0  0
5  1  2  0  0
6  0  0  0  1
7  0  0  1  2
8  1  0  0  0
# return the desired `True`s and `False`s
XY.gt(1).any(axis=1)
0     True
1    False
2     True
3    False
4    False
5     True
6    False
7     True
8    False
dtype: bool
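As a self-contained sketch of the `get_dummies` approach on current pandas: `reindex_axis` has been removed from pandas, so this uses `reindex(columns=...)`, and it passes `axis=1` by keyword. The sample frames are again reconstructed from the outputs in this thread, which is an assumption.

```python
import pandas as pd

# Sample data reconstructed from the outputs in this thread (an assumption).
df1 = pd.DataFrame({"consumption": [
    "squirrel ate apple", "monkey likes apple", "monkey banana gets",
    "badger gets banana", "giraffe eats grass", "badger apple loves",
    "elephant is huge", "elephant eats banana tree", "squirrel digs in grass",
]})
df2 = pd.DataFrame({
    "creature": ["squirrel", "badger", "monkey", "elephant"],
    "food": ["apple", "apple", "banana", "banana"],
})

# One indicator column per word that occurs in df1
X = df1.consumption.str.get_dummies(" ")
# Same encoding for each creature/food pair, aligned to X's columns
Y = (df2.creature + " " + df2.food).str.get_dummies(" ") \
    .reindex(columns=X.columns, fill_value=0)

# XY[i, j] counts how many words of pair j occur in sentence i
XY = X.dot(Y.T)
match = XY.gt(1).any(axis=1)
print(match.tolist())
# [True, False, True, False, False, True, False, True, False]
```

Because the indicators are 0/1, a count of 2 can only come from both words of a pair being present, provided the creature and food in a pair are different words.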
Consider this vectorized approach:
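import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
# learn the vocabulary from df1 and count each word per row
X = vect.fit_transform(df1.consumption)
# encode each creature/food pair with the same vocabulary
Y = vect.transform(df2.creature + ' ' + df2.food)
# keep rows whose dot product with some pair exceeds 1 (both pair words present)
res = np.ravel(np.any((X.dot(Y.T) > 1).todense(), axis=1))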
Result:
In [67]: res
Out[67]: array([ True, False, True, False, False, True, False, True, False], dtype=bool)
Explanation:
In [68]: pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
Out[68]:
   apple  ate  badger  banana  digs  eats  elephant  gets  giraffe  grass  huge  in  is  likes  loves  monkey  squirrel  tree
0      1    1       0       0     0     0         0     0        0      0     0   0   0      0      0       0         1     0
1      1    0       0       0     0     0         0     0        0      0     0   0   0      1      0       1         0     0
2      0    0       0       1     0     0         0     1        0      0     0   0   0      0      0       1         0     0
3      0    0       1       1     0     0         0     1        0      0     0   0   0      0      0       0         0     0
4      0    0       0       0     0     1         0     0        1      1     0   0   0      0      0       0         0     0
5      1    0       1       0     0     0         0     0        0      0     0   0   0      0      1       0         0     0
6      0    0       0       0     0     0         1     0        0      0     1   0   1      0      0       0         0     0
7      0    0       0       1     0     1         1     0        0      0     0   0   0      0      0       0         0     1
8      0    0       0       0     1     0         0     0        0      1     0   1   0      0      0       0         1     0
In [69]: pd.DataFrame(Y.toarray(), columns=vect.get_feature_names_out())
Out[69]:
   apple  ate  badger  banana  digs  eats  elephant  gets  giraffe  grass  huge  in  is  likes  loves  monkey  squirrel  tree
0      1    0       0       0     0     0         0     0        0      0     0   0   0      0      0       0         1     0
1      1    0       1       0     0     0         0     0        0      0     0   0   0      0      0       0         0     0
2      0    0       0       1     0     0         0     0        0      0     0   0   0      0      0       1         0     0
3      0    0       0       1     0     0         1     0        0      0     0   0   0      0      0       0         0     0
UPDATE:
In [92]: df1['match'] = np.ravel(np.any((X.dot(Y.T) > 1).todense(), axis=1))
In [93]: df1
Out[93]:
                 consumption  match
0         squirrel ate apple   True
1         monkey likes apple  False
2         monkey banana gets   True
3         badger gets banana  False
4         giraffe eats grass  False
5         badger apple loves   True
6           elephant is huge  False
7  elephant eats banana tree   True
8     squirrel digs in grass  False
9        squirrel.eats/apple   True  # <----- NOTE
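The row marked NOTE matches because `CountVectorizer` lowercases the text and, by default, extracts tokens with the pattern `(?u)\b\w\w+\b`, so punctuation such as `.` and `/` separates words just like a space does. The sketch below imitates that tokenization in plain Python to show the effect. The helper names `tokens` and `matches_any_pair` are hypothetical, the pairs are reconstructed from the `Y` matrix above, and the code tests set membership rather than reproducing the sparse dot product.

```python
import re

# CountVectorizer's documented default token_pattern (words of 2+ characters)
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokens(text):
    """Lowercase the text and return its set of word tokens (hypothetical helper)."""
    return set(TOKEN.findall(text.lower()))

# Pairs reconstructed from the Y matrix shown above (an assumption).
pairs = [("squirrel", "apple"), ("badger", "apple"),
         ("monkey", "banana"), ("elephant", "banana")]

def matches_any_pair(text):
    """True when both words of at least one pair occur as tokens in the text."""
    words = tokens(text)
    return any(creature in words and food in words for creature, food in pairs)

print(sorted(tokens("squirrel.eats/apple")))      # ['apple', 'eats', 'squirrel']
print(matches_any_pair("squirrel.eats/apple"))    # True
print(matches_any_pair("squirrel digs in grass")) # False
```

One difference worth keeping in mind: `CountVectorizer` stores counts by default, so a sentence that repeats a single pair word (for example "apple apple") would also give a dot product of 2. Constructing it with `binary=True` should avoid that, since every nonzero count is then stored as 1.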