Pandas - check if a string column in one dataframe contains a pair of strings from another dataframe

情书的邮戳 2020-12-30 18:23

This question is based on another question I asked, where I didn't cover the problem entirely: Pandas - check if a string column contains a pair of strings

This is

2 answers
  • 2020-12-30 18:27

    This is my answer using comprehensions and zip.
    Note that this checks for substrings in df1, so a partial word also counts as a match.

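    import numpy as np
    
    c = df1.consumption.values.tolist()
    f = df2.food.values.tolist()
    a = df2.creature.values.tolist()
    
    # rows: sentences in df1; columns: (food, creature) pairs from df2
    check = np.array([[fd in cs and cr in cs for fd, cr in zip(f, a)] for cs in c])
    
    # True where a sentence contains both strings of at least one pair
    check.any(axis=1)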
    
    array([ True, False,  True, False, False,  True, False,  True, False], dtype=bool)
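For readers who want to run the comprehension approach end to end, here is a minimal sketch. The thread never prints `df1` and `df2` directly, so the sample frames below are an assumption, reconstructed from the outputs shown elsewhere on this page.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the outputs in this thread (an assumption).
df1 = pd.DataFrame({"consumption": [
    "squirrel ate apple",
    "monkey likes apple",
    "monkey banana gets",
    "badger gets banana",
    "giraffe eats grass",
    "badger apple loves",
    "elephant is huge",
    "elephant eats banana tree",
    "squirrel digs in grass",
]})
df2 = pd.DataFrame({
    "creature": ["squirrel", "badger", "monkey", "elephant"],
    "food": ["apple", "apple", "banana", "banana"],
})

pairs = list(zip(df2["creature"], df2["food"]))

# One row per df1 sentence, one column per df2 pair.
# A cell is True when both strings of the pair occur in the sentence.
check = np.array([[cr in cs and fd in cs for cr, fd in pairs]
                  for cs in df1["consumption"]])

# A sentence matches if any pair is fully contained in it.
df1["match"] = check.any(axis=1)
print(df1)
```

Because `in` on strings is a substring test, a sentence such as "pineapple" would also satisfy a pair that contains "apple".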
    

    This is a pandas version of what @MaxU did. Respect what he did... it is awesome!

    X = df1.consumption.str.get_dummies(' ')
    # `reindex_axis` was removed in pandas 1.0; `reindex(columns=...)` replaces it
    Y = (df2.creature + ' ' + df2.food).str.get_dummies(' ') \
        .reindex(columns=X.columns, fill_value=0)
    
    # This is where you can see which rows from `df2` (columns)
    # matched with which rows from `df1` (rows) 
    XY = X.dot(Y.T)
    
    print(XY)
    
       0  1  2  3
    0  2  1  0  0
    1  1  1  1  0
    2  0  0  2  1
    3  0  1  1  1
    4  0  0  0  0
    5  1  2  0  0
    6  0  0  0  1
    7  0  0  1  2
    8  1  0  0  0
    
    # return the desired `True`s and `False`s
    
    XY.gt(1).any(axis=1)
    
    0     True
    1    False
    2     True
    3    False
    4    False
    5     True
    6    False
    7     True
    8    False
    dtype: bool
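The reason `XY.gt(1)` works is that each cell of `XY` counts how many of a pair's two words appear in a sentence, so a value of 2 means both words are present. The plain-Python sketch below reproduces those counts with set intersections instead of a matrix product. The helper name `shared_tokens` and the sample lists are illustrative assumptions, reconstructed from the outputs in this thread.

```python
def shared_tokens(sentence, creature, food):
    """Count how many of the pair's words appear as whole words in the sentence."""
    return len(set(sentence.split()) & {creature, food})

# Sample data reconstructed from the outputs in this thread (an assumption).
sentences = [
    "squirrel ate apple",
    "monkey likes apple",
    "monkey banana gets",
    "badger gets banana",
    "giraffe eats grass",
    "badger apple loves",
    "elephant is huge",
    "elephant eats banana tree",
    "squirrel digs in grass",
]
pairs = [("squirrel", "apple"), ("badger", "apple"),
         ("monkey", "banana"), ("elephant", "banana")]

# Same layout as XY: rows are sentences, columns are pairs.
counts = [[shared_tokens(s, cr, fd) for cr, fd in pairs] for s in sentences]

# A sentence matches when some pair contributes both of its words.
matches = [max(row) > 1 for row in counts]

for row in counts:
    print(row)
print(matches)
```

Unlike the substring check in the first snippet, this compares whole space-separated words, which is also what `str.get_dummies(' ')` does.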
    


  • 2020-12-30 18:38

    Consider this vectorized approach:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    
    vect = CountVectorizer()
    
    X = vect.fit_transform(df1.consumption)
    Y = vect.transform(df2.creature + ' ' + df2.food)
    
    res = np.ravel(np.any((X.dot(Y.T) > 1).todense(), axis=1))
    

    Result:

    In [67]: res
    Out[67]: array([ True, False,  True, False, False,  True, False,  True, False], dtype=bool)
    

    Explanation:

    In [68]: pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
    Out[68]:
       apple  ate  badger  banana  digs  eats  elephant  gets  giraffe  grass  huge  in  is  likes  loves  monkey  squirrel  tree
    0      1    1       0       0     0     0         0     0        0      0     0   0   0      0      0       0         1     0
    1      1    0       0       0     0     0         0     0        0      0     0   0   0      1      0       1         0     0
    2      0    0       0       1     0     0         0     1        0      0     0   0   0      0      0       1         0     0
    3      0    0       1       1     0     0         0     1        0      0     0   0   0      0      0       0         0     0
    4      0    0       0       0     0     1         0     0        1      1     0   0   0      0      0       0         0     0
    5      1    0       1       0     0     0         0     0        0      0     0   0   0      0      1       0         0     0
    6      0    0       0       0     0     0         1     0        0      0     1   0   1      0      0       0         0     0
    7      0    0       0       1     0     1         1     0        0      0     0   0   0      0      0       0         0     1
    8      0    0       0       0     1     0         0     0        0      1     0   1   0      0      0       0         1     0
    
    In [69]: pd.DataFrame(Y.toarray(), columns=vect.get_feature_names_out())
    Out[69]:
       apple  ate  badger  banana  digs  eats  elephant  gets  giraffe  grass  huge  in  is  likes  loves  monkey  squirrel  tree
    0      1    0       0       0     0     0         0     0        0      0     0   0   0      0      0       0         1     0
    1      1    0       1       0     0     0         0     0        0      0     0   0   0      0      0       0         0     0
    2      0    0       0       1     0     0         0     0        0      0     0   0   0      0      0       1         0     0
    3      0    0       0       1     0     0         1     0        0      0     0   0   0      0      0       0         0     0
    

    UPDATE:

    In [92]: df1['match'] = np.ravel(np.any((X.dot(Y.T) > 1).todense(), axis=1))
    
    In [93]: df1
    Out[93]:
                     consumption  match
    0         squirrel ate apple   True
    1         monkey likes apple  False
    2         monkey banana gets   True
    3         badger gets banana  False
    4         giraffe eats grass  False
    5         badger apple loves   True
    6           elephant is huge  False
    7  elephant eats banana tree   True
    8     squirrel digs in grass  False
    9        squirrel.eats/apple   True   # <----- NOTE
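The row flagged with NOTE matches because `CountVectorizer` splits text on punctuation as well as whitespace. The sketch below imitates that with the standard `re` module only. It assumes the vectorizer's documented defaults (lowercasing and the token pattern `(?u)\b\w\w+\b`), and the helper `tokenize` is a hypothetical name, not part of scikit-learn.

```python
import re

# Assumed to mirror CountVectorizer's defaults: lowercase the text,
# then keep runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Split text into lowercase word tokens, treating punctuation as a separator."""
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("squirrel.eats/apple"))
print(tokenize("squirrel ate apple"))
```

Both calls yield the tokens `squirrel` and `apple`, which is why the punctuated row is treated like the space-separated ones. Note that this pattern drops single-character tokens.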
    