How to extract exact matches with list from a dataframe column?

自作多情 提交于 2021-02-17 05:06:51

问题


I have a large dataframe with text that I want to use to find matches from a list of words (around 1k words in there).

I have managed to get the absence/presence of the word from the list in the dataframe, but it is also important to me to know which word matched. Sometimes there is exact match with more than one word from the list, I would like to have them all.

I tried to use the code below, but it gives me partial matches - syllables instead of full words.

#this is a code to recreate the initial DF

import pandas as pd

df_data= [['orange','0'],
['apple and lemon','1'],
['lemon and orange','1']]

df= pd.DataFrame(df_data,columns=['text','match','exact word'])

Initial DF:

 text                 match
 orange               0
 apple and lemon      1
 lemon and orange     1

This is the list of words I need to match

 exactmatch = ['apple', 'lemon']

Expected result:

 text                    match  exact words
 orange                    0         0 
 apple and lemon           1        'apple','lemon'
 lemon and orange          1        'lemon'

This is what I've tried:

# for some rows it gives me words I want, 
#and for some it gives me parts of the word

#regex attempt 1, gives me partial matches (syllables or single letters)

pattern1 = '|'.join(exactmatch)
df['contains'] = df['text'].str.extract("(" + "|".join(exactmatch) 
+")", expand=False)

#regex attempt 2 - this gives me an error - unexpected EOL

df['contains'] = df['text'].str.extractall
("(" + "|".join(exactmatch) +")").unstack().apply(','.join, 1)

#TypeError: ('sequence item 1: expected str instance, float found', 
#'occurred at index 2')

#no regex attempt, does not give me matches if the word is in there

lst = list(df['text'])
match = []
for w in lst:
 if w in exactmatch:
    match.append(w)
    break

回答1:


Use str.findall

Ex:

exactmatch = ['apple', 'lemon']
df_data= [['orange'],['apple and lemon',],['lemon and orange'],]

df= pd.DataFrame(df_data,columns=['text'])
df['exact word'] = df["text"].str.findall(r"|".join(exactmatch)).apply(", ".join)
print(df)

Output:

               text    exact word
0            orange              
1   apple and lemon  apple, lemon
2  lemon and orange         lemon


来源:https://stackoverflow.com/questions/57147634/how-to-extract-exact-matches-with-list-from-a-dataframe-column

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!