问题
I have a large dataframe with text that I want to use to find matches from a list of words (around 1k words in there).
I have managed to get the absence/presence of the word from the list in the dataframe, but it is also important to me to know which word matched. Sometimes there is exact match with more than one word from the list, I would like to have them all.
I tried to use the code below, but it gives me partial matches - syllables instead of full words.
#this is a code to recreate the initial DF
import pandas as pd
df_data= [['orange','0'],
['apple and lemon','1'],
['lemon and orange','1']]
df= pd.DataFrame(df_data,columns=['text','match','exact word'])
Initial DF:
text match
orange 0
apple and lemon 1
lemon and orange 1
This is the list of words I need to match
exactmatch = ['apple', 'lemon']
Expected result:
text match exact words
orange 0 0
apple and lemon 1 'apple','lemon'
lemon and orange 1 'lemon'
This is what I've tried:
# for some rows it gives me words I want,
#and for some it gives me parts of the word
#regex attempt 1, gives me partial matches (syllables or single letters)
pattern1 = '|'.join(exactmatch)
df['contains'] = df['text'].str.extract("(" + "|".join(exactmatch)
+")", expand=False)
#regex attempt 2 - this gives me an error - unexpected EOL
df['contains'] = df['text'].str.extractall
("(" + "|".join(exactmatch) +")").unstack().apply(','.join, 1)
#TypeError: ('sequence item 1: expected str instance, float found',
#'occurred at index 2')
#no regex attempt, does not give me matches if the word is in there
lst = list(df['text'])
match = []
for w in lst:
if w in exactmatch:
match.append(w)
break
回答1:
Use str.findall
Ex:
exactmatch = ['apple', 'lemon']
df_data= [['orange'],['apple and lemon',],['lemon and orange'],]
df= pd.DataFrame(df_data,columns=['text'])
df['exact word'] = df["text"].str.findall(r"|".join(exactmatch)).apply(", ".join)
print(df)
Output:
text exact word
0 orange
1 apple and lemon apple, lemon
2 lemon and orange lemon
来源:https://stackoverflow.com/questions/57147634/how-to-extract-exact-matches-with-list-from-a-dataframe-column