Question
I have an Excel file with a text column. All I need to do is extract, for each row, the sentences from the text column that contain specific words.
I have tried defining a function:
import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

################# Reading in Excel file ####################
str_df = pd.read_excel("C:\\Users\\HP\\Desktop\\context.xlsx")

################# Defining a function ######################
def sentence_finder(text, word):
    # Split the text into sentences and keep those whose tokens
    # contain the searched word.
    sentences = sent_tokenize(text)
    return [sent for sent in sentences if word in word_tokenize(sent)]

################# Finding context ##########################
str_df['context'] = str_df['text'].apply(sentence_finder, args=('snakes',))

################# Output file ##############################
str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")
But can someone please help me find the sentences that contain any of several specific words, such as snakes, venomous, or anaconda? A sentence should be kept if it contains at least one of these words. I am not able to work out how to do this with nltk.tokenize for multiple words.
Words to be searched = ['snakes','venomous','anaconda']
Input Excel file:
text
1. Snakes are venomous. Anaconda is venomous.
2. Anaconda lives in Amazon.Amazon is a big forest. It is venomous.
3. Snakes,snakes,snakes everywhere! Mummyyyyyyy!!!The least I expect is an anaconda.Because it is venomous.
4. Python is dangerous too.
Desired output:
A column called Context appended to the text column above. The Context column should look like:
1. [Snakes are venomous.] [Anaconda is venomous.]
2. [Anaconda lives in Amazon.] [It is venomous.]
3. [Snakes,snakes,snakes everywhere!] [The least I expect is an anaconda.Because it is venomous.]
4. NULL
Thanks in advance.
Answer 1:
Here's how:
In [1]: df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                       if any(True for w in word_tokenize(sent)
                                              if w.lower() in searched_words)])
0 [Snakes are venomous., Anaconda is venomous.]
1 [Anaconda lives in Amazon.Amazon is a big forest., It is venomous.]
2 [Snakes,snakes,snakes everywhere!, !The least I expect is an anaconda.Because it is venomous.]
3 []
Name: text, dtype: object
You can see that there are a couple of issues: the sent_tokenizer didn't do its job properly because of the punctuation (there is no space after the period in "Amazon.Amazon", for example).
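If you need to work around that, one option is to insert a space after sentence-ending punctuation before tokenizing. This is my own workaround, not part of the original answer; normalize_spacing is a hypothetical helper:

import re
from nltk.tokenize import sent_tokenize

def normalize_spacing(text):
    # Insert a space after '.', '!' or '?' when a letter follows directly,
    # e.g. "Amazon.Amazon" -> "Amazon. Amazon".
    return re.sub(r'([.!?])([A-Za-z])', r'\1 \2', text)

print(sent_tokenize(normalize_spacing(
    "Anaconda lives in Amazon.Amazon is a big forest. It is venomous.")))
# ['Anaconda lives in Amazon.', 'Amazon is a big forest.', 'It is venomous.']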
Update: handling plurals.
Here's an updated df:
text
Snakes are venomous. Anaconda is venomous.
Anaconda lives in Amazon. Amazon is a big forest. It is venomous.
Snakes,snakes,snakes everywhere! Mummyyyyyyy!!! The least I expect is an anaconda. Because it is venomous.
Python is dangerous too.
I have snakes
df = pd.read_clipboard(sep='0')  # '0' never occurs in the text, so each line is read as a single field
We can use a stemmer, such as the PorterStemmer.
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
First, let's stem and lowercase the searched words:
searched_words = ['snakes','Venomous','anacondas']
searched_words = [stemmer.stem(w.lower()) for w in searched_words]
searched_words
> ['snake', 'venom', 'anaconda']
Now we can revamp the above to include stemming as well:
print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                     if any(True for w in word_tokenize(sent)
                                            if stemmer.stem(w.lower()) in searched_words)]))
0 [Snakes are venomous., Anaconda is venomous.]
1 [Anaconda lives in Amazon., It is venomous.]
2 [Snakes,snakes,snakes everywhere!, The least I expect is an anaconda., Because it is venomous.]
3 []
4 [I have snakes]
Name: text, dtype: object
If you only want substring matching, make sure the entries in searched_words are singular, not plural.
print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                     if any(w2.lower() in w.lower()
                                            for w in word_tokenize(sent)
                                            for w2 in searched_words)]))
By the way, this is the point where I'd probably create a function with regular for loops; this lambda with list comprehensions is getting out of hand.
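For illustration, such a function might look like the sketch below. It is my own rewrite of the stemmed version above, not part of the original answer, and find_context is a hypothetical name:

def find_context(text, searched_words, stemmer):
    # Collect every sentence containing at least one searched word,
    # comparing stemmed, lowercased tokens against the stemmed search list.
    matches = []
    for sent in sent_tokenize(text):
        for w in word_tokenize(sent):
            if stemmer.stem(w.lower()) in searched_words:
                matches.append(sent)
                break  # one hit is enough for this sentence
    return matches

df['context'] = df['text'].apply(find_context, args=(searched_words, stemmer))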
Source: https://stackoverflow.com/questions/40861341/extracting-sentences-using-pandas-with-specific-words