Extracting sentences using pandas with specific words

Posted by 别等时光非礼了梦想 on 2021-02-07 10:09:27

Question


I have an Excel file with a text column. For each row, I need to extract the sentences from the text column that contain specific words.

I have tried defining a function:

import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

#################Reading in excel file#####################

str_df = pd.read_excel("C:\\Users\\HP\\Desktop\\context.xlsx")

################# Defining a function #####################

def sentence_finder(text,word):
    sentences=sent_tokenize(text)
    return [sent for sent in sentences if word in word_tokenize(sent)]
################# Finding Context ##########################
str_df['context'] = str_df['text'].apply(sentence_finder,args=('snakes',))

################# Output file #################################
str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")

But can someone please help me find sentences containing any of several specific words, such as snakes, venomous, and anaconda? A sentence should match if it contains at least one of the words. I am not able to work this out with nltk.tokenize for multiple words.

Words to be searched: ['snakes', 'venomous', 'anaconda']

Input Excel file :

                    text
     1.  Snakes are venomous. Anaconda is venomous.
     2.  Anaconda lives in Amazon.Amazon is a big forest. It is venomous.
     3.  Snakes,snakes,snakes everywhere! Mummyyyyyyy!!!The least I expect is an    anaconda.Because it is venomous.
     4.  Python is dangerous too.

Desired Output :

A column called Context appended to the text column above. The Context column should look like:

 1.  [Snakes are venomous.] [Anaconda is venomous.]
 2.  [Anaconda lives in Amazon.] [It is venomous.]
 3.  [Snakes,snakes,snakes everywhere!] [The least I expect is an    anaconda.Because it is venomous.]
 4.  NULL

Thanks in advance.


Answer 1:


Here's how:

In [1]: searched_words = ['snakes', 'venomous', 'anaconda']

In [2]: df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                       if any(True for w in word_tokenize(sent)
                                              if w.lower() in searched_words)])

0    [Snakes are venomous., Anaconda is venomous.]
1    [Anaconda lives in Amazon.Amazon is a big forest., It is venomous.]
2    [Snakes,snakes,snakes everywhere!, !The least I expect is an anaconda.Because it is venomous.]
3    []
Name: text, dtype: object

You can see there are a couple of issues: sent_tokenize didn't do its job properly because of the punctuation (missing spaces after periods).
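To see why, note that sentence tokenizers rely on whitespace after the sentence-ending punctuation. A minimal sketch with a naive regex splitter (not NLTK's tokenizer, but it exhibits the same limitation) makes the failure visible:

```python
import re

def naive_sent_split(text):
    # Split only where sentence-ending punctuation is followed by whitespace.
    return re.split(r'(?<=[.!?])\s+', text.strip())

# Missing space after the period: the two sentences stay fused.
print(naive_sent_split("Anaconda lives in Amazon.Amazon is a big forest."))
# ['Anaconda lives in Amazon.Amazon is a big forest.']

# With the space, the split works as expected.
print(naive_sent_split("Anaconda lives in Amazon. Amazon is a big forest."))
# ['Anaconda lives in Amazon.', 'Amazon is a big forest.']
```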


Update: handling plurals.

Here's an updated df:

text
Snakes are venomous. Anaconda is venomous.
Anaconda lives in Amazon. Amazon is a big forest. It is venomous.
Snakes,snakes,snakes everywhere! Mummyyyyyyy!!! The least I expect is an anaconda. Because it is venomous.
Python is dangerous too.
I have snakes


df = pd.read_clipboard(sep='0')  # '0' never occurs in the text, so each line is read as a single field

We can use a stemmer (Wikipedia), such as the PorterStemmer.

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

First, let's stem and lowercase the searched words:

searched_words = ['snakes','Venomous','anacondas']
searched_words = [stemmer.stem(w.lower()) for w in searched_words]
searched_words

> ['snake', 'venom', 'anaconda']

Now we can revamp the above to include stemming as well:

print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                           if any(True for w in word_tokenize(sent) 
                                     if stemmer.stem(w.lower()) in searched_words)]))

0    [Snakes are venomous., Anaconda is venomous.]
1    [Anaconda lives in Amazon., It is venomous.]
2    [Snakes,snakes,snakes everywhere!, The least I expect is an anaconda., Because it is venomous.]
3    []
4    [I have snakes]
Name: text, dtype: object

If you only want substring matching, make sure searched_words contains singular forms, not plurals.
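The reason is that a singular form is a substring of its plural, so 'snake' matches both "snake" and "snakes", while 'snakes' would miss the singular. A minimal sketch of the substring test, without NLTK:

```python
searched_words = ['snake', 'venom', 'anaconda']  # singular / stemmed forms

def contains_searched(sent, searched_words):
    # True if any searched word occurs as a substring of any token.
    return any(w2 in w.lower() for w in sent.split() for w2 in searched_words)

print(contains_searched("Snakes are venomous.", searched_words))     # True
print(contains_searched("Python is dangerous too.", searched_words)) # False
```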

print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                     if any(w2.lower() in w.lower()
                                            for w in word_tokenize(sent)
                                            for w2 in searched_words)]))

By the way, this is the point where I'd probably create a function with regular for loops; this lambda with list comprehensions is getting out of hand.
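Such a function might look like the sketch below. It falls back to simple regex-based splitters so it runs without NLTK data; to reproduce the answer exactly, pass in nltk's sent_tokenize and word_tokenize instead:

```python
import re

def find_context(text, searched_words, sent_tokenize=None, word_tokenize=None):
    # Fall back to naive regex splitters when no tokenizers are supplied.
    if sent_tokenize is None:
        sent_tokenize = lambda t: re.split(r'(?<=[.!?])\s+', t.strip())
    if word_tokenize is None:
        word_tokenize = lambda s: re.findall(r'\w+', s)
    matches = []
    for sent in sent_tokenize(text):
        for w in word_tokenize(sent):
            if w.lower() in searched_words:
                matches.append(sent)
                break  # one hit is enough for this sentence
    return matches

print(find_context("Snakes are venomous. Python is dangerous too.",
                   ['snakes', 'venomous', 'anaconda']))
# ['Snakes are venomous.']
```

This could then be applied with `df['text'].apply(find_context, args=(searched_words,))`, mirroring the asker's original `apply` call.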



Source: https://stackoverflow.com/questions/40861341/extracting-sentences-using-pandas-with-specific-words
