Removing stop words using spaCy


长情又很酷

I am cleaning a column in my data frame, Sumcription, and am trying to do 3 things:

Tokenize
Lemmatize
Remove stop words

1 Answer

被撕碎了的回忆

2021-02-09 19:15

import spacy
import pandas as pd

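# Load the spaCy model; the parser and NER are not needed for lemmatization.
# Assumes spaCy v3 with the en_core_web_sm model installed (the old 'en'
# shortcut and the parser=/entity= keyword arguments are no longer accepted).
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])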

# New stop words list 
customize_stop_words = [
    'attach'
]

# Mark them as stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True


# Test data
df = pd.DataFrame( {'Sumcription': ["attach poster on the wall because it is cool",
                                   "eating and sleeping"]})

# Convert each row into a spaCy document and keep the lemma of every token
# that is not a stop word. Finally, join the lemmas into a single string.
df['Sumcription_lema'] = df.Sumcription.apply(lambda text: 
                                          " ".join(token.lemma_ for token in nlp(text) 
                                                   if not token.is_stop))

print(df)

Output:

                                    Sumcription  Sumcription_lema
0  attach poster on the wall because it is cool  poster wall cool
1                           eating and sleeping         eat sleep
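
To check which words the stop-word filter will drop before running it over the whole column, a minimal sketch along these lines may help. It assumes spaCy v3 and uses `spacy.blank("en")`, a tokenizer-only pipeline that needs no model download, so it returns the surface text of each token rather than its lemma. The helper name `drop_stop_words` is made up for illustration.

```python
import spacy

# Assumption: spaCy v3. A blank English pipeline only tokenizes, but it still
# carries the default English stop-word list, which is enough for this check.
nlp = spacy.blank("en")

# Same custom stop word as in the answer above
nlp.vocab["attach"].is_stop = True


def drop_stop_words(text):
    """Hypothetical helper: return the tokens of `text` that are not stop words."""
    return [token.text for token in nlp(text) if not token.is_stop]


kept = drop_stop_words("attach poster on the wall because it is cool")
print(kept)
```

If a word you expected to be removed still shows up in the printed list, it is probably missing from the stop-word list and can be flagged the same way as 'attach'.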
