Removing stop words using spaCy

长情又很酷 2021-02-09 18:24

I am cleaning a column in my data frame, Sumcription, and am trying to do 3 things:

  1. Tokenize
  2. Lemmatize
  3. Remove stop words

1 Answer
  • 2021-02-09 19:15
    import spacy
    import pandas as pd
    
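    # Load the spaCy model without the parser and NER components, which are
    # not needed here. The 'en' shortcut and the parser=False / entity=False
    # keywords only work on old spaCy releases; this form assumes the
    # en_core_web_sm model is installed.
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])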
    
    # New stop words list 
    customize_stop_words = [
        'attach'
    ]
    
    # Mark them as stop words
    for w in customize_stop_words:
        nlp.vocab[w].is_stop = True
    
    
    # Test data
    df = pd.DataFrame( {'Sumcription': ["attach poster on the wall because it is cool",
                                       "eating and sleeping"]})
    
    # Convert each row into a spaCy document and keep the lemma of every token
    # that is not a stop word. Finally, join the lemmas into a single string.
    df['Sumcription_lema'] = df.Sumcription.apply(lambda text:
                                              " ".join(token.lemma_ for token in nlp(text)
                                                       if not token.is_stop))

    print(df)
    

    Output:

       Sumcription                                   Sumcription_lema
    0  attach poster on the wall because it is cool  poster wall cool
    1                           eating and sleeping         eat sleep
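
If it helps to see the three steps from the question (tokenize, lemmatize, remove stop words) without a model download, here is a rough dependency-free sketch. It is not spaCy: `STOP_WORDS`, `CUSTOM_STOP_WORDS`, `LEMMAS` and `clean_text` are hypothetical names, and the tiny stop list and lemma table are toy stand-ins for what spaCy provides, chosen only to cover the two test rows above.

```python
import re

# Hypothetical toy resources standing in for spaCy's stop list and lemmatizer.
STOP_WORDS = {"on", "the", "because", "it", "is", "and"}
CUSTOM_STOP_WORDS = {"attach"}
LEMMAS = {"eating": "eat", "sleeping": "sleep"}


def clean_text(text, stop_words=frozenset(STOP_WORDS | CUSTOM_STOP_WORDS)):
    """Tokenize, drop stop words, and lemmatize the remaining tokens."""
    # 1. Tokenize: lowercase the text and pull out runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # 2. Remove stop words, checking the surface form of each token.
    kept = [t for t in tokens if t not in stop_words]
    # 3. Lemmatize with the toy lookup table; unknown words pass through as-is.
    return " ".join(LEMMAS.get(t, t) for t in kept)


rows = ["attach poster on the wall because it is cool", "eating and sleeping"]
cleaned = [clean_text(r) for r in rows]
print(cleaned)
```

The filter runs before the lemma lookup, mirroring the `if not token.is_stop` check in the lambda above. With pandas the same function could be applied as `df.Sumcription.apply(clean_text)`.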
    