Python remove stop words from pandas dataframe

走了就别回头了 2020-11-29 02:08

I want to remove the stop words from my column "tweets". How do I iterate over each row and each item?

pos_tweets = [('I love this car', 'positive'),


        
4 Answers
  • 2020-11-29 02:49

    If you want something simple that returns a string rather than a list of words:

    test["tweet"].apply(lambda words: ' '.join(word.lower() for word in words.split() if word.lower() not in stop))
    

    Here stop is defined as in the question:

    from nltk.corpus import stopwords
    stop = stopwords.words('english')
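
    Whether capitalized words such as "I" or "This" are dropped depends on whether the membership test is done on the lowercased word, because the nltk list is all lowercase. A minimal sketch of a case-insensitive filter follows; STOP_WORDS is a hypothetical stand-in for the nltk list so the example runs without a download, and remove_stopwords is an illustrative name, not part of any library.

```python
# Hypothetical stand-in for nltk's English list, kept tiny so the sketch runs offline.
STOP_WORDS = {"i", "this", "is", "am", "so", "about", "the", "he", "my"}


def remove_stopwords(text, stop_words=STOP_WORDS):
    """Return text with stopwords dropped, comparing case-insensitively."""
    # A set gives constant-time membership checks, unlike a list.
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)


print(remove_stopwords("I love this car"))       # love car
print(remove_stopwords("This view is amazing"))  # view amazing
```

    A function like this can be passed straight to test["tweet"].apply(remove_stopwords). Converting the real list with set(stop) should give the same results with faster lookups.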
    
  • 2020-11-29 02:52

    We can import stopwords from nltk.corpus as shown below, then exclude them with a Python list comprehension and pandas.Series.apply.

    # Import pandas and the nltk stopword list.
    import pandas as pd
    from nltk.corpus import stopwords
    stop = stopwords.words('english')
    
    pos_tweets = [('I love this car', 'positive'),
        ('This view is amazing', 'positive'),
        ('I feel great this morning', 'positive'),
        ('I am so excited about the concert', 'positive'),
        ('He is my best friend', 'positive')]
    
    test = pd.DataFrame(pos_tweets)
    test.columns = ["tweet","class"]
    
    # Exclude stopwords with a list comprehension applied to each row of the column.
    test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
    print(test)
    # Out[40]:
    #                                tweet     class tweet_without_stopwords
    # 0                    I love this car  positive              I love car
    # 1               This view is amazing  positive       This view amazing
    # 2          I feel great this morning  positive    I feel great morning
    # 3  I am so excited about the concert  positive       I excited concert
    # 4               He is my best friend  positive          He best friend
    

    Stopwords can also be removed with pandas.Series.str.replace.

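    # Match any stopword as a whole word; regex=True is required in recent pandas versions.
    pat = r'\b(?:{})\b'.format('|'.join(stop))
    test['tweet_without_stopwords'] = test['tweet'].str.replace(pat, '', regex=True)
    # Collapse the runs of spaces left behind by the removed words.
    test['tweet_without_stopwords'] = test['tweet_without_stopwords'].str.replace(r'\s+', ' ', regex=True)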
    # Same results.
    # 0              I love car
    # 1       This view amazing
    # 2    I feel great morning
    # 3       I excited concert
    # 4          He best friend
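
    The same pattern-based idea can be sketched with the standard re module alone. This is an illustration under assumptions: stop_words is a hypothetical short list standing in for the real one, and strip_stopwords is a made-up helper name. It escapes each word before joining and trims the leading or trailing space that a removed first or last word would leave.

```python
import re

# Hypothetical short stopword list; swap in the real `stop` list in practice.
stop_words = ["this", "is", "am", "so", "about", "the", "my"]

# Escape each word so characters such as "." or "+" are matched literally.
pattern = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, stop_words))))


def strip_stopwords(text):
    """Delete whole-word matches, then collapse the leftover whitespace."""
    without = pattern.sub("", text)
    return re.sub(r"\s+", " ", without).strip()


print(strip_stopwords("I love this car"))       # I love car
print(strip_stopwords("this view is amazing"))  # view amazing
```

    Applied per row, this would be test['tweet'].apply(strip_stopwords).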
    

    If the import fails because the stopwords corpus is not installed, download it as follows.

    import nltk
    nltk.download('stopwords')
    

    Another option is to use text.ENGLISH_STOP_WORDS from sklearn.feature_extraction.

    # Import stopwords with scikit-learn
    from sklearn.feature_extraction import text
    stop = text.ENGLISH_STOP_WORDS
    

    Note that the scikit-learn and nltk stopword lists contain different numbers of words, so the two can give different results.

  • 2020-11-29 02:52

    Check out pd.DataFrame.replace(). It might work for you:

    In [42]: test.replace(to_replace='I', value="",regex=True)
    Out[42]:
                                  tweet     class
    0                     love this car  positive
    1              This view is amazing  positive
    2           feel great this morning  positive
    3   am so excited about the concert  positive
    4              He is my best friend  positive
    

    Edit: replace() with regex=True matches the pattern anywhere in the string, including inside other words. For example, if rk were a stopword it would also be removed from work, which is usually not what you want.

    Hence the word-boundary regex here:

    for i in stop:
        test = test.replace(to_replace=r'\b%s\b' % i, value="", regex=True)
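
    The substring problem described above can be seen in isolation with plain strings. This is a small sketch, not the answer's code; the sample text and the stopword rk are hypothetical and chosen only to mirror the example.

```python
import re

text = "rk is not work"
word = "rk"  # hypothetical stopword, mirroring the example in the text

# Plain substring replacement also hits the "rk" inside "work".
naive = text.replace(word, "")
print(repr(naive))    # ' is not wo'

# With word boundaries only the standalone token is removed.
bounded = re.sub(r"\b%s\b" % re.escape(word), "", text)
print(repr(bounded))  # ' is not work'
```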
    
  • 2020-11-29 03:07

    Using a list comprehension (each tweet is lowercased and split into words first):

    test['tweet'].apply(lambda x: [item for item in x.lower().split() if item not in stop])
    

    Returns:

    0               [love, car]
    1           [view, amazing]
    2    [feel, great, morning]
    3        [excited, concert]
    4            [best, friend]
    