Lemmatization of all pandas cells

暖寄归人 2021-01-02 08:51

I have a pandas dataframe. There is one column, let's name it 'col'. Each entry of this column is a list of words: ['word1', 'word2', ...].

How can I efficiently lemmatize every word in each of these lists?

2 Answers
  • 2021-01-02 09:29

    You can use apply from pandas with a function to lemmatize each word in the given string. Note that there are many ways to tokenize your text; if you use a whitespace tokenizer, you may have to remove symbols like '.' yourself.

    Below is an example of how to lemmatize a column of an example dataframe.

    import nltk
    import pandas as pd

    # WordNetLemmatizer needs the WordNet corpus: nltk.download('wordnet')
    w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
    lemmatizer = nltk.stem.WordNetLemmatizer()

    def lemmatize_text(text):
        # Split the string on whitespace and lemmatize each token
        return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

    df = pd.DataFrame(['this was cheesy', 'she likes these books', 'wow this is great'],
                      columns=['text'])
    df['text_lemmatized'] = df.text.apply(lemmatize_text)
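
    As noted above, a whitespace tokenizer leaves punctuation such as '.' attached to words. A minimal sketch of one way around that, assuming you are happy to tokenize on word characters instead (this variant is not part of the original answer):

    # Hypothetical variant: tokenize on word characters so punctuation
    # like '.' or ',' never reaches the lemmatizer.
    regex_tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

    def lemmatize_text_no_punct(text):
        return [lemmatizer.lemmatize(w) for w in regex_tokenizer.tokenize(text)]

    df['text_lemmatized'] = df.text.apply(lemmatize_text_no_punct)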
    
  • 2021-01-02 09:31
    Suppose the column looks like this, with each cell holding a list of words:
    |col|
    ['Sushi Bars', 'Restaurants']
    ['Burgers', 'Fast Food', 'Restaurants']
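
    For reference, a minimal sketch of how such a dataframe could be built for testing (the name dataset matches the code below and is an assumption):

    import pandas as pd

    # Toy dataframe mirroring the rows shown above (illustrative only)
    dataset = pd.DataFrame(
        {'col': [['Sushi Bars', 'Restaurants'],
                 ['Burgers', 'Fast Food', 'Restaurants']]}
    )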
    
    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()

    The code below defines a function that takes a list of words and returns a list of lemmatized words; applying it to the column should work.

    def lemmatize(s):
        '''Lemmatize each word in the list s.'''
        # WordNetLemmatizer works on single words, so map it over the list
        s = [wnl.lemmatize(word) for word in s]
        return s

    dataset = dataset.assign(col_lemma=dataset.col.apply(lambda x: lemmatize(x)))
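
    A quick usage check on a plain list (the output shown is what WordNetLemmatizer returns for lowercase plural nouns with its default noun POS):

    print(lemmatize(['burgers', 'restaurants']))  # ['burger', 'restaurant']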
    