How to convert token list into wordnet lemma list using nltk?

Posted by 淺唱寂寞╮ on 2020-01-17 15:02:16

Question


I have a list of tokens extracted from a PDF source. I am able to pre-process the text and tokenize it, but I want to loop through the tokens and convert each token in the list to its lemma in the WordNet corpus. So, my token list looks like this:

['0000', 'Everyone', 'age', 'remembers', 'Þ', 'rst', 'heard', 'contest', 'I', 'sitting', 'hideout', 'watching', ...]

There are no lemmas for words like 'Everyone', '0000', 'Þ' and many more, which I need to eliminate. But for words like 'age', 'remembers', 'heard', etc., the token list is supposed to look like:

['age', 'remember', 'hear', ...]

I am checking the synsets of a word with this code:

syns = wn.synsets("heard")
print(syns[0].lemmas()[0].name())  # prints 'hear'

At this point I have created the function clean_text() in Python for preprocessing. It looks like this:

def clean_text(text):
    # Eliminating punctuation
    text = "".join([word for word in text if word not in string.punctuation])
    # tokenizing
    tokens = re.split(r"\W+", text)
    # lemmatizing and removing stopwords
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # converting token list into synsets
    syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
    return text

I am getting this error:

syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
AttributeError: 'list' object has no attribute 'lower'

How do I get the lemma for each token in the list?

The full code:

import string
import re
from wordcloud import WordCloud
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import wordnet
import PyPDF4
import matplotlib
import numpy as np
from PIL import Image

stopwords = nltk.corpus.stopwords.words('english')
moreStopwords = ['clin97803078874365pallr1indd']  # additional stopwords to be removed manually
stopwords += moreStopwords  # extend the stopword list so these are actually filtered out
wn = nltk.WordNetLemmatizer()

data = PyPDF4.PdfFileReader(open('ReadyPlayerOne.pdf', 'rb'))
pageData = ''
for page in data.pages:
    pageData += page.extractText()
# print(pageData)


def clean_text(text):
    text = "".join([word for word in text if word not in string.punctuation])
    tokens = re.split(r"\W+", text)
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    syns = [text.lemmas()[0].name() for text in wordnet.synsets(text)]
    return syns


print(clean_text(pageData))

Answer 1:


You are calling wordnet.synsets(text) with a list of words (check what text is at that point), but it must be called with a single word. The preprocessing inside wordnet.synsets tries to apply .lower() to its argument, hence the error (AttributeError: 'list' object has no attribute 'lower').
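
For illustration, the failure is easy to reproduce in isolation (a minimal sketch, assuming the NLTK wordnet corpus is downloaded):

from nltk.corpus import wordnet

# synsets() expects a single string; it calls .lower() on its argument internally
print(wordnet.synsets("heard"))    # works: a list of Synset objects
wordnet.synsets(["heard", "age"])  # AttributeError: 'list' object has no attribute 'lower'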

Below is a working version of clean_text with this problem fixed:

import string
import re
import nltk
from nltk.corpus import wordnet

stopwords = nltk.corpus.stopwords.words('english')
wn = nltk.WordNetLemmatizer()

def clean_text(text):
    text = "".join([word for word in text if word not in string.punctuation])
    tokens = re.split(r"\W+", text)
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    lemmas = []
    for token in text:
        lemmas += [synset.lemmas()[0].name() for synset in wordnet.synsets(token)]
    return lemmas


text = "The grass was greener."

print(clean_text(text))

Returns:

['grass', 'Grass', 'supergrass', 'eatage', 'pot', 'grass', 'grass', 'grass', 'grass', 'grass', 'denounce', 'green', 'green', 'green', 'green', 'fleeceable']
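
Note that this collects the first lemma of every synset matching each token, which is why the output contains duplicates and loose synonyms such as 'supergrass' and 'pot'. If you would rather keep exactly one lemma per token, a minimal variant (the helper name first_lemma is just for illustration) could use only the first synset and fall back to the token itself:

from nltk.corpus import wordnet

def first_lemma(token):
    # First lemma of the first synset, or the token unchanged
    # when WordNet has no entry for it.
    synsets = wordnet.synsets(token)
    return synsets[0].lemmas()[0].name() if synsets else token

print([first_lemma(t) for t in ['age', 'remembers', 'heard']])
# expected: ['age', 'remember', 'hear']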


Source: https://stackoverflow.com/questions/53416780/how-to-convert-token-list-into-wordnet-lemma-list-using-nltk
