Question
I am preprocessing text data, but I am running into an issue with lemmatization. Below is the sample text:
'An 18-year-old boy was referred to prosecutors Thursday for allegedly stealing about ¥15 million ($134,300) worth of cryptocurrency last year by hacking a digital currency storage website, police said.', 'The case is the first in Japan in which criminal charges have been pursued against a hacker over cryptocurrency losses, the police said.', '\n', 'The boy, from the city of Utsunomiya, Tochigi Prefecture, whose name is being withheld because he is a minor, allegedly stole the money after hacking Monappy, a website where users can keep the virtual currency monacoin, between Aug. 14 and Sept. 1 last year.', 'He used software called Tor that makes it difficult to identify who is accessing the system, but the police identified him by analyzing communication records left on the website’s server.', 'The police said the boy has admitted to the allegations, quoting him as saying, “I felt like I’d found a trick no one knows and did it as if I were playing a video game.”', 'He took advantage of a weakness in a feature of the website that enables a user to transfer the currency to another user, knowing that the system would malfunction if transfers were repeated over a short period of time.', 'He repeatedly submitted currency transfer requests to himself, overwhelming the system and allowing him to register more money in his account.', 'About 7,700 users were affected and the operator will compensate them.', 'The boy later put the stolen monacoins in an account set up by a different cryptocurrency operator, received payouts in a different cryptocurrency and bought items such as a smartphone, the police said.', 'According to the operator of Monappy, the stolen monacoins were kept using a system with an always-on internet connection, and those kept offline were not stolen.'
My code is:
import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
df = pd.read_csv('All Articles.csv')
df['Articles'] = df['Articles'].str.lower()
stemming = PorterStemmer()
stops = set(stopwords.words('english'))
lemma = WordNetLemmatizer()
def identify_tokens(row):
    Articles = row['Articles']
    tokens = nltk.word_tokenize(Articles)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words
df['words'] = df.apply(identify_tokens, axis=1)
def stem_list(row):
    my_list = row['words']
    stemmed_list = [stemming.stem(word) for word in my_list]
    return stemmed_list
df['stemmed_words'] = df.apply(stem_list, axis=1)
def lemma_list(row):
    my_list = row['stemmed_words']
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return lemma_list
df['lemma_words'] = df.apply(lemma_list, axis=1)
def remove_stops(row):
    my_list = row['lemma_words']
    meaningful_words = [w for w in my_list if w not in stops]
    return meaningful_words
df['stem_meaningful'] = df.apply(remove_stops, axis=1)
def rejoin_words(row):
    my_list = row['stem_meaningful']
    joined_words = " ".join(my_list)
    return joined_words
df['processed'] = df.apply(rejoin_words, axis=1)
As is clear from the code, I am using pandas; the sample text above is what the 'Articles' column contains. My problem area is:
def lemma_list(row):
    my_list = row['stemmed_words']
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return lemma_list
df['lemma_words'] = df.apply(lemma_list, axis=1)
Though the code runs without any error, the lemmatization step is not working as expected.
Thanks in Advance.
Answer 1:
In your code above you are trying to lemmatize words that have already been stemmed. When the lemmatizer runs into a word it doesn't recognize, it simply returns that word unchanged. For instance, stemming 'offline' produces 'offlin', and running 'offlin' through the lemmatizer just gives back the same word, 'offlin'.
Your code should be modified to lemmatize the original, unstemmed words, like this...
def lemma_list(row):
    my_list = row['words']  # Note: line that is changed
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return lemma_list
df['lemma_words'] = df.apply(lemma_list, axis=1)
print('Words: ', df.loc[0, 'words'])
print('Stems: ', df.loc[0, 'stemmed_words'])
print('Lemmas: ', df.loc[0, 'lemma_words'])
This produces...
Words: ['and', 'those', 'kept', 'offline', 'were', 'not', 'stolen']
Stems: ['and', 'those', 'kept', 'offlin', 'were', 'not', 'stolen']
Lemmas: ['and', 'those', 'keep', 'offline', 'be', 'not', 'steal']
Which is correct.
Source: https://stackoverflow.com/questions/58618352/how-to-pass-part-of-speech-in-wordnetlemmatizer