Question
I have a set of documents that I would like to transform into a form that lets me count tf-idf for the words in those documents (so that each document is represented by a vector of tf-idf numbers).
I thought it would be enough to call WordNetLemmatizer.lemmatize(word) and then PorterStemmer - but 'have', 'has', 'had', etc. are not transformed to 'have' by the lemmatizer, and the same goes for other words as well. Then I read that I am supposed to provide a hint to the lemmatizer - a tag representing the type of the word: whether it is a noun, verb, adjective, etc.
My question is: how do I get these tags? What am I supposed to run on those documents to get them?
I am using Python 3.4, and I am lemmatizing + stemming a single word at a time. I tried WordNetLemmatizer and EnglishStemmer from nltk, and also stem() from stemming.porter2.
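For illustration, a minimal snippet of the behaviour I mean (not my actual code, just single words with and without a hint):

from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
print(lmtzr.lemmatize('has'))       # 'ha'   - the pos argument defaults to noun
print(lmtzr.lemmatize('had'))       # 'had'  - left unchanged
print(lmtzr.lemmatize('has', 'v'))  # 'have' - correct once a verb hint is given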
Answer 1:
OK, I googled more and I found out how to get these tags. First one has to do some preprocessing to make sure the file will get tokenized properly (in my case it was about removing some stuff left over after the conversion from pdf to txt).
Then the file has to be tokenized into sentences, and each sentence into a word array, which can then be tagged by the nltk tagger. With that, lemmatization can be done, and stemming added on top of it.
from nltk.tokenize import sent_tokenize, word_tokenize
# use sent_tokenize to split text into sentences, and word_tokenize
# to split sentences into words
from nltk.tag import pos_tag
# use this to generate an array of tuples (word, tag);
# the Treebank tag can then be translated into a WordNet tag,
# as in the response the helper below was taken from
from nltk.corpus import wordnet
# needed for the wordnet.ADJ / VERB / NOUN / ADV constants used below
from nltk.stem.wordnet import WordNetLemmatizer
from stemming.porter2 import stem

# code from the response mentioned above:
# map a Penn Treebank tag to the corresponding WordNet POS constant
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

with open(myInput, 'r') as f:  # myInput holds the path to the converted txt file
    data = f.read()

sentences = sent_tokenize(data)
ignoreTypes = ['TO', 'CD', '.', 'LS', '']  # my choice
lmtzr = WordNetLemmatizer()

for sent in sentences:
    words = word_tokenize(sent)
    tags = pos_tag(words)
    for (word, treebank_tag) in tags:
        if treebank_tag in ignoreTypes:
            continue
        tag = get_wordnet_pos(treebank_tag)
        if tag == '':
            continue
        lema = lmtzr.lemmatize(word, tag)
        stemW = stem(lema)
And at this point I get the stemmed word stemW, which I can then write to a file and use to count the tf-idf vectors per document.
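For completeness, a minimal sketch of that last step, assuming the stemmed words of each document have been joined back into a single string per document; I picked scikit-learn's TfidfVectorizer here, but nothing above depends on that choice:

from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical preprocessed documents: one lemmatized + stemmed string each
docs = [
    "first document stem word",
    "second document stem word",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: one tf-idf row per document
print(tfidf.shape)  # (number of documents, vocabulary size)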
Source: https://stackoverflow.com/questions/40568856/how-to-provide-or-generate-tags-for-nltk-lemmatizers