Tfidfvectorizer - How can I check out processed tokens?

Posted by ♀尐吖头ヾ on 2021-01-04 05:40:43

Question


How can I check the strings tokenized inside TfidfVectorizer()? If I don't pass anything in the arguments, TfidfVectorizer() will tokenize the string with some pre-defined methods. I want to observe how it tokenizes strings so that I can tune my model more easily.

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

I want something like this:

>>>vectorizer.get_processed_tokens()
[['this', 'is', 'first', 'document'],
 ['this', 'document', 'is', 'second', 'document'],
 ['this', 'is', 'the', 'third', 'one'],
 ['is', 'this', 'the', 'first', 'document']]

How can I do this?


Answer 1:


build_tokenizer() serves exactly this purpose.

Try this:

tokenizer = lambda docs: [vectorizer.build_tokenizer()(doc) for doc in docs]

tokenizer(corpus)

Output:

[['This', 'is', 'the', 'first', 'document'],
 ['This', 'document', 'is', 'the', 'second', 'document'],
 ['And', 'this', 'is', 'the', 'third', 'one'],
 ['Is', 'this', 'the', 'first', 'document']]

A one-liner solution would be:

list(map(vectorizer.build_tokenizer(), corpus))
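Note that build_tokenizer() only splits the text; it skips the preprocessing (lowercasing) and stop-word steps, which is why the output above keeps the original capitalization. To see the tokens exactly as the vectorizer consumes them (lowercased, as in the question's desired output), build_analyzer() reproduces the full analysis pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = TfidfVectorizer()

# build_analyzer() bundles preprocessing (lowercasing), tokenization,
# stop-word filtering and n-gram generation -- the callable the
# vectorizer applies to each document internally.
analyzer = vectorizer.build_analyzer()
tokens = [analyzer(doc) for doc in corpus]
print(tokens)
# first doc -> ['this', 'is', 'the', 'first', 'document']
```

With default settings no stop words are removed, so 'the' and 'is' remain; pass stop_words='english' to the vectorizer and the analyzer will drop them too.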



Answer 2:


I'm not sure there's a built-in sklearn function to get your output in that format, but a fitted TfidfVectorizer instance has a vocabulary_ attribute that returns a dictionary mapping terms to feature indices. Read more here.

A combination of that and the output of the get_feature_names method should be able to do this for you. Hope it helps.
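A minimal sketch of that combination (inverting vocabulary_ rather than calling get_feature_names, which is deprecated in newer scikit-learn in favor of get_feature_names_out). Note the caveats: terms come out in vocabulary-index order, not document order, and repeated words collapse to a single entry, so this recovers the set of processed terms per document rather than the token sequence:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index in X; invert it
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

# the non-zero columns of each sparse row are the terms present in that doc
docs_terms = [[index_to_term[j] for j in row.nonzero()[1]] for row in X]
print(docs_terms)
```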




Answer 3:


This might not be syntactically correct (I'm doing this from memory), but it's the general idea:

Y = X.toarray()  # densify the sparse tf-idf matrix
Vocab = vectorizer.get_feature_names()
fake_corpus = []
for doc in Y:
    # the non-zero columns of a row are the vocabulary indices
    # of the words present in that document
    l = [Vocab[word_index] for word_index in doc.nonzero()[0]]
    fake_corpus.append(l)

With Y you have the indices of the words in each doc of the corpus, and with Vocab you have the word a given index corresponds to, so you basically just need to combine them.



Source: https://stackoverflow.com/questions/55352301/tfidfvectorizer-how-can-i-check-out-processed-tokens
