Difference between vocabulary and get_features() of TfidfVectorizer?

Submitted by 被刻印的时光 ゝ on 2019-12-01 12:02:09

Question


I have

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Train the vectorizer on a single document
text = "this is a simple example"
singleTFIDF = TfidfVectorizer(ngram_range=(1, 2)).fit([text])
singleTFIDF.vocabulary_  # dict mapping each n-gram to its column index

# Transform the training string into its tf-idf representation
single = singleTFIDF.transform([text])
single.toarray()

I would like to associate each value in single with its corresponding feature. What is the structure of single? How can I map the position of a value in single to its feature?

How should I interpret the indices of vocabulary_ and get_feature_names()? Are they related? According to the documentation, both hold the features together with indices, which I find confusing.


Answer 1:


The attribute vocabulary_ is a dictionary whose keys are the n-grams and whose values are the column positions of each n-gram (feature) in the tf-idf matrix. The method get_feature_names() returns a list in which the n-grams are ordered by column position, so the two hold the same information and either can be used to determine which tf-idf column corresponds to which feature. (In scikit-learn 1.0 and later, get_feature_names() is deprecated in favour of get_feature_names_out(), which returns a NumPy array instead of a list; the old method was removed in 1.2.) In the example below, the tf-idf matrix is converted to a pandas data frame, using the output of get_feature_names() to name the columns. Also note that all features receive the same weight, because each n-gram occurs exactly once in the single document, and that the squares of the weights sum to one, because each row is L2-normalized by default: with seven features, each weight is 1/sqrt(7) ≈ 0.377964.

singleTFIDF.vocabulary_
Out[41]: 
{'this': 5,
 'is': 1,
 'simple': 3,
 'example': 0,
 'this is': 6,
 'is simple': 2,
 'simple example': 4}

singleTFIDF.get_feature_names()
Out[42]: ['example', 'is', 'is simple', 'simple', 'simple example', 'this', 'this is']
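
To make the relationship between the two outputs concrete, here is a minimal sketch in plain Python that starts from the dictionary shown above and rebuilds the feature list from it, then goes back the other way. It does not call scikit-learn; the names vocabulary, feature_names and rebuilt_vocabulary are illustrative and not part of the library.

```python
# Mapping copied from the vocabulary_ output above: n-gram -> column index.
vocabulary = {
    'this': 5, 'is': 1, 'simple': 3, 'example': 0,
    'this is': 6, 'is simple': 2, 'simple example': 4,
}

# Sorting the n-grams by their column index rebuilds the feature list.
feature_names = sorted(vocabulary, key=vocabulary.get)

# Going the other way, enumerate() over the list rebuilds the dictionary.
rebuilt_vocabulary = {name: column for column, name in enumerate(feature_names)}

print(feature_names)
print(rebuilt_vocabulary == vocabulary)
```

The first print reproduces the list returned by get_feature_names(), and the second prints True, which is the sense in which the two are the same mapping read in opposite directions.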

import pandas as pd
df = pd.DataFrame(single.toarray(), columns=singleTFIDF.get_feature_names())

df
Out[48]: 
    example        is  is simple    simple  simple example      this   this is
0  0.377964  0.377964   0.377964  0.377964        0.377964  0.377964  0.377964
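
On the question of what single actually is: transform() returns a SciPy sparse matrix with one row per document and one column per feature. The sketch below is one possible way to pair each stored value with its n-gram without going through pandas. It inverts vocabulary_ rather than calling get_feature_names(), so it should not depend on which of the two feature-name methods the installed scikit-learn version provides. The names vectorizer, column_to_feature and weights are illustrative.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

text = "this is a simple example"
vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit([text])
single = vectorizer.transform([text])  # SciPy sparse matrix, one row per document

# Invert vocabulary_ (n-gram -> column) into column -> n-gram.
column_to_feature = {col: ngram for ngram, col in vectorizer.vocabulary_.items()}

# COO format exposes parallel arrays: row indices, column indices, values.
coo = single.tocoo()
weights = {
    column_to_feature[int(col)]: float(value)
    for col, value in zip(coo.col, coo.data)
}

for ngram, value in sorted(weights.items()):
    print(f"{ngram!r}: {value:.6f}")

# With seven equally frequent features and L2 normalization,
# each weight is expected to be 1/sqrt(7).
print(f"1/sqrt(7) = {1 / math.sqrt(7):.6f}")
```

For a corpus with several documents, the same loop could also read coo.row to tell which document each value belongs to.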


Source: https://stackoverflow.com/questions/54335229/difference-between-vocabulary-and-get-features-of-tfidfvectorizer
