Difference between vocabulary_ and get_feature_names() of TfidfVectorizer?

两盒软妹~` submitted on 2019-12-02 07:20:54

The attribute vocabulary_ is a dictionary whose keys are the n-grams and whose values are the column positions of each n-gram (feature) in the tf-idf matrix. The method get_feature_names() returns the same information as a list, with the n-grams ordered by column position. (In scikit-learn 1.0 and later the equivalent method is get_feature_names_out(); get_feature_names() was removed in 1.2.) You can therefore use either one to determine which tf-idf column corresponds to which feature. In the example below, the tf-idf matrix is converted to a pandas data frame, using the output of get_feature_names() to name the columns. Also note that all seven features receive the same weight, 1/sqrt(7) ≈ 0.377964: each n-gram occurs once in the single document and rows are L2-normalized by default, so the sum of the squares of the weights equals one.

singleTFIDF.vocabulary_
Out[41]: 
{'this': 5,
 'is': 1,
 'simple': 3,
 'example': 0,
 'this is': 6,
 'is simple': 2,
 'simple example': 4}

singleTFIDF.get_feature_names()
Out[42]: ['example', 'is', 'is simple', 'simple', 'simple example', 'this', 'this is']
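
The two outputs are inverse views of the same mapping, which can be checked directly. The following is a minimal sketch, not the original poster's code: the input sentence and ngram_range=(1, 2) are assumptions chosen to reproduce the output above (the fitting code is not shown), and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed input. The default token pattern drops one-letter tokens,
# so "a" never reaches the vocabulary and "is simple" becomes a bigram.
docs = ["this is a simple example"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

# get_feature_names() was removed in scikit-learn 1.2; use the newer
# get_feature_names_out() when it is available.
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = list(vectorizer.get_feature_names_out())
else:
    feature_names = vectorizer.get_feature_names()

# Rebuild the list from the dictionary by sorting n-grams by column index.
names_from_vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Both directions agree: name -> column (dict) and column -> name (list).
same_order = names_from_vocab == feature_names
column_of_this_is = vectorizer.vocabulary_["this is"]
name_at_that_column = feature_names[column_of_this_is]
```

Use vocabulary_ when you start from an n-gram and need its column, and the feature-name list when you start from a column and need its n-gram.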

import pandas as pd
df = pd.DataFrame(single.toarray(), columns=singleTFIDF.get_feature_names())

df
Out[48]: 
    example        is  is simple    simple  simple example      this   this is
0  0.377964  0.377964   0.377964  0.377964        0.377964  0.377964  0.377964
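
For a version that runs end to end, the sketch below rebuilds the data frame and checks the normalization claim. It rests on the same assumptions as before: the input sentence and ngram_range=(1, 2) are guesses that reproduce the table above, and all variable names are illustrative rather than taken from the original session.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this is a simple example"]  # assumed input, one document
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Column names in column order; fall back for scikit-learn < 1.0.
if hasattr(vectorizer, "get_feature_names_out"):
    names = list(vectorizer.get_feature_names_out())
else:
    names = vectorizer.get_feature_names()

df = pd.DataFrame(tfidf.toarray(), columns=names)

# With the default norm="l2", each row has unit Euclidean length.
row = df.iloc[0].to_numpy()
squared_norm = float(np.sum(row ** 2))
expected_weight = 1.0 / np.sqrt(len(names))  # equal weights on every feature
```

With more than one document the weights would generally differ between columns, but each row would still have a squared norm of one as long as the default L2 normalization is kept.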