Difference between vocabulary and get_features() of TfidfVectorizer?

Question

I have the following code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Train the vectorizer
text="this is a simple example"
singleTFIDF = TfidfVectorizer(ngram_range=(1,2)).fit([text])
singleTFIDF.vocabulary_ # dict mapping each n-gram to its column index in the matrix

# Analyse the training string - text
single=singleTFIDF.transform([text])
single.toarray()

I would like to associate each value in single with its corresponding feature. What is the structure of single, and how can I map the position of a value in single to a feature?

How should I interpret the indices in vocabulary_ and get_feature_names()? Are they related? According to the documentation, both provide the features together with indices, which I find confusing.

Answer 1:

The attribute vocabulary_ is a dictionary whose keys are the n-grams and whose values are the column positions of each n-gram (feature) in the tf-idf matrix. The method get_feature_names() returns a list of the same n-grams ordered by column position, so the n-gram at list index i is the feature in column i. You can therefore use either one to determine which tf-idf column corresponds to which feature. In the example below, the output of get_feature_names() is used to name the columns when the tf-idf matrix is converted to a pandas data frame. Also note that all values have the same weight: each of the seven n-grams occurs exactly once in the single document and each row is L2-normalized by default, so every weight is 1/sqrt(7) ≈ 0.377964 and the squares of the weights sum to one.

singleTFIDF.vocabulary_
Out[41]: 
{'this': 5,
 'is': 1,
 'simple': 3,
 'example': 0,
 'this is': 6,
 'is simple': 2,
 'simple example': 4}

singleTFIDF.get_feature_names()
Out[42]: ['example', 'is', 'is simple', 'simple', 'simple example', 'this', 'this is']
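
The two outputs above encode the same mapping in opposite directions. As a minimal sketch, the check below inverts vocabulary_ and compares it with the name list. The helper feature_names() is a hypothetical convenience, not part of scikit-learn: it is added here because get_feature_names() was deprecated in scikit-learn 1.0 and later removed in favour of get_feature_names_out(), so the call used in this answer may not exist in a current installation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = "this is a simple example"
vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit([text])


def feature_names(vec):
    """Hypothetical helper: return the feature names as a plain list.

    Uses get_feature_names_out() when available (scikit-learn >= 1.0)
    and falls back to the older get_feature_names() otherwise.
    """
    if hasattr(vec, "get_feature_names_out"):
        return list(vec.get_feature_names_out())
    return list(vec.get_feature_names())


names = feature_names(vectorizer)

# Invert vocabulary_ (n-gram -> column index) into a list of n-grams
# ordered by column index.
names_from_vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Both views should describe the same column order.
print(names_from_vocab == names)
```

If the comparison holds, vocabulary_[names[i]] == i for every column i, which is why either object can be used to label the columns.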

import pandas as pd
df = pd.DataFrame(single.toarray(), columns=singleTFIDF.get_feature_names())

df
Out[48]: 
    example        is  is simple    simple  simple example      this   this is
0  0.377964  0.377964   0.377964  0.377964        0.377964  0.377964  0.377964
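
For the part of the question about the structure of single: transform() returns a SciPy sparse matrix with one row per document and one column per feature. The sketch below is one possible way to pair each stored value with its n-gram without pandas, and to check the normalization claim. The names column_to_feature and weights are illustrative and not part of any library.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text = "this is a simple example"
vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit([text])
single = vectorizer.transform([text])  # sparse matrix: 1 document x 7 features

# Invert vocabulary_ so that a column index looks up its n-gram.
column_to_feature = {int(col): ngram for ngram, col in vectorizer.vocabulary_.items()}

# COO format exposes (row, column, value) triples for every stored entry.
coo = single.tocoo()
weights = {}
for doc, col, value in zip(coo.row, coo.col, coo.data):
    weights[column_to_feature[int(col)]] = float(value)
    print(int(doc), column_to_feature[int(col)], round(float(value), 6))

# With the default L2 normalization, the squared weights of a row sum to one.
print(np.isclose(sum(v * v for v in weights.values()), 1.0))
```

Iterating over the COO triples touches only the stored entries, which matters once the matrix has many documents and a large vocabulary where toarray() would be wasteful.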

Source: https://stackoverflow.com/questions/54335229/difference-between-vocabulary-and-get-features-of-tfidfvectorizer

Tags

python

scikit-learn

tfidfvectorizer