What is the difference between TfidfVectorizer.fit_transform and tfidf.transform?

感情迁移 submitted on 2019-12-11 06:47:25

Question


In TfidfVectorizer.fit_transform we pass only X and do not use y when fitting on the data set. Is this right? We generate the tf-idf matrix from the training-set features only, and ytrain plays no part in fitting the vectorizer. How, then, do we make predictions for the test data set?


Answer 1:


https://datascience.stackexchange.com/a/12346/122 has a good explanation of why the methods are called fit(), transform() and fit_transform().

In short:

  • fit(): Fit the vectorizer/model to the training data and save the vectorizer/model to a variable (returns sklearn.feature_extraction.text.TfidfVectorizer)

  • transform(): Use the fitted vectorizer returned by fit() to transform validation/test data (returns scipy.sparse.csr.csr_matrix)

  • fit_transform(): Sometimes you want to transform the training data directly, so fit() and transform() are combined into a single fit_transform() call (returns scipy.sparse.csr.csr_matrix)
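
The relationship between the three methods can be checked directly. The following is a minimal sketch, assuming default TfidfVectorizer settings; the corpus and the names train_docs, two_step and one_step are illustrative and not part of the original answer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for this comparison.
train_docs = [
    "the quick brown fox jumps over the lazy brown dog",
    "mr brown jumps over the lazy fox",
    "roses are red the chocolates are brown",
]

# Two-step route: fit() learns the vocabulary and IDF weights and returns
# the fitted vectorizer; transform() then turns documents into a matrix.
two_step = TfidfVectorizer().fit(train_docs).transform(train_docs)

# One-step route on a fresh vectorizer: fit and transform in a single call.
one_step = TfidfVectorizer().fit_transform(train_docs)

# Both routes should give the same tf-idf matrix for the training data.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, one_step.shape, same)
```

Either route leaves a fitted vectorizer behind, which is what allows transform() to be applied to unseen documents later.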


E.g.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix  # the scipy.sparse.csr submodule path is deprecated in recent SciPy


# The TfidfVectorizer from sklearn expects a list of strings (one per document) as input.
sent0 = "The quick brown fox jumps over the lazy brown dog .".lower()
sent1 = "Mr brown jumps over the lazy fox .".lower()
sent2 = "Roses are red , the chocolates are brown .".lower()
sent3 = "The frank dog jumps through the red roses .".lower()

dataset = [sent0, sent1, sent2, sent3]

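# Initialize the parameters of the vectorizer. The documents are passed to
# fit()/transform(), not to the constructor: `input` only accepts 'content',
# 'file' or 'filename'. min_df=1 keeps every term that occurs at least once.
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1),
                             min_df=1, stop_words=None)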

[out]:

# Learn the vocabulary and IDF weights from the training data.
>>> vectorizer = vectorizer.fit(dataset)

# Apply the fitted vectorizer to a new sentence.
>>> vectorizer.transform(["The brown roses jumps through the chocholate dog ."])
<1x15 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

# Output to array form.
>>> vectorizer.transform(["The brown roses jumps through the chocholate dog ."]).toarray()
array([[0.        , 0.31342551, 0.        , 0.38714286, 0.        ,
        0.        , 0.31342551, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.38714286, 0.51249178, 0.49104163]])

# Fit and transform the training data in one call (the vectorizer is still fitted in place).
>>> vectorizer.fit_transform(dataset)
<4x15 sparse matrix of type '<class 'numpy.float64'>'
    with 28 stored elements in Compressed Sparse Row format>

>>> vectorizer.fit_transform(dataset).toarray()
array([[0.        , 0.49642852, 0.        , 0.30659399, 0.30659399,
        0.        , 0.24821426, 0.30659399, 0.        , 0.30659399,
        0.38887561, 0.        , 0.        , 0.40586285, 0.        ],
       [0.        , 0.32107915, 0.        , 0.        , 0.39659663,
        0.        , 0.32107915, 0.39659663, 0.50303254, 0.39659663,
        0.        , 0.        , 0.        , 0.26250325, 0.        ],
       [0.76012588, 0.24258925, 0.38006294, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.29964599, 0.29964599, 0.19833261, 0.        ],
       [0.        , 0.        , 0.        , 0.34049544, 0.        ,
        0.4318753 , 0.27566041, 0.        , 0.        , 0.        ,
        0.        , 0.34049544, 0.34049544, 0.45074089, 0.4318753 ]])


>>> type(vectorizer)
<class 'sklearn.feature_extraction.text.TfidfVectorizer'>

>>> type(vectorizer.fit_transform(dataset))
<class 'scipy.sparse.csr.csr_matrix'>

>>> type(vectorizer.transform(dataset))
<class 'scipy.sparse.csr.csr_matrix'>
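
To come back to the original question about y: the vectorizer only turns text into features, so it never needs the labels. The labels enter when a separate model is fitted on the tf-idf matrix. Below is a rough sketch of that workflow, assuming a LogisticRegression classifier (any scikit-learn classifier would do) and hypothetical toy data; X_train, y_train, X_test and the label names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training documents and labels (not from the original post).
X_train = [
    "the quick brown fox jumps over the lazy dog",
    "mr brown jumps over the lazy fox",
    "roses are red and the chocolates are brown",
    "red roses and white lilies bloom in the garden",
]
y_train = ["animals", "animals", "flowers", "flowers"]
X_test = ["the lazy dog jumps over the fox", "the garden roses are red"]

# Step 1: learn vocabulary and IDF weights from the training text only.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)

# Step 2: the labels are used here, when fitting the classifier.
clf = LogisticRegression()
clf.fit(X_train_tfidf, y_train)

# Step 3: reuse the fitted vectorizer on the test text with transform(),
# never fit_transform(), so the columns match the training matrix.
X_test_tfidf = vectorizer.transform(X_test)
predictions = clf.predict(X_test_tfidf)
print(predictions)
```

Calling fit_transform() on the test set instead would relearn the vocabulary from the test documents, and the resulting columns would no longer line up with the features the classifier was trained on.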


Source: https://stackoverflow.com/questions/53027864/what-is-the-difference-between-tfidfvectorizer-fit-transfrom-and-tfidf-transform
