how to choose parameters in TfidfVectorizer in sklearn during unsupervised clustering

问题

TfidfVectorizer provides an easy way to encode & transform texts into vectors.

My question is how to choose the proper values for parameters such as min_df, max_features, smooth_idf, sublinear_tf?

update:

Maybe I should have put more details on the question:

What if I am doing unsupervised clustering with bunch of texts. and I don't have any labels for the texts & I don't know how many clusters there might be (which is actually what I am trying to figure out)

回答1:

If you are, for instance, using these vectors in a classification task, you can vary these parameters (and of course also the parameters of the classifier) and see which values give you the best performance.

You can do that in sklearn easily with the GridSearchCV and Pipeline objects

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(MultinomialNB(
        fit_prior=True, class_prior=None))),
])
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__estimator__alpha': (1e-2, 1e-3)
}

grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=3)
grid_search_tune.fit(train_x, train_y)

print("Best parameters set:")
print grid_search_tune.best_estimator_.steps

来源：https://stackoverflow.com/questions/44066264/how-to-choose-parameters-in-tfidfvectorizer-in-sklearn-during-unsupervised-clust

标签

python

scikit-learn

nlp

tf-idf

tfidfvectorizer

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!