Does it make sense to use both CountVectorizer and TfidfVectorizer as feature vectors for text clustering with KMeans?

Submitted by 我们两清 on 2019-12-23 23:15:18

Question


I am trying to build feature vectors from a CSV file that contains about 1000 comments. One of my feature vectors is TF-IDF, built with scikit-learn's TfidfVectorizer. Does it make sense to also use raw counts as a feature vector, or is there a better feature vector that I should use?

And if I do end up using both CountVectorizer and TfidfVectorizer for my features, how should I fit them both into my KMeans model (specifically the km.fit() part)? For now I am only able to fit the TF-IDF feature vectors into the model.

Here is my code:

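from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# TF-IDF features: one row per comment, one column per term
vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)

# Raw count features (commented out: not yet fitted into the model)
#count_vectorizer = CountVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
#count_vectorized = count_vectorizer.fit_transform(sentence_list)

km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)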

Answer 1:


Essentially, what you are doing is finding a numeric representation of your text documents (feature engineering). In some problems raw counts work better, and in others the TF-IDF representation is the best choice, so you should really try them both. The two representations are very similar and therefore carry approximately the same information, but it could be the case that you get better results by using the full set of features (TF-IDF + counts). It is possible that you can get closer to the true model by searching in this larger feature space.
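One way to "try them both" is to cluster each representation separately and compare an internal quality score. The sketch below is only an illustration: the toy `sentence_list`, the value of `num_clusters`, and the choice of the silhouette score as the comparison metric are assumptions, not part of the original question. Silhouette scores computed in different feature spaces are not strictly comparable, so treat the numbers as a rough sanity check rather than a verdict.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the ~1000 comments loaded from the CSV file.
sentence_list = [
    "the battery life of this phone is great",
    "battery drains fast and the phone gets hot",
    "great phone with a long lasting battery",
    "the pizza was cold and the delivery was late",
    "late delivery but the pizza tasted great",
    "cold pizza and slow delivery again",
]
num_clusters = 2  # assumed value for this toy corpus

vectorizers = {
    "tfidf": TfidfVectorizer(stop_words="english"),
    "counts": CountVectorizer(stop_words="english"),
}

scores = {}
for name, vec in vectorizers.items():
    # Build the sparse document-term matrix for this representation.
    features = vec.fit_transform(sentence_list)
    km = KMeans(n_clusters=num_clusters, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(features)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters.
    scores[name] = silhouette_score(features, labels)

for name, score in scores.items():
    print(f"{name}: silhouette = {score:.3f}")
```

If you have even a small set of hand-labelled comments, checking the clusters against those labels is usually more informative than an internal score.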

This is how you can horizontally stack your features:

import scipy.sparse

X = scipy.sparse.hstack([vectorized, count_vectorized])

Then you can just do:

model.fit(X, y)  # y is optional in some models; for KMeans, km.fit(X) is enough
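Putting the question's code and the stacking step together, an end-to-end version might look like the sketch below. The toy `sentence_list` and `num_clusters = 2` are placeholders for the asker's own data, and `random_state=0` is added only to make the run repeatable. `format="csr"` is requested because `scipy.sparse.hstack` otherwise returns a COO matrix.

```python
import scipy.sparse
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical stand-in for the comments loaded from the CSV file.
sentence_list = [
    "the battery life of this phone is great",
    "battery drains fast and the phone gets hot",
    "great phone with a long lasting battery",
    "the pizza was cold and the delivery was late",
    "late delivery but the pizza tasted great",
    "cold pizza and slow delivery again",
]
num_clusters = 2  # assumed value for this toy corpus

# Same vectorizer settings as in the question.
params = dict(min_df=1, max_df=0.9, stop_words="english", decode_error="ignore")
tfidf_vectorizer = TfidfVectorizer(**params)
count_vectorizer = CountVectorizer(**params)

vectorized = tfidf_vectorizer.fit_transform(sentence_list)
count_vectorized = count_vectorizer.fit_transform(sentence_list)

# Stack the two matrices side by side: same rows (documents),
# TF-IDF columns followed by count columns.
X = scipy.sparse.hstack([vectorized, count_vectorized], format="csr")

km = KMeans(n_clusters=num_clusters, init="k-means++", n_init=10, random_state=0)
km.fit(X)  # KMeans is unsupervised, so no y is needed

print(X.shape)     # (number of documents, tfidf columns + count columns)
print(km.labels_)  # cluster index assigned to each comment
```

One thing to keep in mind: TfidfVectorizer L2-normalizes each row by default, while raw counts are unnormalized, so on longer documents the count columns may dominate the Euclidean distances that KMeans uses. Whether to rescale one block before stacking is a judgment call worth testing on your own data.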


Source: https://stackoverflow.com/questions/27496014/does-it-make-sense-to-use-both-countvectorizer-and-tfidfvectorizer-as-feature-ve
