Sklearn: how to get the top 10 words from each topic

Posted by 佐手、 on 2021-01-29 14:46:03

Question


I want to get the ten highest-weighted words for each topic. After applying TfidfTransformer, I get the output listed further down; its type is scipy.sparse.csr.csr_matrix.

But I don't know how to get the ten highest values from each list. In the data, (0, ****) refers to list 0, and so on up to (5170, *****), which refers to list 5170.

I've tried to convert it to a NumPy array, but that fails.

  (0, 19016)    0.024214182003181053
  (0, 28002)    0.03661443306612277
  (0, 6710)     0.02292100371816788
  (0, 27683)    0.013973969726506812
  (0, 27104)    0.02236713272585597
  (0, 6889)     0.0403281034949193
.
.
.
  (5169, 3236)  0.014432449220428715
  (5169, 19134) 0.014346823328868169
  (5169, 32915) 0.002047199186262409
  (5170, 35899) 0.49931779368675605
  (5170, 36444) 0.3479717717856863
  (5170, 15014) 0.5608169649159123

Answer 1:


You can use TfidfVectorizer instead, which exposes the get_feature_names method (get_feature_names_out in scikit-learn 1.0 and later). The transformer doesn't have this method, but the docs state that TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer. If you don't want to switch, then I think you're going to be stuck building your own lookup from column index to word before you vectorize.

TfidfVectorizer in the docs: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
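As a minimal sketch of that equivalence, the snippet below runs both routes on three made-up sentences and then reads the vocabulary from the vectorizer. The sample documents and the variable names are assumptions for illustration and do not come from the question. The hasattr check is there only because the method name differs between scikit-learn versions.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical sample documents, standing in for the asker's corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Two-step route: raw counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: the vectorizer does both and keeps the vocabulary.
vectorizer = TfidfVectorizer()
one_step = vectorizer.fit_transform(docs)

# With default settings, both routes should give the same matrix.
same = np.allclose(two_step.toarray(), one_step.toarray())

# Column index -> word. Newer releases use get_feature_names_out,
# older ones only have get_feature_names.
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = list(vectorizer.get_feature_names_out())
else:
    feature_names = list(vectorizer.get_feature_names())

print(same, feature_names)
```

Position i in feature_names corresponds to column i of the matrix, which is the second number in each (row, column) pair printed in the question.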

Edit: to sort and slice the output of fit_transform from the TfidfVectorizer, normal sparse matrix operations should work.
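One possible way to do that sorting and slicing, sketched under the assumption that the matrix is in (or can be converted to) CSR format: walk each row through the indptr, indices and data arrays, sort the stored weights, and keep the largest k. The helper name top_k_per_row and the small hand-made matrix are hypothetical; with real data you would pass in the fit_transform output and the vectorizer's feature names.

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_k_per_row(matrix, feature_names, k=10):
    """For each row, return up to k (word, weight) pairs, highest weight first."""
    matrix = csr_matrix(matrix)  # make sure indptr/indices/data are row-oriented
    results = []
    for row in range(matrix.shape[0]):
        # Stored (non-zero) entries of this row live in data[start:end].
        start, end = matrix.indptr[row], matrix.indptr[row + 1]
        cols = matrix.indices[start:end]
        weights = matrix.data[start:end]
        # Indices of the k largest weights, in descending order.
        order = np.argsort(weights)[::-1][:k]
        results.append(
            [(feature_names[c], float(w)) for c, w in zip(cols[order], weights[order])]
        )
    return results


# Hypothetical 2 x 4 matrix standing in for the TF-IDF output.
names = ["apple", "banana", "cherry", "date"]
tfidf = csr_matrix(np.array([
    [0.1, 0.0, 0.7, 0.2],
    [0.0, 0.5, 0.0, 0.3],
]))

top2 = top_k_per_row(tfidf, names, k=2)
print(top2)
# [[('cherry', 0.7), ('date', 0.2)], [('banana', 0.5), ('date', 0.3)]]
```

Working row by row on the sparse arrays avoids calling toarray() on the whole matrix, which could be expensive for a matrix with thousands of rows and tens of thousands of columns like the one in the question.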



Source: https://stackoverflow.com/questions/53193422/sklearn-how-to-get-the-10-words-from-each-topic
