Question
I want to get the 10 highest-weighted words for each topic. After applying TfidfTransformer I get the output below, whose type is scipy.sparse.csr.csr_matrix.
But I don't know how to get the top ten entries from each list. In the data, (0, ****) refers to list (row) 0, and so on up to (5170, *****), which refers to list 5170.
I tried converting it to a NumPy array, but that fails.
(0, 19016) 0.024214182003181053
(0, 28002) 0.03661443306612277
(0, 6710) 0.02292100371816788
(0, 27683) 0.013973969726506812
(0, 27104) 0.02236713272585597
(0, 6889) 0.0403281034949193
.
.
.
(5169, 3236) 0.014432449220428715
(5169, 19134) 0.014346823328868169
(5169, 32915) 0.002047199186262409
(5170, 35899) 0.49931779368675605
(5170, 36444) 0.3479717717856863
(5170, 15014) 0.5608169649159123
Answer 1:
You can use TfidfVectorizer, which exposes the get_feature_names method (superseded by get_feature_names_out in scikit-learn 1.0 and later). The transformer doesn't have this method, but the docs clearly state that the vectorizer is equivalent to CountVectorizer followed by the transformer. If you don't want to use it, then I think you're going to be stuck building a lookup before you vectorize.
TfidfVectorizer in the docs: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
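As a rough illustration of that equivalence, here is a minimal sketch. The three-sentence corpus is made up for the example, and the getattr fallback is only there because the method name depends on the installed scikit-learn version.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, standing in for the real documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# One step: the vectorizer tokenizes, counts and applies tf-idf.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Column index -> word lookup. Newer scikit-learn versions provide
# get_feature_names_out(); older ones only have get_feature_names().
get_names = (
    getattr(vectorizer, "get_feature_names_out", None)
    or vectorizer.get_feature_names
)
names = list(get_names())

# Two steps: CountVectorizer followed by TfidfTransformer.
counts = CountVectorizer().fit_transform(docs)
X_two_step = TfidfTransformer().fit_transform(counts)

# With default settings both routes should give the same weights.
same = np.allclose(X.toarray(), X_two_step.toarray())
```

Here names[j] is the word behind column j of X, which is the lookup the (row, column) pairs in the question are missing.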
Edit: to sort and slice the output of fit_transform from the TfidfVectorizer, normal sparse matrix operations should work.
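One way this could look, as a sketch rather than the only approach: walk the CSR matrix row by row and sort each row's stored values. The helper name top_n_per_row and the tiny hand-built matrix are assumptions made for the example; in practice you would pass the fit_transform output and the vectorizer's feature names.

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_n_per_row(matrix, feature_names, n=10):
    """For each row, return up to n (word, weight) pairs, highest weight first."""
    matrix = csr_matrix(matrix)  # make sure we have CSR row pointers
    results = []
    for i in range(matrix.shape[0]):
        # indptr marks where row i's stored entries start and end.
        start, end = matrix.indptr[i], matrix.indptr[i + 1]
        cols = matrix.indices[start:end]
        vals = matrix.data[start:end]
        # Sort this row's stored weights in descending order, keep the first n.
        order = np.argsort(vals)[::-1][:n]
        results.append(
            [(feature_names[c], float(v)) for c, v in zip(cols[order], vals[order])]
        )
    return results


# Hypothetical stand-in for the tf-idf matrix and its vocabulary.
names = ["apple", "banana", "cherry", "date"]
X = csr_matrix(np.array([
    [0.1, 0.0, 0.7, 0.2],
    [0.0, 0.5, 0.0, 0.3],
]))

top = top_n_per_row(X, names, n=2)
# top[0] is [("cherry", 0.7), ("date", 0.2)]
# top[1] is [("banana", 0.5), ("date", 0.3)]
```

Only the nonzero entries of each row are touched, so the full matrix never has to be converted to a dense NumPy array.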
Source: https://stackoverflow.com/questions/53193422/sklearn-how-to-get-the-10-words-from-each-topic