I'm trying a classification with Python. I'm using the Naive Bayes MultinomialNB classifier for web pages (retrieving data from the web as text, then classifying this text: web c
Rather than converting a sparse matrix to dense (which is discouraged), I would use scikit-learn's TruncatedSVD, a PCA-like dimensionality reduction algorithm (it uses randomized SVD by default) that works directly on sparse data:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=5, random_state=42)
data = svd.fit_transform(data)  # sparse matrix in, dense array with 5 columns out
And, citing from the TruncatedSVD documentation:
In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).
which is exactly your use case.
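To make this concrete, here is a minimal end-to-end sketch of the idea. The toy corpus, the variable names, and `n_components=3` are my own assumptions for illustration; your scraped page text would take the place of `docs`, and you would likely want more components on real data.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the text scraped from web pages.
docs = [
    "python machine learning classification",
    "naive bayes text classification",
    "web pages scraped into plain text",
    "sparse matrices save memory",
    "dimensionality reduction with svd",
    "latent semantic analysis of documents",
]

# The vectorizer returns a SciPy sparse matrix (one row per document).
tfidf = TfidfVectorizer().fit_transform(docs)
print(sparse.issparse(tfidf), tfidf.shape[0])

# TruncatedSVD consumes the sparse matrix directly; no dense conversion needed.
# n_components=3 is an arbitrary choice for this tiny example.
svd = TruncatedSVD(n_components=3, random_state=42)
reduced = svd.fit_transform(tfidf)

# The result is a small dense NumPy array: one row per document,
# one column per retained component.
print(type(reduced).__name__, reduced.shape)
```

One thing to check on your side: the reduced features can contain negative values, which MultinomialNB does not accept, so you may need a different classifier after this step.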