I'm trying a classification with Python. I'm using the Naive Bayes MultinomialNB classifier for web pages (retrieving data from the web as text, then classifying this text: web c
The Naive Bayes classifier (MultinomialNB) needs discrete-valued features such as word counts, but PCA breaks this property: its output is continuous. You will have to use a different classifier if you want to use PCA.
There may be other dimensionality reduction methods that work with NB, but I don't know about those. Maybe simple feature selection could work.
Side note: you could try to discretize the features after applying PCA, but I don't think this is a good idea.
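As a rough illustration of the feature-selection idea, here is a minimal sketch that keeps only the k most label-associated count features using a chi-squared score. The toy corpus, the labels, and the choice of k=5 are assumptions made up for the example, and chi-squared selection is just one possible option, not necessarily what would work best for your pages.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the text scraped from web pages.
docs = [
    "football match goal team score",
    "team wins football league goal",
    "stock market shares price trading",
    "market price falls shares investors",
]
labels = ["sport", "sport", "finance", "finance"]

# chi2 ranks each count feature by its association with the labels and
# SelectKBest keeps the top k. The surviving columns are still raw counts,
# so they stay non-negative and MultinomialNB can consume them.
model = make_pipeline(
    CountVectorizer(),
    SelectKBest(chi2, k=5),
    MultinomialNB(),
)
model.fit(docs, labels)

# Inspect the reduced feature matrix that the classifier actually sees.
counts = model.named_steps["countvectorizer"].transform(docs)
selected = model.named_steps["selectkbest"].transform(counts)

prediction = model.predict(["football team scores a goal"])
print(selected.shape, prediction)
```

Unlike PCA, this only drops columns, so the features keep their meaning as word counts.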
The problem is that by applying dimensionality reduction, you generate negative feature values. However, Multinomial NB does not accept negative features. Please refer to this question.
Try another classifier such as RandomForest, or try using sklearn.preprocessing.MinMaxScaler() to scale your training features to [0,1].
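The scaling workaround might look roughly like the sketch below. The toy corpus and the choice of two components are assumptions for illustration only. Note that this merely makes the input acceptable to MultinomialNB; the rescaled PCA components are no longer counts, so whether the model is still a sensible fit is a separate question.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy corpus standing in for the scraped page text.
docs = [
    "football match goal team score",
    "team wins football league goal",
    "stock market shares price trading",
    "market price falls shares investors",
]
labels = ["sport", "sport", "finance", "finance"]

# PCA needs a dense array; that is fine for toy data but costly for real corpora.
counts = CountVectorizer().fit_transform(docs).toarray()

# PCA centers the data, so the projected features contain negative values.
reduced = PCA(n_components=2, random_state=0).fit_transform(counts)
has_negative = bool((reduced < 0).any())

# Rescale each component to the default feature_range of (0, 1).
scaler = MinMaxScaler()
scaled = scaler.fit_transform(reduced)

# MultinomialNB now accepts the input because nothing is negative.
clf = MultinomialNB().fit(scaled, labels)
predictions = clf.predict(scaled)
print(has_negative, scaled.min(), scaled.max())
```

At prediction time, reuse the fitted PCA and scaler with transform() only. Unseen data can still fall slightly outside [0,1], so it may be worth checking for negatives there as well.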
Rather than converting a sparse matrix to dense (which is discouraged), I would use scikit-learn's TruncatedSVD, a PCA-like dimensionality reduction algorithm (using randomized SVD by default) that works on sparse data:
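from sklearn.decomposition import TruncatedSVD

# data is the sparse matrix produced by the vectorizer
svd = TruncatedSVD(n_components=5, random_state=42)
data = svd.fit_transform(data)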
And, citing from the TruncatedSVD documentation:
In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).
which is exactly your use case.
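For completeness, a minimal end-to-end sketch of the LSA route on text might look like this. The toy corpus, the two components, and the use of LogisticRegression are assumptions for the example. The SVD output contains negative values, so a classifier other than MultinomialNB is swapped in here.

```python
from scipy.sparse import issparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus standing in for the scraped page text.
docs = [
    "football match goal team score",
    "team wins football league goal",
    "stock market shares price trading",
    "market price falls shares investors",
]
labels = ["sport", "sport", "finance", "finance"]

# The vectorizer returns a sparse matrix; no .toarray() call is needed.
tfidf = TfidfVectorizer().fit_transform(docs)
input_is_sparse = issparse(tfidf)

# TruncatedSVD accepts the sparse input directly and returns a dense,
# low-dimensional array (this combination is what LSA refers to).
svd = TruncatedSVD(n_components=2, random_state=42)
lsa = svd.fit_transform(tfidf)

# The reduced features can be negative, so use a classifier that allows that.
clf = LogisticRegression().fit(lsa, labels)
predictions = clf.predict(lsa)
print(input_is_sparse, lsa.shape)
```

On a real corpus you would pick n_components much larger than 2 and evaluate on held-out pages rather than on the training data.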