Is it possible to apply PCA to any text classification task?


I'm trying a classification with Python. I'm using the Naive Bayes MultinomialNB classifier for web pages (retrieving data from the web as text, then classifying this text: web c

3 Answers

一整个雨季

2021-02-05 15:19

The multinomial Naive Bayes classifier needs discrete-valued, non-negative features (such as term counts), but PCA breaks this property of the features. You will have to use a different classifier if you want to use PCA.

There may be other dimensionality reduction methods that work with NB, but I don't know about those. Maybe simple feature selection could work.
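As a rough illustration of the feature selection idea, here is a minimal sketch using scikit-learn's `SelectKBest` with the `chi2` score. The toy corpus, the labels and `k=5` are made up for the example; they are not from the question. The point is that selection only drops columns, so the remaining features are still counts and MultinomialNB accepts them.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the scraped web pages.
docs = [
    "cheap flights and hotel deals",
    "book hotel rooms and flights online",
    "football match results and league scores",
    "league table and football transfer news",
]
labels = ["travel", "travel", "sport", "sport"]

# chi2 keeps the k terms most associated with the labels. The selected
# columns are still raw counts, so MultinomialNB can be fitted on them.
model = make_pipeline(
    CountVectorizer(),
    SelectKBest(chi2, k=5),  # k=5 is an arbitrary choice for this sketch
    MultinomialNB(),
)
model.fit(docs, labels)

# Inspect the reduced feature matrix: 4 documents, 5 selected terms.
counts = model.named_steps["countvectorizer"].transform(docs)
selected = model.named_steps["selectkbest"].transform(counts)
print(selected.shape)
print(model.predict(["hotel and flights"]))
```

On real data `k` would need tuning, for example with cross-validation.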

Side note: you could try to discretize the features after applying PCA, but I don't think this is a good idea.

一整个雨季

2021-02-05 15:23

The problem is that applying dimensionality reduction generates negative feature values, and multinomial NB does not accept negative features. Please refer to this question.

Try another classifier, such as RandomForest, or use sklearn.preprocessing.MinMaxScaler() to scale your training features to [0, 1].
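A small sketch of the scaling route, under some assumptions: the corpus and labels are invented, and `TruncatedSVD` stands in for the dimensionality reduction step because it accepts the sparse matrix directly. It shows the reduced features going negative and `MinMaxScaler` mapping them back into [0, 1] so that `MultinomialNB.fit` no longer rejects them.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy corpus standing in for the scraped web pages.
docs = [
    "cheap flights and hotel deals",
    "book hotel rooms and flights online",
    "football match results and league scores",
    "league table and football transfer news",
]
labels = ["travel", "travel", "sport", "sport"]

tfidf = TfidfVectorizer().fit_transform(docs)

# The reduced components are real-valued and can be negative.
reduced = TruncatedSVD(n_components=2, random_state=42).fit_transform(tfidf)
print(reduced.min())

# Rescale each component to [0, 1] so MultinomialNB accepts the input.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(reduced)
clf = MultinomialNB().fit(scaled, labels)
print(clf.predict(scaled))
```

This only makes the classifier run. The scaled components are no longer counts, so whether the multinomial model fits them well is something to check on held-out data.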

庸人自扰

2021-02-05 15:43
Rather than converting a sparse matrix to a dense one (which is discouraged), I would use scikit-learn's TruncatedSVD, a PCA-like dimensionality reduction algorithm (it uses randomized SVD by default) that works on sparse data:
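```
from sklearn.decomposition import TruncatedSVD
# `data` is the sparse document-term matrix produced by the vectorizer
svd = TruncatedSVD(n_components=5, random_state=42)
data = svd.fit_transform(data)
```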
And, citing from the TruncatedSVD documentation:

In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).

which is exactly your use case.
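For completeness, a sketch of how this could look end to end, from raw text to predictions. The corpus, the labels, `n_components=2` and the choice of `LogisticRegression` are all assumptions for the example; any classifier that accepts real-valued features could take its place.

```python
from scipy.sparse import issparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the scraped web pages.
docs = [
    "cheap flights and hotel deals",
    "book hotel rooms and flights online",
    "football match results and league scores",
    "league table and football transfer news",
]
labels = ["travel", "travel", "sport", "sport"]

# tf-idf -> truncated SVD (LSA) -> classifier for real-valued features.
lsa_clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    LogisticRegression(),
)
lsa_clf.fit(docs, labels)

# The vectorizer output stays sparse; only the small LSA matrix is dense.
tfidf = lsa_clf.named_steps["tfidfvectorizer"].transform(docs)
lsa = lsa_clf.named_steps["truncatedsvd"].transform(tfidf)
print(issparse(tfidf), lsa.shape)
print(lsa_clf.predict(["football league news"]))
```

On a real corpus the number of components would be much larger than 2 and is worth tuning.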