Is it possible to apply PCA to any text classification?

南旧 2021-02-05 15:14

I'm trying a classification with Python. I'm using the Naive Bayes MultinomialNB classifier on web pages (retrieving data from the web as text, and later classifying this text: web c
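
A minimal sketch of the baseline setup described above, with a toy corpus standing in for the retrieved page text (the documents, labels, and test string are placeholders):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy documents standing in for the retrieved web-page text
pages = ["buy cheap pills online", "weekly team meeting agenda",
         "win money now click here", "lunch menu for the cafeteria"]
labels = ["spam", "ham", "spam", "ham"]

# MultinomialNB works directly on the non-negative term counts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(pages, labels)
print(model.predict(["cheap money pills"]))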

3 Answers
  • 2021-02-05 15:19

    The Naive Bayes classifier needs discrete-valued features, but PCA breaks this property of the features. You will have to use a different classifier if you want to use PCA.

    There may be other dimensionality reduction methods that work with NB, but I don't know about those. Maybe simple feature selection could work (see the sketch below).

    Side note: you could try to discretize the features after applying PCA, but I don't think this is a good idea.
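
    A minimal sketch of that feature-selection route, assuming count features from CountVectorizer; the toy corpus, labels, and k are placeholders:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # toy corpus and labels, purely illustrative
    docs = ["buy cheap pills online", "weekly team meeting agenda",
            "win money now click here", "lunch menu for the cafeteria"]
    labels = [1, 0, 1, 0]

    # chi2 keeps the k features most associated with the labels and, unlike PCA,
    # leaves the counts non-negative, so MultinomialNB still accepts them
    clf = make_pipeline(CountVectorizer(), SelectKBest(chi2, k=5), MultinomialNB())
    clf.fit(docs, labels)
    print(clf.predict(["cheap money now"]))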

  • 2021-02-05 15:23

    The problem is that by applying dimensionality reduction you will generate negative features, but Multinomial NB does not accept negative features. Please refer to this question.

    Try another classifier such as RandomForest, or use sklearn.preprocessing.MinMaxScaler() to scale your training features to [0, 1] (a sketch follows).
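
    A minimal sketch of the MinMaxScaler route, assuming a TF-IDF matrix densified for PCA (the toy corpus and n_components are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.naive_bayes import MultinomialNB

    # toy corpus and labels, purely illustrative
    docs = ["buy cheap pills online", "weekly team meeting agenda",
            "win money now click here", "lunch menu for the cafeteria"]
    labels = [1, 0, 1, 0]

    X = TfidfVectorizer().fit_transform(docs).toarray()  # PCA needs a dense array
    X_reduced = PCA(n_components=2).fit_transform(X)      # PCA output can be negative
    X_scaled = MinMaxScaler().fit_transform(X_reduced)    # rescale to [0, 1] for MultinomialNB

    clf = MultinomialNB().fit(X_scaled, labels)
    print(clf.predict(X_scaled[:2]))

    Note that new documents would need the same fitted PCA and scaler applied (and values kept within the fitted range) before prediction.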

  • 2021-02-05 15:43

    Rather than converting a sparse matrix to dense (which is discouraged), I would use scikit-learn's TruncatedSVD, which is a PCA-like dimensionality reduction algorithm (using randomized SVD by default) that works on sparse data:

    from sklearn.decomposition import TruncatedSVD

    # `data` is the sparse term-count or tf-idf matrix from the vectorizer
    svd = TruncatedSVD(n_components=5, random_state=42)
    data = svd.fit_transform(data)
    

    And, citing from the TruncatedSVD documentation:

    In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).

    which is exactly your use case.
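
    Combining this with the scaling idea from the earlier answer, a minimal end-to-end sketch (the toy corpus and n_components are placeholders; the MinMaxScaler step is only needed if you keep MultinomialNB, since SVD output can be negative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["buy cheap pills online", "weekly team meeting agenda",
            "win money now click here", "lunch menu for the cafeteria"]
    labels = [1, 0, 1, 0]

    # TruncatedSVD runs directly on the sparse tf-idf matrix (this is LSA)
    lsa_nb = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=2, random_state=42),
        MinMaxScaler(),
        MultinomialNB(),
    )
    lsa_nb.fit(docs, labels)
    print(lsa_nb.predict(docs))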
