I'm trying text classification with Python. I'm using the Naive Bayes MultinomialNB classifier for web pages (I retrieve the data from the web as text, then classify that text: web classification).
Now I'm trying to apply PCA to this data, but Python gives some errors.
My code for classification with Naive Bayes:
from sklearn.decomposition import PCA
from sklearn.decomposition import RandomizedPCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
vectorizer = CountVectorizer()
classifer = MultinomialNB(alpha=.01)
x_train = vectorizer.fit_transform(temizdata)
classifer.fit(x_train, y_train)
This Naive Bayes classification gives this output:
>>> x_train
<43x4429 sparse matrix of type '<class 'numpy.int64'>'
with 6302 stored elements in Compressed Sparse Row format>
>>> print(x_train)
(0, 2966) 1
(0, 1974) 1
(0, 3296) 1
..
..
(42, 1629) 1
(42, 2833) 1
(42, 876) 1
Then I try to apply PCA to my data (temizdata):
>>> v_temizdata = vectorizer.fit_transform(temizdata)
>>> pca_t = PCA.fit_transform(v_temizdata)
>>> pca_t = PCA().fit_transform(v_temizdata)
but this raises the following error:
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
I converted the matrix to a dense matrix (a numpy array) and then tried to classify the new dense matrix, but I got another error.
My main aim is to test the effect of PCA on text classification.
Convert to a dense array:
v_temizdatatodense = v_temizdata.todense()
pca_t = PCA().fit_transform(v_temizdatatodense)
Finally, try to classify:
classifer.fit(pca_t,y_train)
Error for the final classification:
    raise ValueError("Input X must be non-negative")
ValueError: Input X must be non-negative
In short: on one side my data (temizdata) goes into Naive Bayes directly; on the other side temizdata is first put through PCA (to reduce the number of inputs) and then classified.
__
Rather than converting a sparse matrix to dense (which is discouraged), I would use scikit-learn's TruncatedSVD, which is a PCA-like dimensionality reduction algorithm (using randomized SVD by default) that works on sparse data:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=5, random_state=42)
data = svd.fit_transform(data)
And, citing from the TruncatedSVD documentation:
In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).
which is exactly your use case.
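Putting the pieces together, here is a minimal sketch of that LSA-style pipeline on a toy corpus. The documents, labels, and choice of LogisticRegression are my own stand-ins (not from the question); the key point is that the classifier after the SVD step must tolerate negative features, which MultinomialNB does not:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus and labels standing in for `temizdata` / `y_train`.
docs = ["python web scraping", "machine learning text",
        "web page classification", "naive bayes learning"]
labels = [0, 1, 0, 1]

# Vectorize, reduce with TruncatedSVD (LSA), then classify with a
# model that accepts negative-valued features.
pipeline = make_pipeline(
    CountVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    LogisticRegression(),
)
pipeline.fit(docs, labels)
print(pipeline.predict(["web scraping python"]))
```

Because the steps are wrapped in a pipeline, the same `fit`/`predict` calls from the question keep working, with the sparse-to-dense handling done internally by TruncatedSVD.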
The NaiveBayes classifier needs discrete-valued features, but PCA breaks this property of the features. You will have to use a different classifier if you want to use PCA.
There may be other dimensionality reduction methods that work with NB, but I don't know about those. Maybe simple feature selection could work.
Side note: you could try to discretize the features after applying the PCA, but I don't think this is a good idea.
The problem is that by applying dimensionality reduction, you will generate negative features. However, Multinomial NB does not accept negative features. Please refer to this question.
Try another classifier such as RandomForest, or try using sklearn.preprocessing.MinMaxScaler() to scale your training features to [0, 1].
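A minimal sketch of the MinMaxScaler route (the toy corpus and TruncatedSVD step are my own; in real use the scaler should be fitted on the training split only):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

docs = ["python web scraping", "machine learning text",
        "web page classification", "naive bayes learning"]
y = [0, 1, 0, 1]

X = CountVectorizer().fit_transform(docs)
X_reduced = TruncatedSVD(n_components=2, random_state=42).fit_transform(X)

# Scale each reduced feature into [0, 1] so MultinomialNB no longer
# raises "Input X must be non-negative".
X_scaled = MinMaxScaler().fit_transform(X_reduced)
clf = MultinomialNB(alpha=.01).fit(X_scaled, y)
print(clf.predict(X_scaled))
```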
Source: https://stackoverflow.com/questions/34725726/is-it-possible-apply-pca-on-any-text-classification