I am learning about random forests in scikit-learn and, as an example, I would like to use RandomForestClassifier for text classification with my own dataset. So first I vectorized the text with TF-IDF, and for classification:
    from sklearn.ensemble import RandomForestClassifier

    classifier = RandomForestClassifier(n_estimators=10)
    classifier.fit(X_train, y_train)
    prediction = classifier.predict(X_test)
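For reference, the vectorization step before this is roughly the following (a sketch: tfidf_vect is a TfidfVectorizer, as in my last attempt further down, and train_texts / test_texts are placeholder names for my raw text):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # train_texts / test_texts are placeholders for the raw documents
    tfidf_vect = TfidfVectorizer()
    X_train = tfidf_vect.fit_transform(train_texts)  # returns a sparse matrix
    X_test = tfidf_vect.transform(test_texts)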
When I ran the classification I got this:
    TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
Then I used .toarray() on X_train.
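Roughly, only the fit call changed (a sketch of what I did; only X_train was converted):

    # convert only the sparse training matrix to a dense numpy array
    classifier.fit(X_train.toarray(), y_train)
    prediction = classifier.predict(X_test)

and I got the following: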
    TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
From a previous question, as I understood it, I need to reduce the dimensionality of the array, so I did that:
    from sklearn.decomposition.truncated_svd import TruncatedSVD

    pca = TruncatedSVD(n_components=300)
    X_reduced_train = pca.fit_transform(X_train)

    from sklearn.ensemble import RandomForestClassifier

    classifier = RandomForestClassifier(n_estimators=10)
    classifier.fit(X_reduced_train, y_train)
    prediction = classifier.predict(X_testing)
Then I got this exception:
File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict n_samples = len(X) File "/usr/local/lib/python2.7/site-packages/scipy/sparse/base.py", line 192, in __len__ raise TypeError("sparse matrix length is ambiguous; use getnnz()" TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
Then I tried the following:
    prediction = classifier.predict(X_train.getnnz())
And got this:
File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict n_samples = len(X) TypeError: object of type 'int' has no len()
Two questions arise from this: how can I use random forests to classify correctly, and what is happening with X_train?
Then I tried the following:
    df = pd.read_csv('/path/file.csv', header=0, sep=',', names=['id', 'text', 'label'])
    X = tfidf_vect.fit_transform(df['text'].values)
    y = df['label'].values

    from sklearn.decomposition.truncated_svd import TruncatedSVD

    pca = TruncatedSVD(n_components=2)
    X = pca.fit_transform(X)

    a_train, a_test, b_train, b_test = train_test_split(X, y, test_size=0.33, random_state=42)

    from sklearn.ensemble import RandomForestClassifier

    classifier = RandomForestClassifier(n_estimators=10)
    classifier.fit(a_train, b_train)
    prediction = classifier.predict(a_test)

    from sklearn.metrics.metrics import precision_score, recall_score, confusion_matrix, classification_report

    print '\nscore:', classifier.score(a_train, b_test)
    print '\nprecision:', precision_score(b_test, prediction)
    print '\nrecall:', recall_score(b_test, prediction)
    print '\n confusion matrix:\n', confusion_matrix(b_test, prediction)
    print '\n classification report:\n', classification_report(b_test, prediction)