import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSV
You should only call transform() on test data. Never call fit() or its variants, such as fit_transform() or fit_predict(), on test data. Those should be used only on training data.
So change the line:
Y_test = mlb.fit_transform(y_test)
to
Y_test = mlb.transform(y_test)
Explanation:
When you call fit() or fit_transform(), mlb forgets what it learned previously and learns from the newly supplied data instead. This is a problem when Y_train and Y_test differ in their labels, as they do in your case.
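A minimal sketch of this behaviour, assuming mlb is a scikit-learn MultiLabelBinarizer. The toy label lists below are made up for illustration and are not the data from the question:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy labels, not the data from the question.
y_train = [["a", "b"], ["c"], ["a", "d"]]
y_test = [["b"], ["a", "c"]]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)      # learns the classes a, b, c, d
train_classes = list(mlb.classes_)

# Correct: reuse the mapping learned from the training labels.
Y_test = mlb.transform(y_test)            # same 4 columns as Y_train

# Incorrect: refitting discards the training classes and learns new ones.
Y_test_refit = mlb.fit_transform(y_test)  # only 3 columns now
refit_classes = list(mlb.classes_)

print(train_classes, Y_test.shape)
print(refit_classes, Y_test_refit.shape)
```

After the refit, the binarizer no longer knows about label d, so its output no longer lines up with what the classifier was trained on.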
In your case, Y_train has 49 different labels, whereas Y_test has only 42. But this doesn't mean that Y_test is simply 7 labels short of Y_train. Y_test may contain an entirely different set of labels which, when binarized, happens to produce 42 columns. Those columns would not correspond to the training columns, and that will affect the results.
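To illustrate why the column count alone tells you nothing, here is a small sketch with made-up labels (again assuming a scikit-learn MultiLabelBinarizer). Both label sets binarize to three columns, but the columns mean different things. When transform() meets labels it did not see during fit(), recent scikit-learn versions emit a warning and ignore them:

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy labels, chosen so train and test only partly overlap.
y_train = [["cat", "dog"], ["bird"]]
y_test = [["dog", "fish"], ["horse"]]

mlb = MultiLabelBinarizer().fit(y_train)
train_columns = list(mlb.classes_)

# A binarizer fitted on the test labels also has 3 columns,
# but they stand for a different set of labels.
test_columns = list(MultiLabelBinarizer().fit(y_test).classes_)

with warnings.catch_warnings():
    # Silence the warning about unknown labels for this demo.
    warnings.simplefilter("ignore")
    # Columns follow train_columns; unknown labels are dropped.
    Y_test = mlb.transform(y_test)

print(train_columns)
print(test_columns)
print(Y_test)
```

With transform(), every column of Y_test keeps the meaning it had during training, which is what the classifier and the evaluation metrics expect.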