I am doing text classification based on a TF-IDF vector space model. I have no more than 3000 samples. For a fair evaluation, I'm evaluating the classifier usin
The StratifiedKFold approach that you adopted to build the TfidfVectorizer() is the right way. By doing so, you make sure that features are generated only from the training dataset.
If you instead build the TfidfVectorizer() on the whole dataset, you leak the test dataset into the model, even though you never explicitly feed the test dataset to it. Parameters such as the size of the vocabulary and the IDF value of each term in the vocabulary can differ substantially when test documents are included.
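To make the leakage concrete, here is a minimal pure-Python sketch. The toy documents and the helper `fit_idf` are made up for illustration, and the tokenization is a plain whitespace split, which is simpler than what TfidfVectorizer() does. The IDF formula is the smoothed one that scikit-learn documents for its default `smooth_idf=True` setting.

```python
import math

def fit_idf(docs):
    """Return {term: idf} using the smoothed IDF ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = {}
    for doc in docs:
        # Count each term at most once per document (document frequency).
        for term in set(doc.lower().split()):
            df[term] = df.get(term, 0) + 1
    return {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

# Toy corpus, invented for this example.
train_docs = ["cheap pills online", "meeting agenda attached", "cheap flights online"]
test_docs = ["cheap watches online", "agenda for the meeting"]

idf_train_only = fit_idf(train_docs)
idf_with_test = fit_idf(train_docs + test_docs)

# The vocabulary grows once the test documents are included.
print(len(idf_train_only), len(idf_with_test))

# The IDF of a term that was already in the training set shifts as well.
print(idf_train_only["cheap"], idf_with_test["cheap"])
```

Both the feature space and the term weights depend on which documents were used for fitting, which is why the vectorizer should only ever see the training part of each fold.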
A simpler way is to use a pipeline together with cross_validate, for example:
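from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
# The pipeline refits the vectorizer on the training part of every fold.
clf = make_pipeline(TfidfVectorizer(), svm.SVC(kernel='linear'))
scores = cross_validate(clf, data_x, data_y, scoring=['accuracy'], cv=5, return_train_score=False)
print(scores)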
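For reference, the pipeline version is roughly equivalent to the manual loop sketched below. With an integer cv and a classifier, cross_validate uses stratified folds, and in each fold a fresh copy of the pipeline is fitted on the training part only. The toy documents and labels here are invented for illustration; substitute your own data_x and data_y.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data, made up for this sketch (5 documents per class).
spam = ["cheap pills online now", "win cheap prize online", "cheap offer click now",
        "online prize click here", "win money online now"]
ham = ["meeting agenda for monday", "project report attached here",
       "agenda and report for meeting", "monday project meeting notes",
       "notes from the project meeting"]
data_x = spam + ham
data_y = [1] * len(spam) + [0] * len(ham)

pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
skf = StratifiedKFold(n_splits=5)

fold_scores = []
for train_idx, test_idx in skf.split(data_x, data_y):
    x_train = [data_x[i] for i in train_idx]
    y_train = [data_y[i] for i in train_idx]
    x_test = [data_x[i] for i in test_idx]
    y_test = [data_y[i] for i in test_idx]

    # Fresh, unfitted copy per fold: vocabulary and IDF come from x_train only.
    model = clone(pipeline)
    model.fit(x_train, y_train)
    # The held-out documents are only transformed, never used for fitting.
    fold_scores.append(model.score(x_test, y_test))

print(fold_scores)
```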
Note: It is not useful to run cross_validate on the test data alone. Run it on the [train + validation] dataset.
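One possible way to organize this is sketched below: hold out a test set first, cross-validate on the remaining [train + validation] part, and touch the test set only once at the end. The toy data, the variable names (dev_x, test_x, and so on) and the split size are assumptions chosen for the example, not something taken from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data, made up for this sketch (6 documents per class).
spam = ["cheap pills online now", "win cheap prize online", "cheap offer click now",
        "online prize click here", "win money online now", "click for cheap money"]
ham = ["meeting agenda for monday", "project report attached here",
       "agenda and report for meeting", "monday project meeting notes",
       "notes from the project meeting", "report for the monday meeting"]
data_x = spam + ham
data_y = [1] * len(spam) + [0] * len(ham)

# Hold out a small stratified test set; the rest is [train + validation].
dev_x, test_x, dev_y, test_y = train_test_split(
    data_x, data_y, test_size=2, stratify=data_y, random_state=0
)

clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))

# Cross-validate on the [train + validation] part only.
scores = cross_validate(clf, dev_x, dev_y, scoring=["accuracy"], cv=5)
print(scores["test_accuracy"].mean())

# Final check: refit on all of [train + validation], score once on the test set.
clf.fit(dev_x, dev_y)
final_accuracy = clf.score(test_x, test_y)
print(final_accuracy)
```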