Do I use the same Tfidf vocabulary in k-fold cross_validation


花落未央 2021-02-20 01:09

I am doing text classification based on a TF-IDF vector space model. I have no more than 3000 samples. For a fair evaluation, I'm evaluating the classifier usin

1 answer

谎友^ (OP)

2021-02-20 02:00
Fitting the TfidfVectorizer() inside each StratifiedKFold split, as you have done, is the right approach: it ensures that the features are derived only from the training data of each fold.
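For reference, a minimal sketch of that per-fold approach might look like the following. The toy corpus, the labels, and the variable names are assumptions made up for illustration; they are not taken from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Hypothetical toy corpus: 1 = spam, 0 = not spam.
texts = np.array([
    "cheap pills buy now", "win money fast now", "free prize claim now",
    "buy cheap watches online", "claim your free money",
    "meeting moved to monday", "please review the report",
    "lunch at noon tomorrow", "project report due monday",
    "see you at the meeting",
])
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = []
for train_idx, test_idx in skf.split(texts, labels):
    # Fit the vectorizer on the training fold only.
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(texts[train_idx])
    # Reuse the training vocabulary and IDF weights for the held-out fold.
    X_test = vectorizer.transform(texts[test_idx])

    clf = SVC(kernel="linear")
    clf.fit(X_train, labels[train_idx])
    fold_accuracies.append(clf.score(X_test, labels[test_idx]))

print(fold_accuracies)
```

The key point is that `fit_transform` is only ever called on the training indices, while the held-out fold goes through `transform`.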

If you fit the TfidfVectorizer() on the whole dataset, you leak information from the test data into the model, even though you never feed the test documents to the classifier explicitly. Parameters such as the size of the vocabulary and the IDF value of each term can differ considerably when the test documents are included.
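A small sketch can make the leak concrete. The documents below are invented for illustration; the code compares a vectorizer fitted on training documents only with one fitted on training plus test documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, for illustration only.
train_docs = ["the cat sat", "the dog barked", "the cat purred"]
test_docs = ["the parrot sang", "a dog sang"]

train_only = TfidfVectorizer().fit(train_docs)
full = TfidfVectorizer().fit(train_docs + test_docs)

# Terms that enter the vocabulary only because test documents were seen.
leaked_terms = set(full.vocabulary_) - set(train_only.vocabulary_)
print("vocabulary sizes:", len(train_only.vocabulary_), len(full.vocabulary_))
print("terms leaked from the test documents:", sorted(leaked_terms))

# The IDF of a term that appears only in training documents also shifts,
# because the total document count changes.
idf_cat_train = train_only.idf_[train_only.vocabulary_["cat"]]
idf_cat_full = full.idf_[full.vocabulary_["cat"]]
print("IDF of 'cat':", idf_cat_train, "vs", idf_cat_full)
```

Both the vocabulary and the IDF weights change, even for a term like "cat" that never occurs in the test documents.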

A simpler way is to use a pipeline together with cross_validate.

Use this!
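```
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# The pipeline refits the vectorizer on the training part of every fold.
clf = make_pipeline(TfidfVectorizer(), svm.SVC(kernel='linear'))

# data_x: raw text documents, data_y: their labels
scores = cross_validate(clf, data_x, data_y, scoring=['accuracy'], cv=5, return_train_score=False)
print(scores)
```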
Note: It is not useful to run cross_validate on the test data alone. Run it on the [train + validation] dataset.
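
One way to follow this note, sketched under assumptions: hold out a test set first, cross-validate on the remaining [train + validation] portion, and touch the test set only once at the end. The corpus, the split sizes, and names such as `X_dev` and `held_out_accuracy` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: 1 = spam, 0 = not spam.
texts = [
    "cheap pills buy now", "win money fast now", "free prize claim now",
    "buy cheap watches online", "claim your free money",
    "fast cash win now", "free pills online", "buy prize money now",
    "meeting moved to monday", "please review the report",
    "lunch at noon tomorrow", "project report due monday",
    "see you at the meeting", "review the project tomorrow",
    "report for the monday meeting", "lunch meeting at noon",
]
labels = [1] * 8 + [0] * 8

# Set the test set aside before any cross-validation.
X_dev, X_test, y_dev, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))

# Cross-validate on the [train + validation] portion only.
cv_scores = cross_validate(clf, X_dev, y_dev, scoring=["accuracy"], cv=3)
print(cv_scores["test_accuracy"])

# Final check: refit on all development data, score once on the test set.
clf.fit(X_dev, y_dev)
held_out_accuracy = clf.score(X_test, y_test)
print(held_out_accuracy)
```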