Do I use the same TF-IDF vocabulary in k-fold cross-validation?

花落未央 · 2021-02-20 01:09

I am doing text classification based on the TF-IDF vector space model. I have no more than 3000 samples. For a fair evaluation, I'm evaluating the classifier using stratified k-fold cross-validation, building the TfidfVectorizer() inside each fold. Should I build the TF-IDF vocabulary from the training folds only, or use the same vocabulary built from the whole dataset?

1 Answer
    谎友^ · 2021-02-20 02:00

    The StratifiedKFold approach you adopted, building the TfidfVectorizer() inside each fold, is the right way: it ensures that features are generated only from the training data.

    If you instead built the TfidfVectorizer() on the whole dataset, you would be leaking the test data into the model, even though the test documents are never fed to the classifier explicitly. Parameters such as the vocabulary size and the IDF value of each term would differ considerably once the test documents are included.
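    If you want to see explicitly what happens in each fold, a minimal sketch of that fold-wise approach could look like this (assuming data_x is a list of raw documents and data_y the matching labels, as in the snippet further below):

    from sklearn import svm
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import StratifiedKFold

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(data_x, data_y):
        # Fit the vocabulary and IDF weights on the training fold only ...
        vec = TfidfVectorizer()
        x_train = vec.fit_transform([data_x[i] for i in train_idx])
        # ... and only transform the held-out fold with that fitted vocabulary.
        x_test = vec.transform([data_x[i] for i in test_idx])
        y_train = [data_y[i] for i in train_idx]
        y_test = [data_y[i] for i in test_idx]

        model = svm.SVC(kernel='linear').fit(x_train, y_train)
        print(accuracy_score(y_test, model.predict(x_test)))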

    A simpler way is to use a Pipeline together with cross_validate, which does the same fold-wise fitting for you.

    Use this!

    from sklearn import svm
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_validate
    from sklearn.pipeline import make_pipeline

    # The pipeline re-fits the vectorizer on the training folds of each split,
    # so the vocabulary and IDF weights never see the held-out fold.
    clf = make_pipeline(TfidfVectorizer(), svm.SVC(kernel='linear'))

    scores = cross_validate(clf, data_x, data_y, scoring=['accuracy'], cv=5, return_train_score=False)
    print(scores)
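
    The returned scores is a dictionary; with scoring=['accuracy'] the per-fold results are stored under the key 'test_accuracy', so you can summarise them like this:

    import numpy as np
    print(np.mean(scores['test_accuracy']))  # mean cross-validated accuracy over the 5 folds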
    

    Note: it is not useful to run cross_validate on the test data alone; run it on the [train + validation] portion of the data.
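
    For example (a minimal sketch, assuming you hold out a separate test set with train_test_split and reuse the clf pipeline and cross_validate import from the snippet above; the variable names are illustrative):

    from sklearn.model_selection import train_test_split

    # Keep a held-out test set that never enters cross-validation.
    x_trainval, x_test, y_trainval, y_test = train_test_split(
        data_x, data_y, test_size=0.2, stratify=data_y, random_state=0)

    # Cross-validate on the [train + validation] portion only ...
    scores = cross_validate(clf, x_trainval, y_trainval, scoring=['accuracy'], cv=5)

    # ... then refit on all of it and evaluate once on the untouched test set.
    clf.fit(x_trainval, y_trainval)
    print(clf.score(x_test, y_test))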
