Question
I am doing text classification based on a TF-IDF vector space model. I have no more than 3000 samples. For a fair evaluation, I'm evaluating the classifier using 5-fold cross-validation. What confuses me is whether it is necessary to rebuild the TF-IDF vector space model in each fold of the cross-validation. Namely, do I need to rebuild the vocabulary and recalculate the IDF value of each term in the vocabulary in each fold?
Currently I'm doing the TF-IDF transformation with scikit-learn and training my classifier using an SVM. My method is as follows: first, I split the samples at a ratio of 3:1, and 75 percent of them are used to fit the parameters of the TF-IDF vector space model. Here, the parameters are the size of the vocabulary, the terms it contains, and the IDF value of each term. Then I transform the remaining 25 percent into this TF-IDF vector space and use these vectors for 5-fold cross-validation (notably, I don't transform the first 75 percent of the samples).
My code is as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.svm import SVC

# train/test split; the train part is used only to fit TfidfVectorizer()
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, train_size=0.75, random_state=0)
tfidf = TfidfVectorizer()
tfidf.fit(x_train)
# vectorize the held-out 25 percent and run 5-fold cross-validation on it
x_test = tfidf.transform(x_test)
scoring = ['accuracy']
clf = SVC(kernel='linear')
scores = cross_validate(clf, x_test, y_test, scoring=scoring, cv=5, return_train_score=False)
print(scores)
My question is whether my way of doing the TF-IDF transformation and 5-fold cross-validation is correct, or whether it is necessary to rebuild the TF-IDF vector space model from the training data in each fold and then transform both the training and the test data into TF-IDF vectors, as follows:
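from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# data_x and data_y are assumed to be NumPy arrays so that index arrays work
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in skf.split(data_x, data_y):
    x_train, x_test = data_x[train_index], data_x[test_index]
    y_train, y_test = data_y[train_index], data_y[test_index]
    # fit the vocabulary and IDF values on this fold's training part only
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)
    clf = SVC(kernel='linear')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    score = accuracy_score(y_test, y_pred)
    print(score)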
Answer 1:
The StratifiedKFold approach, in which you build the TfidfVectorizer() inside each fold, is the right way: by doing so you make sure that the features are generated only from the training data of that fold.
If you build the TfidfVectorizer() on the whole dataset, you leak information from the test data into the model, even though you never feed the test documents to the classifier explicitly. Parameters such as the size of the vocabulary and the IDF value of each term can differ noticeably when the test documents are included.
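To make the leakage concrete, here is a small standard-library sketch. It assumes whitespace tokenization and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, which mirrors the documented default of scikit-learn's TF-IDF but is not its actual implementation. The function `fit_idf` and the toy documents are made up for illustration.

```python
import math

def fit_idf(docs):
    """Return {term: idf} using the smoothed formula ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = {}
    for doc in docs:
        # count each term once per document (document frequency)
        for term in set(doc.lower().split()):
            df[term] = df.get(term, 0) + 1
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

# Hypothetical toy data, not from the question.
train_docs = ["cheap pills online", "meeting agenda attached", "cheap flights online"]
test_docs = ["lottery winner online", "agenda for lottery meeting"]

idf_train = fit_idf(train_docs)             # fitted on training documents only
idf_all = fit_idf(train_docs + test_docs)   # fitted on everything (leaky)

# Terms that enter the vocabulary only because test documents were seen.
leaked_terms = sorted(set(idf_all) - set(idf_train))
print(leaked_terms)

# The IDF of a shared term also shifts once test documents are counted.
print(idf_train["online"], idf_all["online"])
```

Both the vocabulary and the weight of an existing term change once the test documents are included, which is exactly the information the classifier should not have during evaluation.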
A simpler way is to use a pipeline together with cross_validate, which refits the vectorizer on the training part of each fold:
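from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

clf = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))
scores = cross_validate(clf, data_x, data_y, scoring=['accuracy'], cv=5, return_train_score=False)
print(scores)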
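One way to check that the pipeline really refits the vectorizer in every fold is to ask cross_validate to return the fitted estimators and inspect each fold's vocabulary. The sketch below assumes a made-up toy corpus and 3 folds instead of 5 so that it stays small; `tfidfvectorizer` is the step name that make_pipeline derives from the class name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: 6 documents per class.
data_x = [
    "cheap pills online", "win money now", "free lottery prize",
    "claim your free prize", "cheap loans approved", "win a free cruise",
    "meeting agenda attached", "lunch at noon tomorrow", "project report draft",
    "see you at the meeting", "quarterly budget review", "notes from the call",
]
data_y = [1] * 6 + [0] * 6

pipe = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
results = cross_validate(
    pipe, data_x, data_y, scoring=["accuracy"], cv=cv, return_estimator=True
)

# Each returned pipeline was fitted on one fold's training documents only,
# so each carries its own vocabulary.
vocab_sizes = [
    len(est.named_steps["tfidfvectorizer"].vocabulary_)
    for est in results["estimator"]
]
full_vocab_size = len(TfidfVectorizer().fit(data_x).vocabulary_)

print(vocab_sizes, full_vocab_size)
print(results["test_accuracy"])
```

No per-fold vocabulary can be larger than the one fitted on the full corpus, because every fold sees only a subset of the documents.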
Note: It is not useful to run cross_validate on the test data alone. It should be run on the [train + validation] dataset.
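A minimal sketch of that workflow, under stated assumptions: the toy corpus, the 75/25 split, and the 3 folds are illustrative choices, and the names `x_dev` and `y_dev` for the [train + validation] part are made up here. Cross-validation runs on the development part only, and the held-out test part is scored once at the end.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: 8 documents per class.
data_x = [
    "cheap pills online", "win money now", "free lottery prize",
    "claim your free prize", "cheap loans approved", "win a free cruise",
    "free money online", "lottery prize winner",
    "meeting agenda attached", "lunch at noon tomorrow", "project report draft",
    "see you at the meeting", "quarterly budget review", "notes from the call",
    "agenda for the review", "draft of the budget report",
]
data_y = [1] * 8 + [0] * 8

# Keep 25 percent aside as the final test set; the rest is [train + validation].
x_dev, x_test, y_dev, y_test = train_test_split(
    data_x, data_y, test_size=0.25, stratify=data_y, random_state=0
)

pipe = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))

# Cross-validate on the development part; TF-IDF is refitted inside each fold.
cv_scores = cross_validate(pipe, x_dev, y_dev, scoring=["accuracy"], cv=3)
print(cv_scores["test_accuracy"].mean())

# Refit on the whole development part, then score the untouched test set once.
pipe.fit(x_dev, y_dev)
test_accuracy = pipe.score(x_test, y_test)
print(test_accuracy)
```

With only a few thousand samples, as in the question, it is also reasonable to skip the separate test set and report the cross-validation scores of the pipeline on all of the data.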
Source: https://stackoverflow.com/questions/46010617/do-i-use-the-same-tfidf-vocabulary-in-k-fold-cross-validation