Question
When I want to evaluate my model with cross-validation, should I perform cross-validation on the original data (data that's not split into train and test) or on the train/test data?
I know that training data is used for fitting the model and testing data for evaluating it. If I use cross-validation, should I still split the data into train and test, or not?
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

features = df.iloc[:, 4:-1]
results = df.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
accuracy_test = cross_val_score(clf, x_test, y_test, cv=5)
Or should I do it like this:
features = df.iloc[:, 4:-1]
results = df.iloc[:, -1]
clf = LogisticRegression()
model = clf.fit(features, results)
accuracy_test = cross_val_score(clf, features, results, cv=5)
Or maybe something different?
Answer 1:
Both your approaches are wrong.
- In the first one, you apply cross-validation to the test set, which is meaningless.
- In the second one, you first fit the model with your whole data and then perform cross-validation, which is again meaningless. Moreover, the approach is redundant: your fitted clf is not used by the cross_val_score method, which does its own fitting.
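To see why the prior fit is irrelevant: cross_val_score internally works on fresh clones of the estimator, one per fold. A minimal illustration of that cloning behavior, using scikit-learn's own clone utility:
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(features, results)
fresh_clf = clone(clf)  # an unfitted copy with the same hyperparameters;
                        # this is what cross_val_score actually trains per fold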
Since you are not doing any hyperparameter tuning (i.e. you seem to be interested only in performance assessment), there are two ways:
- Either with a separate test set
- Or with cross validation
First way (test set):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# hold out 30% of the data as a test set
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
clf = LogisticRegression()
model = clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy_test = accuracy_score(y_test, y_pred)
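Equivalently, since scikit-learn classifiers expose a score method that defaults to accuracy, the last two lines can be collapsed into:
accuracy_test = model.score(x_test, y_test)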
Second way (cross validation):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle

clf = LogisticRegression()
# shuffle the data first, since cross_val_score does not shuffle by default:
features_s, results_s = shuffle(features, results)
accuracy_cv = cross_val_score(clf, features_s, results_s, cv=5, scoring='accuracy')
# fit the model afterwards with the whole data, if satisfied with the performance:
model = clf.fit(features, results)
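Note that cross_val_score returns one score per fold (here an array of 5 values), which is usually summarized as a mean and standard deviation:
print("CV accuracy: %.3f +/- %.3f" % (accuracy_cv.mean(), accuracy_cv.std()))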
Answer 2:
I will try to summarize the "best practice" here:
1) If you want to train your model, fine-tune parameters, and do a final evaluation, I recommend splitting your data into training | val | test. You fit your model on the training part, then check different parameter combinations on the val part. Finally, when you're sure which classifier/parameter combination obtains the best result on the val part, you evaluate on the test part to get the final result. Once you evaluate on the test part, you shouldn't change the parameters any more.
2) On the other hand, some people follow another way: they split their data into training and test, fine-tune their model using cross-validation on the training part, and at the end evaluate it on the test part.
If your data is quite large, I recommend the first way, but if your data is small, the second. A minimal sketch of both workflows is shown below.
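As a rough sketch of the two workflows (assuming the same features and results as above, and a small hypothetical grid of C values purely for illustration):

Workflow 1 (training | val | test):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# carve off the test part first, then split the remainder into training and val
x_tmp, x_test, y_tmp, y_test = train_test_split(features, results, test_size=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x_tmp, y_tmp, test_size=0.25, random_state=0)

best_score, best_C = -1.0, None
for C in [0.01, 0.1, 1, 10]:  # hypothetical parameter grid
    score = LogisticRegression(C=C).fit(x_train, y_train).score(x_val, y_val)
    if score > best_score:
        best_score, best_C = score, C

# one-time final evaluation on the untouched test part
final_model = LogisticRegression(C=best_C).fit(x_train, y_train)
print(final_model.score(x_test, y_test))

Workflow 2 (cross-validation on the training part):
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
search = GridSearchCV(LogisticRegression(), {'C': [0.01, 0.1, 1, 10]}, cv=5, scoring='accuracy')
search.fit(x_train, y_train)         # cross-validation happens here, on the training part only
print(search.score(x_test, y_test))  # evaluate the refit best model once on the test part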
Source: https://stackoverflow.com/questions/60761775/should-cross-validation-score-be-performed-on-original-or-split-data