Does the training set and testing set have to be different from the predicting set?

问题

I know the general rule that we should test a trained classifier only on the testing set.

But now comes the question: When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set? Or do I have to apply it to a new predicting set that is different from the training+testing set?

And what if I predict a label column of a time series (edited later: I do not mean to create a classical time series analysis here, but just a broad selection of columns from a typical database, weekly, monthly or randomly stored columns), do I have to shift all of the features (not just the past columns of the time series label column, but also all other normal features) of the training+testing set back to a point in time where the data has no "knowledge" interception with the predicting set?

I would then train and test the classifier on features shifted to the past by n months, scoring against a label column that is unshifted and most recent, and then predicting from most recent, unshifted features. Shifted and unshifted features have the same number of columns, I align shifted and unshifted features by assigning the column names of the shifted features to the unshifted features.

回答1:

I think you are mixing up some concepts, so I will try to give a general explanation for Supervised Learning.

The training set is what your algorithm LEARNS on. You split it in X (features) and Y (target variable).
The test set is a set that you use to SCORE your model, and it must contain data that was not in the training set. This means that a test set also has X and Y (meaning that you know the value of the target). What happens is that you PREDICT f(Y) based on X, and compare it with the Y you have, and see how good your predictions are
A prediction set is simply new data! This means that usually you DO NOT have a target, since the whole point of supervised learning is predicting it. You will only have your X (features) and you will predict f(X) (your estimate of the target Y) and use it for whatever you need.

So, in the end a test set is simply a prediction set for which you have a target to compare your estimation to.

For time series, it is a bit more complicated, because often the features (X) are transformations on past data of the target variable (Y). For example, if you want to predict today's SP500 price, you might want to use the average of the last 30 days as a feature. This means that for every new day, you need to recompute this feature over the past days.
In general though, I would suggest starting with NON time series data if you're new to ML, as Time Series is much harder in terms of feature engineering and data management and it is easy to make mistakes.

回答2:

Answering myself after half a year here. The first answer was a slight misunderstanding about the term "time series" which I had caused with an unclear question (edited).

The question above When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set? has the simple answer: no.

The question above Do I have to shift all of the features has the simple answer: yes.

In short, if I predict a month's class column: I have to shift all of the non-class columns also back in time in addition to the previous class months I converted to features, all data must have been known before the month in that the class is predicted.

来源：https://stackoverflow.com/questions/59210109/does-the-training-set-and-testing-set-have-to-be-different-from-the-predicting-s

标签

python

machine-learning

classification

data-modeling

predict