问题
I am comparing the performance of two programs about KerasRegressor
using Scikit-Learn StandardScaler
: one program with Scikit-Learn Pipeline
and one program without the Pipeline
.
Program 1:
estimators = []
estimators.append(('standardise', StandardScaler()))
estimators.append(('multiLayerPerceptron', KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)))
pipeline = Pipeline(estimators)
log = pipeline.fit(X_train, Y_train)
Y_deep = pipeline.predict(X_test)
Program 2:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.fit_transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
My problem is that Program 1 can achieve R2 score as 0.98 (3 trials on average) while Program 2 only achieve R2 score as 0.84 (3 trials on average.) Can anyone explain the difference between these two programs?
回答1:
In the second case, you are calling StandardScaler.fit_transform()
on both X_train
and X_test
. Its wrong usage.
You should call fit_transform()
on X_train
and then call only transform()
on the X_test
. Because thats what the Pipeline
does.
The Pipeline as the documentation states, will:
fit():
Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator
predict():
Apply transforms to the data, and predict with the final estimator
So you see, it will only apply transform()
to the test data, not fit_transform()
.
So elaborate my point, your code should be:
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
#This is the change
X_test = scale.transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)
Calling fit()
or fit_transform()
on test data wrongly scales it to a different scale than what was used on train data. And is a source of change in prediction.
Edit: To answer the question in comment:
See, fit_transform()
is just a shortcut function for doing fit()
and then transform()
. For StandardScaler
, fit()
doesnt return anything, just learns the mean and standard deviation of data. And then transform()
applies the learning on the data to return new scaled data.
So what you are saying leads to below two scenarios:
Scenario 1: Wrong
1) X_scaled = scaler.fit_transform(X)
2) Divide the X_scaled into X_scaled_train, X_scaled_test and run your model.
No need to scale again.
Scenario 2: Wrong (Basically equal to Scenario 1, reversing the scaling and spitting operations)
1) Divide the X into X_train, X_test
2) scale.fit_transform(X) [# You are not using the returned value, only fitting the data, so equivalent to scale.fit(X)]
3.a) X_train_scaled = scale.transform(X_train) #[Equals X_scaled_train in scenario 1]
3.b) X_test_scaled = scale.transform(X_test) #[Equals X_scaled_test in scenario 1]
You can try any of the scenario and maybe it will increase the performance of your model.
But there is one very important thing which is missing in them. When you do scaling on the whole data and then divide them into train and test, it is assumed that you know the test (unseen) data, which will not be true in real world cases. And will give you results which will not be according to real world results. Because in the real world, whole of the data will be our training data. It may also lead to over-fitting because the model has some information about the test data already.
So when evaluating the performance of machine learning models, it is recommended that you keep aside the test data before performing any operations on it. Because it is our unseen data, we know nothing about it. So ideal path of operations would be the one I answered, ie.:
1) Divide X into X_train and X_test (same for y)
2) X_train_scaled = scale.fit_transform(X_train) [#Learn the mean and SD of train data]
3) X_test_scaled = scale.transform(X_test) [#Use the mean and SD learned in step2 to convert test data]
4) Use the X_train_scaled for training the model and X_test_scaled in evaluation.
Hope it makes sense to you.
来源:https://stackoverflow.com/questions/43816718/keras-regression-using-scikit-learn-standardscaler-with-pipeline-and-without-pip