Question
After a lot of reading and inspecting the pipeline.fit() operation under different verbose
param settings, I'm still confused why a pipeline of mine visits a certain step's transform
method so many times.
Below is a trivial example pipeline, fit with GridSearchCV using 3-fold cross-validation, but with a param grid containing only one set of hyperparams, so I expected three runs through the pipeline. Both step1 and step2 have fit called three times, as expected, but each step has transform called several more times. Why is this? Minimal code example and log output below.
# library imports
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.pipeline import Pipeline

# Load toy data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='y')

# Define a couple of trivial pipeline steps
class mult_everything_by(TransformerMixin, BaseEstimator):
    def __init__(self, multiplier=2):
        self.multiplier = multiplier

    def fit(self, X, y=None):
        print("Fitting step 1")
        return self

    def transform(self, X, y=None):
        print("Transforming step 1")
        return X * self.multiplier

class do_nothing(TransformerMixin, BaseEstimator):
    def __init__(self, meaningless_param='hello'):
        self.meaningless_param = meaningless_param

    def fit(self, X, y=None):
        print("Fitting step 2")
        return self

    def transform(self, X, y=None):
        print("Transforming step 2")
        return X

# Define the steps in our Pipeline
pipeline_steps = [('step1', mult_everything_by()),
                  ('step2', do_nothing()),
                  ('classifier', LogisticRegression()),
                  ]

pipeline = Pipeline(pipeline_steps)

# To keep this example super minimal, this param grid only has one set
# of hyperparams, so we are only fitting one type of model
param_grid = {'step1__multiplier': [2],             # , 3],
              'step2__meaningless_param': ['hello']  # , 'howdy', 'goodbye']
              }

# Define model-search process/object
# (fit one model, 3 fits due to 3-fold cross-validation)
cv_model_search = GridSearchCV(pipeline,
                               param_grid,
                               cv=KFold(3),
                               refit=False,
                               verbose=0)

# Fit all (1) models defined in our model-search object
cv_model_search.fit(X, y)
Output:
Fitting step 1
Transforming step 1
Fitting step 2
Transforming step 2
Transforming step 1
Transforming step 2
Transforming step 1
Transforming step 2
Fitting step 1
Transforming step 1
Fitting step 2
Transforming step 2
Transforming step 1
Transforming step 2
Transforming step 1
Transforming step 2
Fitting step 1
Transforming step 1
Fitting step 2
Transforming step 2
Transforming step 1
Transforming step 2
Transforming step 1
Transforming step 2
Answer 1:
Because you have used GridSearchCV with cv=KFold(3), which cross-validates your model. Here's what happens:
- It will split the data into two parts: train and test.
- For the train fold, it will fit and transform each step of the pipeline (excluding the last one, which is the classifier). That's why you are seeing Fitting step 1, Transforming step 1, Fitting step 2, Transforming step 2.
- It will then fit the classifier on the transformed train data (this is not printed in your output).
- Now comes the scoring part. Here we don't want to re-fit the steps again; we use what was learnt during the previous fitting, so each step of the pipeline only calls transform(). That's the reason for the extra Transforming step 1, Transforming step 2 lines.
- They appear twice per fold because GridSearchCV's default behaviour (in the scikit-learn versions current when this was asked) is to compute the score on both the training and the test data. This behaviour is governed by return_train_score; if you set return_train_score=False you will only see them once per fold. (In newer scikit-learn releases return_train_score already defaults to False.)
- The transformed test data is then used to predict the output from the classifier. (Again, no fitting on the test fold, only transforming and predicting.)
- The predicted values are compared with the actual values to score the model.
- Steps 1-6 will be repeated 3 times (KFold(3)); a hand-written sketch of this per-fold flow follows this list.
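For illustration only, here is a rough, hand-written version of that per-fold loop, reusing pipeline, X, and y from the question (the names fold_scores and train_score exist only in this sketch; GridSearchCV's internals differ in detail):

# A rough, hand-written version of the per-fold loop described above
# (assumes pipeline, X, y from the question). Illustrative only; this is
# not GridSearchCV's actual implementation.
fold_scores = []
for train_idx, test_idx in KFold(3).split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Fitting the pipeline: each transformer prints one "Fitting ..." and
    # one "Transforming ..." line, then the classifier is fit on the result
    pipeline.fit(X_train, y_train)

    # Scoring the train fold: a transform-only pass through step1 and step2
    train_score = pipeline.score(X_train, y_train)

    # Scoring the test fold: another transform-only pass
    fold_scores.append(pipeline.score(X_test, y_test))

Per fold this prints one Fitting/Transforming pair per step, followed by two more transform-only passes, which matches the log in the question.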
. Now have a look at your params:
param_grid = {'step1__multiplier': [2],             # , 3],
              'step2__meaningless_param': ['hello']  # , 'howdy', 'goodbye']
              }
When expanded, it yields only a single combination, i.e.:
Combination1: 'step1__multiplier'=2, 'step2__meaningless_param' = 'hello'
If you had provided more options (the ones you have commented out), more combinations would be possible, like:
Combination1: 'step1__multiplier'=2, 'step2__meaningless_param' = 'hello'
Combination2: 'step1__multiplier'=3, 'step2__meaningless_param' = 'hello'
Combination3: 'step1__multiplier'=2, 'step2__meaningless_param' = 'howdy'
and so on (a small expansion sketch follows).
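As an aside, here is a minimal sketch (not from the original answer) of how the commented-out options would expand, using scikit-learn's ParameterGrid, which mirrors how GridSearchCV enumerates combinations; wider_grid is just an illustrative name:

# Expanding the wider grid (with the commented-out options restored) to see
# every hyperparameter combination GridSearchCV would try.
from sklearn.model_selection import ParameterGrid

wider_grid = {'step1__multiplier': [2, 3],
              'step2__meaningless_param': ['hello', 'howdy', 'goodbye']}

for i, combo in enumerate(ParameterGrid(wider_grid), start=1):
    print("Combination%d: %s" % (i, combo))
# 2 multipliers x 3 params -> 6 combinations, each cross-validated separately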
Steps 1-7 will be repeated for each possible combination.
- The combination which gave the highest average score on the test folds of the cross-validation will be chosen to finally fit the model on the complete data (no division into train and test).
But you have kept refit=False, so the model will not be fitted again. Otherwise you would have seen one more block of output:
Fitting step 1
Transforming step 1
Fitting step 2
Transforming step 2
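If you want to see that extra refit, here is a minimal sketch (assuming the pipeline, param_grid, X, and y defined in the question; refit_search is just an illustrative name):

# Re-running the search with refit=True: after cross-validation, the best
# combination is fit once more on the complete data, producing one extra
# "Fitting step 1 ... Transforming step 2" block.
refit_search = GridSearchCV(pipeline, param_grid, cv=KFold(3), refit=True, verbose=0)
refit_search.fit(X, y)

print(refit_search.best_params_)
# Predicting with the refitted pipeline transforms the new data through
# step1 and step2 again before the classifier predicts.
print(refit_search.predict(X[:5]))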
Hope this clears things up. Feel free to ask for more info.
Source: https://stackoverflow.com/questions/47062970/why-does-sklearn-pipeline-call-transform-so-many-more-times-than-fit