Question
I'm working on a small project where I'm trying to apply SMOTE (Synthetic Minority Over-sampling Technique) because my data is imbalanced.
I created a custom TransformerMixin class that wraps the SMOTE function:
from imblearn.over_sampling import SMOTE
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class smote(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        print(X.shape, ' ', type(X))  # (57, 28) <class 'numpy.ndarray'>
        print(len(y), ' ', type(y))   # 57 <class 'list'>
        smote = SMOTE(kind='regular', n_jobs=-1)
        X, y = smote.fit_sample(X, y)
        return X

    def transform(self, X):
        return X

model = Pipeline([
    ('posFeat1', featureVECTOR()),
    ('sca1', StandardScaler()),
    ('smote', smote()),
    ('classification', SGDClassifier(loss='hinge', max_iter=1, random_state=38, tol=None))
])
model.fit(train_df, train_df['label'].values.tolist())
predicted = model.predict(test_df)
I put the SMOTE call in the fit() method because I don't want it to be applied to the test data.
Unfortunately, I got this error:
    model.fit(train_df, train_df['label'].values.tolist())
  File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 248, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 213, in _fit
    **fit_params_steps[name])
  File "C:\Python35\lib\site-packages\sklearn\externals\joblib\memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Python35\lib\site-packages\sklearn\base.py", line 520, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
AttributeError: 'numpy.ndarray' object has no attribute 'transform'
Answer 1:
The fit() method should return self, not the transformed values. If you need this behaviour only for the training data and not for the test data, then implement the fit_transform() method:
class smote(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        print(X.shape, ' ', type(X))  # (57, 28) <class 'numpy.ndarray'>
        print(len(y), ' ', type(y))   # 57 <class 'list'>
        self.smote = SMOTE(kind='regular', n_jobs=-1).fit(X, y)
        return self

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.smote.sample(X, y)

    def transform(self, X):
        return X
Explanation: On the training data (i.e. when pipeline.fit() is called), Pipeline first tries to call fit_transform() on each intermediate step. If that method is not found, it calls fit() and transform() separately.
On the test data, only transform() is called for each intermediate step, so the test data you supply is left unchanged.
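The call order described above can be checked with a minimal sketch. CallRecorder and the calls list are hypothetical names invented for this illustration, and LogisticRegression is only a stand-in final estimator; the step passes X through unchanged and simply logs which method the scikit-learn Pipeline invokes on it.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

calls = []  # module-level log of method names, in call order


class CallRecorder(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through step that records how Pipeline calls it."""

    def fit(self, X, y=None):
        calls.append("fit")
        return self  # fit() returns the estimator, never the data

    def fit_transform(self, X, y=None):
        calls.append("fit_transform")
        return X

    def transform(self, X):
        calls.append("transform")
        return X


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

model = Pipeline([("recorder", CallRecorder()), ("clf", LogisticRegression())])

model.fit(X, y)
calls_during_fit = list(calls)      # what Pipeline called while fitting

calls.clear()
model.predict(X)
calls_during_predict = list(calls)  # what Pipeline called while predicting

print(calls_during_fit, calls_during_predict)
```

Because the class defines its own fit_transform(), that is the only method logged during fitting, while predict() reaches the step only through transform().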
Update: The above code will still throw an error.
When you oversample the supplied data, the number of samples in both X and y changes. But the pipeline only passes the transformed X on to the next step; it does not change y. So even if I correct the above error, you will get an error about the number of samples not matching the number of labels. And if the number of returned samples happens to equal the original number, the y values will still not correspond to the new samples.
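The mismatch can be reproduced without SMOTE. In this sketch, RowDuplicator is a hypothetical stand-in for an oversampler (it just stacks X on top of itself during fitting), and LogisticRegression is an arbitrary final estimator chosen for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class RowDuplicator(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for an oversampler: doubles the rows of X on fit."""

    def fit(self, X, y=None):
        return self

    def fit_transform(self, X, y=None):
        # More rows come out than went in; Pipeline has no way to receive a new y.
        return np.vstack([X, X])

    def transform(self, X):
        return X  # test data passes through untouched


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

model = Pipeline([("dup", RowDuplicator()), ("clf", LogisticRegression())])

try:
    model.fit(X, y)
    error_message = None
except ValueError as exc:
    # The classifier is handed 8 rows of X but still only the original 4 labels.
    error_message = str(exc)

print(error_message)
```

The final estimator rejects the data because a scikit-learn Pipeline step can only return a new X, which is why resampling needs a pipeline that also propagates the resampled y.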
Working solution: Silly me.
You can just use the Pipeline from the imblearn package in place of the scikit-learn Pipeline. It automatically re-samples when fit() is called on the pipeline, and does not re-sample the test data (when transform() or predict() is called).
Actually, I knew that imblearn's Pipeline handles the sample() method, but I was thrown off when you implemented a custom class and said that the test data must not change. It did not occur to me that that's the default behaviour.
Just replace
from sklearn.pipeline import Pipeline
with
from imblearn.pipeline import Pipeline
and you are all set. There is no need for a custom class like yours; just use the original SMOTE class. Something like:
from imblearn.over_sampling import SMOTE

random_state = 38
model = Pipeline([
    ('posFeat1', featureVECTOR()),
    ('sca1', StandardScaler()),
    # Original SMOTE class
    ('smote', SMOTE(random_state=random_state)),
    ('classification', SGDClassifier(loss='hinge', max_iter=1, random_state=random_state, tol=None))
])
Source: https://stackoverflow.com/questions/49770851/customized-transformermixin-with-data-labels-in-sklearn