I want to create a Pipeline in Scikit-Learn with a specific step being outlier detection and removal, allowing the transformed data to be passed to other transformers and estima
Yes. Subclass TransformerMixin and build a custom transformer. Here is one that extends an existing outlier detection method, LocalOutlierFactor:
import numpy as np
from sklearn.base import TransformerMixin
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import Pipeline


class OutlierExtractor(TransformerMixin):
    def __init__(self, **kwargs):
        """
        Create a transformer to remove outliers. A threshold is set for selection
        criteria, and further arguments are passed to the LocalOutlierFactor class.

        Keyword Args:
            neg_conf_val (float): The threshold for excluding samples with a lower
                negative outlier factor.

        Returns:
            object: to be used as a transformer method as part of Pipeline()
        """
        self.threshold = kwargs.pop('neg_conf_val', -10.0)
        self.kwargs = kwargs

    def transform(self, X, y=None):
        """
        Uses LocalOutlierFactor class to subselect data based on some threshold.

        Returns:
            ndarray: subsampled X, or a tuple (X, y) of subsampled arrays
                when y is given

        Notes:
            X should be of shape (n_samples, n_features)
        """
        X = np.asarray(X)
        lcf = LocalOutlierFactor(**self.kwargs)
        lcf.fit(X)
        # Keep only samples whose negative outlier factor is above the threshold.
        mask = lcf.negative_outlier_factor_ > self.threshold
        if y is None:
            return X[mask, :]
        return X[mask, :], np.asarray(y)[mask]

    def fit(self, *args, **kwargs):
        # Nothing to learn up front; the detector is fitted inside transform().
        return self
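To get a feel for what the threshold does, it can help to look at the raw scores first. The snippet below is only a small sketch on made-up 1-D data (the array and `n_neighbors=2` are my own choices for illustration): it fits LocalOutlierFactor, reads `negative_outlier_factor_`, and builds the same kind of boolean mask the transformer uses.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy 1-D data (illustrative only): three nearby points and one far-away value.
X = np.array([[-1.1], [0.2], [101.1], [0.3]])

lof = LocalOutlierFactor(n_neighbors=2)
lof.fit(X)

# Scores close to -1 mean "about as dense as the neighbours";
# strongly negative scores indicate outliers.
scores = lof.negative_outlier_factor_

threshold = -10.0  # same value as the default for neg_conf_val
mask = scores > threshold

print(scores)
print(X[mask].ravel())  # rows that survive the filter
```

Samples with a score at or below the threshold are dropped, so a threshold closer to -1 removes more points and a more negative one removes fewer.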
Then create a pipeline as:
pipe = Pipeline([('outliers', OutlierExtractor()), ...])
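One thing to keep in mind: scikit-learn's Pipeline calls each intermediate step with X only and passes y along unchanged, so a step that drops rows cannot drop the matching labels. For supervised models, one workaround is to filter first and then fit the pipeline on the filtered arrays. Below is a minimal sketch of that idea, not a definitive implementation. The helper name `remove_outliers`, the synthetic data, and the choice of StandardScaler plus LogisticRegression are all my own assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def remove_outliers(X, y, threshold=-10.0, **lof_kwargs):
    """Hypothetical helper: drop rows whose negative outlier factor
    is at or below `threshold`, keeping X and y aligned."""
    X = np.asarray(X)
    y = np.asarray(y)
    lof = LocalOutlierFactor(**lof_kwargs)
    lof.fit(X)
    mask = lof.negative_outlier_factor_ > threshold
    return X[mask], y[mask]


# Synthetic data: a Gaussian cluster plus two points placed far away.
rng = np.random.RandomState(0)
X_in = rng.normal(size=(200, 2))
y_in = (X_in[:, 0] > 0).astype(int)
X_far = np.array([[40.0, 40.0], [-40.0, 35.0]])
y_far = np.array([0, 1])
X = np.vstack([X_in, X_far])
y = np.concatenate([y_in, y_far])

# Filter X and y together, outside the pipeline.
X_clean, y_clean = remove_outliers(X, y, threshold=-10.0, n_neighbors=20)

# The remaining steps see arrays of matching length.
pipe = Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression())])
pipe.fit(X_clean, y_clean)

print(X.shape, X_clean.shape)
```

Filtering this way also means outliers are removed only from the training data; at prediction time the pipeline scores every row it is given.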