Let's say that I want to compare different dimensionality reduction approaches for a particular (supervised) dataset that consists of n>2 features via cross-validation and by using the Pipeline class.
I didn't find the accepted answer very clear, so here is my solution for others.
Basically, the idea is to make a new class based on BaseEstimator and TransformerMixin.
The following is a feature selector based on the percentage of NAs within a column; the perc value is the maximum allowed fraction of NAs per column.
from sklearn.base import TransformerMixin, BaseEstimator

class NonNAselector(BaseEstimator, TransformerMixin):
    """Select the columns with less than `perc` fraction of NAs, to impute further
    down the line.

    Class to use in the pipeline
    -----
    fit : identify the columns to keep - on the training set
    transform : keep only those columns
    """
    def __init__(self, perc=0.1):
        self.perc = perc
        self.columns_with_less_than_x_na_id = None

    def fit(self, X, y=None):
        # Keep only the column names whose fraction of NAs is below the threshold
        na_fraction = X.isna().sum() / X.shape[0]
        self.columns_with_less_than_x_na_id = X.columns[na_fraction < self.perc].tolist()
        return self

    def transform(self, X, y=None, **kwargs):
        return X[self.columns_with_less_than_x_na_id]

    def get_params(self, deep=True):
        # BaseEstimator already provides get_params/set_params; this override just
        # makes the single hyperparameter explicit
        return {"perc": self.perc}
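Since the selector works with column names, it assumes a pandas DataFrame as input. Here is a minimal usage sketch (the toy data is made up for illustration):

import numpy as np
import pandas as pd

X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],   # 75% NA -> dropped with perc=0.5
    "c": [1.0, np.nan, 3.0, 4.0],         # 25% NA -> kept
})

selector = NonNAselector(perc=0.5)
print(selector.fit_transform(X).columns.tolist())  # ['a', 'c']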
I just want to post my solution for completeness, and maybe it is useful to someone:
class ColumnExtractor(object):
    def transform(self, X):
        cols = X[:, 2:4]  # columns 3 and 4 (indices 2 and 3) are "extracted"
        return cols

    def fit(self, X, y=None):
        return self
Then, it can be used in the Pipeline
like so:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', ColumnExtractor()),
    ('classification', GaussianNB())
])
And for a more general solution, if you want to select and stack multiple columns, you can use the following class:
import numpy as np

class ColumnExtractor(object):
    def __init__(self, cols):
        self.cols = cols  # indices of the columns to extract

    def transform(self, X):
        col_list = []
        for c in self.cols:
            col_list.append(X[:, c:c+1])  # the slice keeps a 2-D shape
        return np.concatenate(col_list, axis=1)

    def fit(self, X, y=None):
        return self
clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('dim_red', ColumnExtractor(cols=(1, 3))),  # selects the second and fourth column
    ('classification', GaussianNB())
])
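A quick end-to-end check with made-up data (the array and target here are only for illustration):

import numpy as np

X = np.random.RandomState(0).normal(size=(100, 5))
y = (X[:, 1] + X[:, 3] > 0).astype(int)  # target depends only on the selected columns

clf.fit(X, y)
print(clf.score(X, y))

This works for plain fitting; to make the extractor tunable with grid search you would also inherit from BaseEstimator and TransformerMixin, as in the other answers.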
You can use the following custom transformer to select the columns specified:
# Custom transformer that extracts the columns passed as an argument to its constructor
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    # Class constructor; store the argument under the same name so that
    # get_params/set_params (inherited from BaseEstimator) keep working
    def __init__(self, feature_names):
        self.feature_names = feature_names

    # Return self, nothing else to do here
    def fit(self, X, y=None):
        return self

    # Method that describes what we need this transformer to do
    def transform(self, X, y=None):
        return X[self.feature_names]
Here feature_names is the list of features you want to select. For more details, you can refer to this link: https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65
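A usage sketch with a pandas DataFrame (the column names and data are assumed here, not part of the original answer):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, 32, 47], "height": [1.80, 1.65, 1.75], "name": ["a", "b", "c"]})

pipe = Pipeline(steps=[
    ("select", FeatureSelector(feature_names=["age", "height"])),
    ("scale", StandardScaler()),
])
print(pipe.fit_transform(df))  # only the selected columns, scaled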
If you want to use the Pipeline object, then yes, the clean way is to write a transformer object. The dirty way to do this is

select_3_and_4.transform = select_3_and_4.__call__
select_3_and_4.fit = lambda X, y=None: select_3_and_4

and use select_3_and_4 (the column-selecting function from your question) as you had it in your pipeline. You can evidently also write a class.
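Spelled out, the dirty way looks roughly like this (assuming select_3_and_4 is defined as a plain function, as in the question):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB

def select_3_and_4(X):
    return X[:, 2:4]

# Graft the transformer interface onto the function object
select_3_and_4.transform = select_3_and_4.__call__
select_3_and_4.fit = lambda X, y=None: select_3_and_4

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', select_3_and_4),
    ('classification', GaussianNB())
])
# clf.fit(X_train, y_train) now treats the function as a transformer step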
Otherwise, you could also just give X_train[:, 2:4]
to your pipeline if you know that the other features are irrelevant.
Data-driven feature selection tools are maybe off-topic, but always useful: check e.g. sklearn.feature_selection.SelectKBest using sklearn.feature_selection.f_classif or sklearn.feature_selection.f_regression with e.g. k=2 in your case.
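For example, a sketch of that approach (f_classif assumes a classification target; swap in f_regression for regression):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', SelectKBest(f_classif, k=2)),  # keep the 2 highest-scoring features
    ('classification', GaussianNB())
])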
Adding to Sebastian Raschka's and eickenberg's answers, the requirements a transformer object should meet are specified in scikit-learn's documentation.
There are several more requirements than just having fit and transform if you want the estimator to be usable in parameter estimation, such as implementing set_params.
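Inheriting from BaseEstimator gives you get_params and set_params for free, which is what makes a custom step tunable in a grid search. A sketch, reusing the NonNAselector from an earlier answer (the parameter grid values are made up):

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB

pipe = Pipeline(steps=[
    ('na_filter', NonNAselector()),
    ('classification', GaussianNB())
])

grid = GridSearchCV(pipe, param_grid={'na_filter__perc': [0.1, 0.3, 0.5]}, cv=3)
# grid.fit(X, y)  # X: DataFrame (possibly with NAs), y: target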