Question
I want to use sklearn.compose.ColumnTransformer sequentially (not in parallel: the second transformer should run only after the first has finished) on overlapping lists of columns, like this:
import numpy as np
import pandas as pd
from sklearn import compose, impute, pipeline
from sklearn import preprocessing as p

log_transformer = p.FunctionTransformer(np.log)
df = pd.DataFrame({'a': [1, 2, np.nan, 4], 'b': [1, np.nan, 3, 4], 'c': [1, 2, 3, 4]})
compose.ColumnTransformer(n_jobs=1,
    transformers=[
        ('num', impute.SimpleImputer(), ['a', 'b']),
        ('log', log_transformer, ['b', 'c']),
        ('scale', p.StandardScaler(), ['a', 'b', 'c'])
    ]).fit_transform(df)
So, I want to use SimpleImputer for 'a', 'b', then log for 'b', 'c', and then StandardScaler for 'a', 'b', 'c'.
But:
- I get an array of shape (4, 7).
- I still get NaN in the a and b columns.
So, how can I use ColumnTransformer for different columns in the manner of Pipeline?
UPD:
pipe_1 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=42)),
])
pipe_2 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=24)),
])
pipe_3 = pipeline.Pipeline(steps=[
    ('scl', p.StandardScaler()),
])
# in the real situation I don't know exactly what cols these arrays contain, so they are not static:
cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']
proc = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('1', pipe_1, cols_1),
    ('2', pipe_2, cols_2),
    ('3', pipe_3, cols_3),
])
proc.fit_transform(df).T
Output:
array([[ 1. , 2. , 42. , 4. ],
[ 1. , 24. , 3. , 4. ],
[-1.06904497, -0.26726124, nan, 1.33630621],
[-1.33630621, nan, 0.26726124, 1.06904497],
[-1.34164079, -0.4472136 , 0.4472136 , 1.34164079]])
I understand why I get duplicated columns, NaNs and unscaled values, but how can I fix this properly when the column lists are not static?
UPD2:
A problem may arise when the columns change their order. So, I want to use FunctionTransformer for column selection:
def select_col(X, cols=None):
    return X[cols]

ct1 = compose.make_column_transformer(
    (p.OneHotEncoder(), p.FunctionTransformer(select_col, kw_args=dict(cols=['a', 'b']))),
    remainder='passthrough'
)
ct1.fit(df)
But I get this error:
ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed
How can I fix it?
Answer 1:
The intended usage of ColumnTransformer is that the different transformers are applied in parallel, not sequentially. To accomplish your desired outcome, three approaches come to mind:
First approach:
pipe_a = Pipeline(steps=[('imp', SimpleImputer()),
                         ('scale', StandardScaler())])
pipe_b = Pipeline(steps=[('imp', SimpleImputer()),
                         ('log', log_transformer),
                         ('scale', StandardScaler())])
pipe_c = Pipeline(steps=[('log', log_transformer),
                         ('scale', StandardScaler())])
proc = ColumnTransformer(transformers=[
    ('a', pipe_a, ['a']),
    ('b', pipe_b, ['b']),
    ('c', pipe_c, ['c'])]
)
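Since the question says the column lists are not static, the per-column pipelines of the first approach can also be assembled programmatically. The following is only a sketch under assumptions: the list names `impute_cols`, `log_cols`, `scale_cols` and the helper `make_column_pipeline` are hypothetical and not part of the original post.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                   'b': [1, np.nan, 3, 4],
                   'c': [1, 2, 3, 4]})

# Hypothetical column lists; in practice these would be computed at runtime.
impute_cols = ['a', 'b']
log_cols = ['b', 'c']
scale_cols = ['a', 'b', 'c']


def make_column_pipeline(col):
    """Build the sequential steps that apply to a single column (hypothetical helper)."""
    steps = []
    if col in impute_cols:
        steps.append(('imp', SimpleImputer()))
    if col in log_cols:
        steps.append(('log', FunctionTransformer(np.log)))
    if col in scale_cols:
        steps.append(('scale', StandardScaler()))
    # A Pipeline needs at least one step, so untouched columns are passed through.
    return Pipeline(steps) if steps else 'passthrough'


# One transformer per column, so no column is ever selected twice.
proc = ColumnTransformer(
    transformers=[(col, make_column_pipeline(col), [col]) for col in df.columns]
)
result = proc.fit_transform(df)
```

Because every column appears in exactly one transformer, the output keeps one column per input column, in the original order, and each column's steps run in sequence inside its own Pipeline.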
This second one actually won't work as written, because the ColumnTransformer will rearrange the columns and forget the names*, so the later steps will fail or apply to the wrong columns. When sklearn finalizes how to pass along dataframes or feature names, this may be salvaged, or you may be able to tweak it for your specific use case now. (* ColumnTransformer does already have a get_feature_names method, renamed get_feature_names_out in scikit-learn 1.0 and later, but the actual data passed through the pipeline doesn't carry that information.)
imp_tfm = ColumnTransformer(
    transformers=[('num', impute.SimpleImputer(), ['a', 'b'])],
    remainder='passthrough'
)
log_tfm = ColumnTransformer(
    transformers=[('log', log_transformer, ['b', 'c'])],
    remainder='passthrough'
)
scl_tfm = ColumnTransformer(
    transformers=[('scale', StandardScaler(), ['a', 'b', 'c'])]
)
proc = Pipeline(steps=[
    ('imp', imp_tfm),
    ('log', log_tfm),
    ('scale', scl_tfm)]
)
Third, there may be a way to use the Pipeline slicing feature to have one "master" pipeline that you cut down for each feature. This would work mostly like the first approach and might save some coding for larger pipelines, but it seems a little hacky. For example, here you can:
pipe_c = clone(pipe_b)[1:]
pipe_a = clone(pipe_b)
pipe_a.steps[1] = ('nolog', 'passthrough')
(Without cloning or otherwise deep-copying pipe_b, the last line would change both pipe_a and pipe_b. The slicing mechanism returns a new Pipeline, so pipe_c doesn't strictly need to be cloned, but that copy is shallow (the step estimators are shared with pipe_b), so I've left the clone in to feel safer. Unfortunately you can't provide a discontinuous slice, so pipe_a = pipe_b[0,2] doesn't work, but you can set individual steps to "passthrough", as I've done above, to disable them.)
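A self-contained sketch of the slicing idea follows, with pipe_c meaning log + scale and pipe_a meaning impute + scale, as in the first approach. The small array X is made up for illustration only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# "Master" pipeline with all three steps.
pipe_b = Pipeline(steps=[('imp', SimpleImputer()),
                         ('log', FunctionTransformer(np.log)),
                         ('scale', StandardScaler())])

# Contiguous slice: drop the imputer, keep log + scale.
pipe_c = clone(pipe_b)[1:]

# Non-contiguous selection: clone, then disable the middle step.
pipe_a = clone(pipe_b)
pipe_a.steps[1] = ('nolog', 'passthrough')

# Slicing without clone shares the estimator objects with the master pipeline.
shared = pipe_b[1:].steps[0][1] is pipe_b.steps[1][1]

# Made-up single-column data with a missing value, just to exercise pipe_a.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
out_a = pipe_a.fit_transform(X)
```

Here `shared` shows why cloning before slicing is the cautious choice: fitting an uncloned slice would also fit the steps of pipe_b.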
Answer 2:
We can use a small columns_name_to_index helper to convert column names to positional indices, and then pass the DataFrame through a pipeline of ColumnTransformers like this:
def columns_name_to_index(arr_of_names, df):
    return [df.columns.get_loc(c) for c in arr_of_names if c in df]

cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']
# ColumnTransformer expects (name, transformer, columns) triples
ct1 = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('imp_1', impute.SimpleImputer(strategy='constant', fill_value=42), columns_name_to_index(cols_1, df)),
    ('imp_2', impute.SimpleImputer(strategy='constant', fill_value=24), columns_name_to_index(cols_2, df)),
])
ct2 = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('scl', p.StandardScaler(), columns_name_to_index(cols_3, df)),
])
pipe = pipeline.Pipeline(steps=[
    ('ct1', ct1),
    ('ct2', ct2),
])
pipe.fit_transform(df).T
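One caveat with positional indices computed from the original DataFrame: a ColumnTransformer places the outputs of its listed transformers first and the remainder columns after them, so the indices stay valid for the next stage only if the first stage happens to preserve the column order (as it does for the cols_1 and cols_2 above). The sketch below, with a made-up single transformer on 'b', shows the reordering.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                   'b': [1, np.nan, 3, 4],
                   'c': [1, 2, 3, 4]})

# Only 'b' is listed, so 'a' and 'c' fall into the remainder.
ct = ColumnTransformer(remainder='passthrough', transformers=[
    ('imp_b', SimpleImputer(strategy='constant', fill_value=24), ['b']),
])
out = ct.fit_transform(df)

# Output column order is now b, a, c: position 0 holds the imputed 'b',
# and position 1 holds the untouched 'a' (still containing its NaN).
first_col = out[:, 0].tolist()
a_still_has_nan = bool(np.isnan(out[2, 1]))
```

If the column lists can put a later column ahead of an earlier one, the indices for the second stage would need to be recomputed against this new order.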
Source: https://stackoverflow.com/questions/62225230/consistent-columntransformer-for-intersecting-lists-of-columns