AttributeError when using ColumnTransformer into a pipeline

百般思念 提交于 2019-12-05 21:39:34

ColumnTransformer returns numpy.array, so it can't have column attribute (as indicated by your error).

If I may suggest a different solution, use pandas for both of your tasks, it will be easier.

Step 1 - replacing missing values

To replace missing value in a subset of columns with missing_value string use this:

dataframe[["PoolQC", "Alley"]].fillna("missing_value", inplace=True)

For the rest (imputing with mean of each column), this will work perfectly:

dataframe[["Street", "MSZoning", "LandContour"]].fillna(
    dataframe[["Street", "MSZoning", "LandContour"]].mean(), inplace=True
)

Step 2 - one hot encoding and categorical variables

pandas provides get_dummies, which returns pandas Dataframe, unlike ColumnTransfomer, code for this would be:

encoded = pd.get_dummies(dataframe[['MSZoning', 'LandContour']], drop_first=True)
pd.dropna(['MSZoning', 'LandContour'], axis=columns, inplace=True)
dataframe = dataframe.join(encoded)

For ordinal variables and their encoding I would suggest you to look at this SO answer (unluckily some manual mapping would be needed in this case).

If you want to use transformer anyway

Get np.array from the dataframe using values attribute, pass it through the pipeline and recreate columns and indices from the array like this:

pd.DataFrame(data=your_array, index=np.arange(len(your_array)), columns=["A", "B"])

There is one caveat of this aprroach though; you will not know the names of custom created one-hot-encoded columns (the pipeline will not do this for you).

Additionally, you could get the names of columns from sklearn's transforming objects (e.g. using categories_ attribute), but I think it would break the pipeline (someone correct me if I'm wrong).

Option #2

use the make_pipeline function

(Had the same Error, found this answer, than found this: Introducing the ColumnTransformer)

from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
cat_columns_fill_miss = ['PoolQC', 'Alley']
cat_columns_fill_freq = ['Street', 'MSZoning', 'LandContour']
cat_columns_ord = ['Street', 'Alley', 'PoolQC']
ord_mapping = [['Pave', 'Grvl'],                          # Street
               ['missing_value', 'Pave', 'Grvl'],         # Alley
               ['missing_value', 'Fa', 'TA', 'Gd', 'Ex']  # PoolQC
               ]
cat_columns_onehot = ['MSZoning', 'LandContour']

imputer_cat_pipeline = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='constant'), cat_columns_fill_miss),
    (make_pipeline(SimpleImputer(strategy='most_frequent'), cat_columns_fill_freq),
)

encoder_cat_pipeline = make_column_transformer(
    (OrdinalEncoder(categories=ord_mapping), cat_columns_ord),
    (OneHotEncoder(), cat_columns_onehot),
)

cat_pipeline = Pipeline([
    ('imp_cat', imputer_cat_pipeline),
    ('cat_encoder', encoder_cat_pipeline),
])

In my own pipelines i do not have overlapping preprocessing in the column space. So i am not sure, how the transformation and than the "outer pipelining" works.

However, the important part is to use make_pipeline around the SimpleImputer to use it in a pipeline properly:

imputer_cat_pipeline = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='constant'), cat_columns_fill_miss),
)
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!