sklearn pipeline - how to apply different transformations on different columns

会有一股神秘感。 提交于 2019-11-26 14:08:18

问题


I am pretty new to pipelines in sklearn and I am running into this problem: I have a dataset that has a mixture of text and numbers i.e. certain columns have text only and rest have integers (or floating point numbers).

I was wondering if it was possible to build a pipeline where I can for example call LabelEncoder() on the text features and MinMaxScaler() on the numbers columns. The examples I have seen on the web mostly point towards using LabelEncoder() on the entire dataset and not on select columns. Is this possible? If so any pointers would be greatly appreciated.


回答1:


The way I usually do it is with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.

Important notes:

  • You have to define your functions with def since annoyingly you can't use lambda or partial in FunctionTransformer if you want to pickle your model

  • You need to initialize FunctionTransformer with validate=False

Something like this:

from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer

def get_text_cols(df):
    return df[['name', 'fruit']]

def get_num_cols(df):
    return df[['height','age']]

vec = make_union(*[
    make_pipeline(FunctionTransformer(get_text_cols, validate=False), LabelEncoder()))),
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler())))
])



回答2:


Since v0.20, you can use ColumnTransformer to accomplish this.



来源:https://stackoverflow.com/questions/39001956/sklearn-pipeline-how-to-apply-different-transformations-on-different-columns

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!