How to do Onehotencoding in Sklearn Pipeline

Posted by 99封情书 on 2019-12-04 08:35:14

Question


I am trying to one-hot encode the categorical variables of my Pandas dataframe, which includes both categorical and continuous variables. I realise this can be done easily with the pandas .get_dummies() function, but I need to use a pipeline so I can generate a PMML file later on.

This is the code to create a mapper. The categorical variables I would like to encode are stored in a list called 'dummies'.

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

mapper = DataFrameMapper(
    [(d, LabelEncoder()) for d in dummies] +
    [(d, OneHotEncoder()) for d in dummies]
)

And this is the code to create a pipeline, including the mapper and linear regression.

from sklearn2pmml import PMMLPipeline
from sklearn.linear_model import LinearRegression

lm = PMMLPipeline([("mapper", mapper),
                   ("regressor", LinearRegression())])

When I now try to fit (with 'features' being a dataframe, and 'targets' a series), it gives an error 'could not convert string to float'.

lm.fit(features, targets)

Anyone who can help me out? I am desperate for working pipelines including the preprocessing of data... Thanks in advance!


Answer 1:


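OneHotEncoder doesn't support string features (this applies to scikit-learn releases before 0.20), and with [(d, OneHotEncoder()) for d in dummies] you are applying it directly to the raw string columns listed in dummies, which is what triggers the 'could not convert string to float' error. Use LabelBinarizer instead: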

from sklearn.preprocessing import LabelBinarizer

mapper = DataFrameMapper(
    [(d, LabelBinarizer()) for d in dummies]
)
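
To see what LabelBinarizer produces for a string column, here is a minimal sketch that runs it on its own, outside of DataFrameMapper. The colour values are made-up illustration data and do not come from the question. One thing worth knowing: with exactly two distinct categories, LabelBinarizer returns a single 0/1 column rather than two columns.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical string column; it stands in for one entry of 'dummies'.
colours = np.array(["red", "green", "blue", "green"])

binarizer = LabelBinarizer()
encoded = binarizer.fit_transform(colours)

# Classes are sorted alphabetically; each output column maps to one class.
print(binarizer.classes_)
print(encoded)

# With exactly two categories the output collapses to a single column.
binary = LabelBinarizer().fit_transform(np.array(["yes", "no", "yes"]))
print(binary.shape)
```

Because the binarizer accepts strings directly, no separate LabelEncoder step is needed in the mapper.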

An alternative would be to use the LabelEncoder with a second OneHotEncoder step.

mapper = DataFrameMapper(
    [(d, LabelEncoder()) for d in dummies]
)

lm = PMMLPipeline([("mapper", mapper),
                   ("onehot", OneHotEncoder()),
                   ("regressor", LinearRegression())])
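
The two-step idea can be illustrated on a single column with plain scikit-learn. This is only a sketch under assumptions: the colour values are invented, and it leaves out DataFrameMapper and PMMLPipeline so that it runs with scikit-learn and NumPy alone.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical string column; it stands in for one entry of 'dummies'.
colours = np.array(["red", "green", "blue", "green"])

# Step 1: map each string to an integer code (classes sorted alphabetically).
codes = LabelEncoder().fit_transform(colours)

# Step 2: one-hot encode the integer codes. OneHotEncoder expects a 2-D
# array, so the 1-D codes are reshaped into a single column first.
onehot = OneHotEncoder().fit_transform(codes.reshape(-1, 1))

# The default output is a sparse matrix; convert it to dense for display.
dense = onehot.toarray()
print(codes)
print(dense)
```

In the pipeline version, the mapper plays the role of step 1 for every column in dummies, and the "onehot" step then receives all the integer-coded columns together.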


Source: https://stackoverflow.com/questions/42204250/how-to-do-onehotencoding-in-sklearn-pipeline
