One-hot encode categorical variables and scale continuous ones simultaneously

情书的邮戳 2020-12-24 14:04

I'm confused because it's going to be a problem if you first do OneHotEncoder and then StandardScaler because the scaler will also scale the colu

4 answers
  •  有刺的猬
    2020-12-24 14:26

    Since version 0.20, scikit-learn provides sklearn.compose.ColumnTransformer, which applies different transformers to different columns of a mixed-type dataset. You can scale the numeric features and one-hot encode the categorical ones in a single step, so the scaler never touches the one-hot columns. Below is the official example (you can find the code here):

    # Author: Pedro Morales 
    #
    # License: BSD 3 clause
    
    from __future__ import print_function
    
    import pandas as pd
    import numpy as np
    
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, GridSearchCV
    
    np.random.seed(0)
    
    # Read data from Titanic dataset.
    titanic_url = ('https://raw.githubusercontent.com/amueller/'
                   'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
    data = pd.read_csv(titanic_url)
    
    # We will train our classifier with the following features:
    # Numeric Features:
    # - age: float.
    # - fare: float.
    # Categorical Features:
    # - embarked: categories encoded as strings {'C', 'S', 'Q'}.
    # - sex: categories encoded as strings {'female', 'male'}.
    # - pclass: ordinal integers {1, 2, 3}.
    
    # We create the preprocessing pipelines for both numeric and categorical data.
    numeric_features = ['age', 'fare']
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])
    
    categorical_features = ['embarked', 'sex', 'pclass']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])
    
    # Append classifier to preprocessing pipeline.
    # Now we have a full prediction pipeline.
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', LogisticRegression(solver='lbfgs'))])
    
    X = data.drop('survived', axis=1)
    y = data['survived']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    clf.fit(X_train, y_train)
    print("model score: %.3f" % clf.score(X_test, y_test))
    

    Caution: in version 0.20 this API is marked EXPERIMENTAL, so some behaviors may change between releases without deprecation.
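To check that the scaler really leaves the one-hot columns alone, here is a minimal sketch on a tiny in-memory DataFrame instead of the Titanic download. The toy values and the variable names `df`, `Xt`, `num_part` and `cat_part` are made up for illustration; only `ColumnTransformer`, `StandardScaler` and `OneHotEncoder` come from the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values, for illustration only).
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10],
    "sex": ["male", "female", "female", "male"],
})

numeric_features = ["age", "fare"]
categorical_features = ["sex"]

# Each transformer only sees the columns listed for it.
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

Xt = preprocessor.fit_transform(df)

# The result may be a sparse matrix, depending on its density;
# convert to a dense array so it can be sliced uniformly.
if hasattr(Xt, "toarray"):
    Xt = Xt.toarray()
Xt = np.asarray(Xt)

# Output columns follow the order of the transformers list:
# the two scaled numeric columns first, then the one-hot columns.
num_part = Xt[:, :len(numeric_features)]
cat_part = Xt[:, len(numeric_features):]

print("numeric means:", num_part.mean(axis=0))
print("numeric stds: ", num_part.std(axis=0))
print("one-hot block:\n", cat_part)
```

The numeric block should come out with mean 0 and unit standard deviation per column, while the one-hot block should still contain only 0s and 1s, which is what the question was worried about losing.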
