One-Hot-Encode categorical variables and scale continuous ones simultaneously

Asked 2020-12-24 14:04

I'm confused because it's going to be a problem if you first do OneHotEncoder and then StandardScaler, because the scaler will also scale the columns produced by the OneHotEncoder.
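
For illustration, here is a minimal sketch of the problematic order on a hypothetical toy DataFrame (not the OP's data): encoding first and then scaling everything standardizes the 0/1 dummy columns too.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    
    # Toy data: one numeric column, one categorical column
    df = pd.DataFrame({'gpa': [3.6, 3.0, 3.9], 'rank': [1, 2, 1]})
    
    # Encode first ...
    encoded = OneHotEncoder(sparse=False).fit_transform(df[['rank']])  # sparse_output=False on scikit-learn >= 1.2
    combined = pd.concat([df[['gpa']],
                          pd.DataFrame(encoded, index=df.index)], axis=1)
    
    # ... then scaling the whole thing also standardizes the dummy columns
    scaled_all = StandardScaler().fit_transform(combined)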

4 Answers
  • 2020-12-24 14:21

    Sure thing. Just scale and one-hot-encode the relevant columns separately, then column-bind them back together:

    # Import libraries and download example data
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    
    dataset = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
    print(dataset.head(5))
    
    # Define which columns should be encoded vs scaled
    columns_to_encode = ['rank']
    columns_to_scale  = ['gre', 'gpa']
    
    # Instantiate encoder/scaler
    scaler = StandardScaler()
    ohe    = OneHotEncoder(sparse=False)  # sparse_output=False on scikit-learn >= 1.2
    
    # Scale and encode the separate column sets
    scaled_columns  = scaler.fit_transform(dataset[columns_to_scale])
    encoded_columns = ohe.fit_transform(dataset[columns_to_encode])
    
    # Concatenate (column-bind) processed columns back together
    processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)
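
    If you want the column labels back, you can rebuild a labeled DataFrame from the processed array. A small sketch, assuming scikit-learn 0.20-era APIs (OneHotEncoder.get_feature_names was renamed get_feature_names_out in later releases):

    # Rebuild a labeled DataFrame from the processed NumPy array
    feature_names = columns_to_scale + list(ohe.get_feature_names(columns_to_encode))
    processed_df  = pd.DataFrame(processed_data, columns=feature_names)
    print(processed_df.head())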
    
  • 2020-12-24 14:26

    Scikit-learn version 0.20 provides sklearn.compose.ColumnTransformer for column transformations with mixed types: you can scale the numeric features and one-hot encode the categorical ones in a single step. Below is the official example (you can find the code here):

    # Author: Pedro Morales <part.morales@gmail.com>
    #
    # License: BSD 3 clause
    
    from __future__ import print_function
    
    import pandas as pd
    import numpy as np
    
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, GridSearchCV
    
    np.random.seed(0)
    
    # Read data from Titanic dataset.
    titanic_url = ('https://raw.githubusercontent.com/amueller/'
                   'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
    data = pd.read_csv(titanic_url)
    
    # We will train our classifier with the following features:
    # Numeric Features:
    # - age: float.
    # - fare: float.
    # Categorical Features:
    # - embarked: categories encoded as strings {'C', 'S', 'Q'}.
    # - sex: categories encoded as strings {'female', 'male'}.
    # - pclass: ordinal integers {1, 2, 3}.
    
    # We create the preprocessing pipelines for both numeric and categorical data.
    numeric_features = ['age', 'fare']
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])
    
    categorical_features = ['embarked', 'sex', 'pclass']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])
    
    # Append classifier to preprocessing pipeline.
    # Now we have a full prediction pipeline.
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', LogisticRegression(solver='lbfgs'))])
    
    X = data.drop('survived', axis=1)
    y = data['survived']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    clf.fit(X_train, y_train)
    print("model score: %.3f" % clf.score(X_test, y_test))
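
    The imports above include GridSearchCV because the official example goes on to tune the whole pipeline end-to-end. A hedged sketch of that step (the grid values here are illustrative; the double-underscore parameter names follow the step names defined above):

    # Tune preprocessing and model hyperparameters together.
    # 'preprocessor__num__imputer__strategy' reaches into the numeric
    # sub-pipeline; 'classifier__C' addresses the LogisticRegression step.
    param_grid = {
        'preprocessor__num__imputer__strategy': ['mean', 'median'],
        'classifier__C': [0.1, 1.0, 10, 100],
    }
    grid_search = GridSearchCV(clf, param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    print("best params:", grid_search.best_params_)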
    

    Caution: this method is EXPERIMENTAL; some behaviors may change between releases without deprecation.

  • 2020-12-24 14:28

    There are presently numerous ways to achieve the outcome the OP requires. Three of them are:

    1. np.concatenate() - see this answer to the OP's question, already posted

    2. scikit-learn's ColumnTransformer

      • originally suggested in this SO answer to the OP's question

    3. scikit-learn's FeatureUnion

      • also shown in this SO answer

    Using the example posted by @Max Power here, below is a minimal working snippet that does what the OP is looking for and brings the transformed columns together into a single Pandas dataframe. The output of all 3 approaches is shown.

    The common code for all 3 methods is:

    import numpy as np
    import pandas as pd
    
    # Import libraries and download example data
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    
    dataset = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
    
    # Define which columns should be encoded vs scaled
    columns_to_encode = ['rank']
    columns_to_scale  = ['gre', 'gpa']
    
    # Instantiate encoder/scaler
    scaler = StandardScaler()
    ohe    = OneHotEncoder(sparse=False)  # sparse_output=False on scikit-learn >= 1.2
    

    Method 1. See the code here. To show the output, you can use:

    print(pd.DataFrame(processed_data).head())
    

    Output of Method 1.

              0         1    2    3    4    5
    0 -1.800263  0.579072  0.0  0.0  1.0  0.0
    1  0.626668  0.736929  0.0  0.0  1.0  0.0
    2  1.840134  1.605143  1.0  0.0  0.0  0.0
    3  0.453316 -0.525927  0.0  0.0  0.0  1.0
    4 -0.586797 -1.209974  0.0  0.0  0.0  1.0
    

    Method 2.

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    
    
    p = Pipeline(
        [("coltransformer", ColumnTransformer(
            transformers=[
                ("assessments", Pipeline([("scale", scaler)]), columns_to_scale),
                ("ranks", Pipeline([("encode", ohe)]), columns_to_encode),
            ]),
        )]
    )
    
    print(pd.DataFrame(p.fit_transform(dataset)).head())
    

    Output of Method 2.

              0         1    2    3    4    5
    0 -1.800263  0.579072  0.0  0.0  1.0  0.0
    1  0.626668  0.736929  0.0  0.0  1.0  0.0
    2  1.840134  1.605143  1.0  0.0  0.0  0.0
    3  0.453316 -0.525927  0.0  0.0  0.0  1.0
    4 -0.586797 -1.209974  0.0  0.0  0.0  1.0
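
    As an aside, the single-step inner Pipelines in Method 2. are not strictly necessary; passing the transformers to ColumnTransformer directly is a behaviorally equivalent, slightly simpler sketch:

    # Equivalent to Method 2. without the wrapper Pipelines
    p_simple = ColumnTransformer(transformers=[
        ("assessments", scaler, columns_to_scale),
        ("ranks", ohe, columns_to_encode),
    ])
    print(pd.DataFrame(p_simple.fit_transform(dataset)).head())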
    

    Method 3.

    from sklearn.pipeline import Pipeline
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import FeatureUnion
    
    
    class ItemSelector(BaseEstimator, TransformerMixin):
        """Select a subset of DataFrame columns, for use inside a FeatureUnion."""

        def __init__(self, key):
            self.key = key

        def fit(self, x, y=None):
            return self

        def transform(self, df):
            return df[self.key]
    
    p = Pipeline([("union", FeatureUnion(
        transformer_list=[
            ("assessments", Pipeline([
                ("selector", ItemSelector(key=columns_to_scale)),
                ("scale", scaler)
                ]),
            ),
            ("ranks", Pipeline([
                ("selector", ItemSelector(key=columns_to_encode)),
                ("encode", ohe)
                ]),
            ),
        ]))
    ])
    
    print(pd.DataFrame(p.fit_transform(dataset)).head())
    

    Output of Method 3.

              0         1    2    3    4    5
    0 -1.800263  0.579072  0.0  0.0  1.0  0.0
    1  0.626668  0.736929  0.0  0.0  1.0  0.0
    2  1.840134  1.605143  1.0  0.0  0.0  0.0
    3  0.453316 -0.525927  0.0  0.0  0.0  1.0
    4 -0.586797 -1.209974  0.0  0.0  0.0  1.0
    

    Explanation

    1. Method 1. is already explained.

    2. Methods 2. and 3. accept the full dataset but only perform specific actions on subsets of the data; the processed subsets are then brought back together (column-bound) into the final output. Columns not selected by any transformer are dropped by default (see the sketch below).
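
    For ColumnTransformer specifically, the remainder parameter controls what happens to the unlisted columns; a short sketch of keeping them untouched ('admit' is the remaining column in this dataset):

    # remainder='passthrough' appends unlisted columns unchanged;
    # the default, remainder='drop', discards them.
    keep_all = ColumnTransformer(
        transformers=[
            ("assessments", scaler, columns_to_scale),
            ("ranks", ohe, columns_to_encode),
        ],
        remainder='passthrough',
    )
    print(pd.DataFrame(keep_all.fit_transform(dataset)).head())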

    Details

    pandas==0.23.4
    numpy==1.15.2
    scikit-learn==0.20.0
    

    Additional Notes

    The 3 methods shown here are probably not the only possibilities; I am sure there are other ways to do this.

    Source Used

    Updated link to binary.csv dataset

  • 2020-12-24 14:32

    I can't see the problem: OneHotEncoder is used for nominal data, and StandardScaler is used for numeric data, so you wouldn't apply both of them to the same columns.
