How to output Pandas object from sklearn pipeline

礼貌的吻别 · 2021-02-19 23:18

I have constructed a pipeline that takes a pandas DataFrame that has been split into categorical and numerical columns. I am trying to run GridSearchCV on my results and ultimately …

1 Answer
  • 2021-02-20 00:08

    I would actually go for creating the column names from the input. If your input is already divided into numerical and categorical columns, you can use pd.get_dummies to get the number of distinct categories for each categorical feature, as sketched just below.
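
    A minimal sketch (using a hypothetical toy frame `df`, not part of the example further down) of how pd.get_dummies exposes the category levels:

    # minimal sketch: pd.get_dummies creates one column per distinct category level,
    # so its column index gives you both the count and the names of the levels
    import pandas as pd

    df = pd.DataFrame({'cat1': [0, 1, 1, 2]})
    print(pd.get_dummies(df['cat1']).columns.tolist())                 # [0, 1, 2]
    print(pd.get_dummies(df['cat1'], prefix='cat1').columns.tolist())  # ['cat1_0', 'cat1_1', 'cat1_2']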

    You can then create proper names for the columns, as shown in the last part of this working example (based on the question, with some artificial data).

    import numpy as np
    import pandas as pd

    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
    
    # create artificial data
    numeric_features_vals = pd.DataFrame({'x1': [1, 2, 3, 4], 'x2': [0.15, 0.25, 0.5, 0.45]})
    numeric_features = ['x1', 'x2']
    categorical_features_vals = pd.DataFrame({'cat1': [0, 1, 1, 2], 'cat2': [2, 1, 5, 0] })
    categorical_features = ['cat1', 'cat2']
    
    X_train = pd.concat([numeric_features_vals, categorical_features_vals], axis=1)
    X_test = pd.DataFrame({'x1':[2,3], 'x2':[0.2, 0.3], 'cat1':[0, 1], 'cat2':[2, 1]})
    y_train = pd.DataFrame({'labels': [10, 20, 30, 40]})
    
    # impute and standardize numeric data 
    numeric_transformer = Pipeline([
        ('impute', SimpleImputer(missing_values=np.nan, strategy="mean")),
        ('scale', StandardScaler())
    ])
    
    # impute and encode dummy variables for categorical data
    categorical_transformer = Pipeline([
        ('impute', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
        ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
    ])
    
    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
    
    clf = Pipeline([
        ('transform', preprocessor),
        ('ridge', Ridge())
    ])
    
    
    kf = KFold(n_splits=2, shuffle=True, random_state=44)
    cross_val_score(clf, X_train, y_train, cv=kf).mean()
    
    param_grid = {
        'ridge__alpha': [.001, .1, 1.0, 5, 10, 100]
    }
    
    gs = GridSearchCV(clf, param_grid, cv = kf)
    gs.fit(X_train, y_train)
    
    model = gs.best_estimator_
    predictions = model.fit(X_train, y_train).predict(X_test)
    print('coefficients : ',  model.named_steps['ridge'].coef_, '\n')
    
    # create column names for categorical hot encoded data
    columns_names_to_map = list(np.copy(numeric_features))
    columns_names_to_map.extend('cat1_' + str(col) for col in pd.get_dummies(X_train['cat1']).columns)
    columns_names_to_map.extend('cat2_' + str(col) for col in pd.get_dummies(X_train['cat2']).columns)
    
    print('columns after preprocessing :', columns_names_to_map,  '\n')
    print('#'*80)
    print( '\n', 'dataframe of rescaled features with custom column names: \n\n', pd.DataFrame({col:vals for vals, col in zip(preprocessor.fit_transform(X_train).T, columns_names_to_map)}))
    print('#'*80)
    print( '\n', 'dataframe of ridge coefficients with custom column names: \n\n', pd.DataFrame({col:vals for vals, col in zip(model.named_steps['ridge'].coef_.T, columns_names_to_map)}))
    

    At the end, the code above prints a dataframe that maps each generated column name to its rescaled feature values, and another that maps each column name to its ridge coefficient.
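
    You don't actually have to re-run pd.get_dummies to build the names: once the preprocessor has been fitted, you can read the category levels straight from the fitted OneHotEncoder. A minimal sketch, assuming the preprocessor, numeric_features and categorical_features defined in the code above:

    # pull the fitted OneHotEncoder out of the fitted ColumnTransformer
    ohe = preprocessor.named_transformers_['cat'].named_steps['one_hot']

    # one name per category level, in the same order the encoder emits its columns
    encoded_names = [f'{feature}_{level}'
                     for feature, levels in zip(categorical_features, ohe.categories_)
                     for level in levels]
    feature_names = list(numeric_features) + encoded_names

    # wrap the preprocessor's numpy output back into a labelled DataFrame
    X_train_out = pd.DataFrame(preprocessor.transform(X_train), columns=feature_names)
    print(X_train_out)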
