Question
I put together the following function, which reads a CSV file, trains a model, and predicts on the request data.
It raises the following ValueError: Column ordering must be equal for fit and for transform when using the remainder keyword
The training data and the data used for prediction have exactly the same number of columns (15). I am not sure how the "ordering" of the columns could have changed.
~/.local/lib/python3.5/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
417 Xt = X
418 for _, name, transform in self._iter(with_final=False):
--> 419 Xt = transform.transform(Xt)
420 return self.steps[-1][-1].predict(Xt, **predict_params)
421
~/.local/lib/python3.5/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
581 if (n_cols_transform >= n_cols_fit and
582 any(X.columns[:n_cols_fit] != self._df_columns)):
--> 583 raise ValueError('Column ordering must be equal for fit '
584 'and for transform when using the '
585 'remainder keyword')
ValueError: Column ordering must be equal for fit and for transform when using the remainder keyword
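The check shown in the traceback compares column names position by position, so two frames with the same set of columns can still fail it. As a minimal diagnostic sketch, the hypothetical helper below (not part of scikit-learn) reports the first position where two column lists disagree. It works on plain lists, so it could be called with `list(X_train.columns)` and `list(df_resp.columns)`.

```python
def first_order_mismatch(fit_columns, transform_columns):
    """Return (position, fit_name, transform_name) for the first
    position where the two column lists differ, or None if every
    compared position matches."""
    for pos, (fit_name, new_name) in enumerate(zip(fit_columns, transform_columns)):
        if fit_name != new_name:
            return pos, fit_name, new_name
    return None


# Illustrative column names only: same set of columns, different order.
fit_cols = ["B", "A", "C"]
request_cols = ["A", "B", "C"]
print(first_order_mismatch(fit_cols, request_cols))
```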
Function:
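import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric columns: fill missing values with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
# Categorical columns: fill missing values with a constant, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
# Put the data transformation and the model in a single pipeline
rf = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', RandomForestClassifier(
                         n_estimators=500,
                         criterion="gini",
                         max_features="sqrt",
                         min_samples_leaf=4))])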
rf.fit(X_train, y_train)
request_data = {'A': [request.A],
                'B': [request.B],
                'C': [request.C],
                'D': [request.D],
                'E': [request.E],
                'F': [request.F],
                'G': [request.G],
                'H': [request.H],
                'I': [request.I],
                'J': [request.J],
                'K': [request.K],
                'L': [request.L],
                'M': [request.M],
                'N': [request.N],
                'O': [request.O]}
df_resp = pd.DataFrame(data=request_data)
response = rf.predict(df_resp)
output = {"Safety Rating": response[0]}
return output
Answer 1:
As I understand the error message, X_train.columns and df_resp.columns are not in the same order, but .predict() needs them to be.
To force this equality, you could pass the column list of X_train as an argument when creating the DataFrame:
pd.DataFrame(data=request_data, columns=X_train.columns)
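As a minimal sketch of this idea, the example below uses a small stand-in for `X_train` with made-up columns and values (both are assumptions for illustration only). It shows fixing the order at construction time, and alternatively reordering a frame that already exists with `reindex`.

```python
import pandas as pd

# Hypothetical stand-in for the training frame; only its column order matters here.
X_train = pd.DataFrame({"C": [1.0], "A": ["x"], "B": [3.0]})

# Request payload whose keys arrive in a different order than the training columns.
request_data = {"A": ["y"], "B": [4.0], "C": [2.0]}

# Option 1: fix the column order when the frame is built.
df_resp = pd.DataFrame(data=request_data, columns=X_train.columns)

# Option 2: reorder a frame that already exists.
# Note: reindex fills any column missing from the frame with NaN.
df_existing = pd.DataFrame(data=request_data)
df_reordered = df_existing.reindex(columns=X_train.columns)

print(list(df_resp.columns))
print(list(df_reordered.columns))
```

Either frame now lists its columns in the same order as `X_train`, which is what the check in the traceback compares.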
Answer 2:
You can use the following generic function to sort the columns correctly:
def rearrange_columns(df, first_order="categorical"):
    """
    ColumnTransformer of scikit-learn Pipeline changes the order of the dataframe columns.
    Use this function to reorder the feature columns to be consistent with the output of the pipeline.
    Returns the list of column names in the new order (not a reordered dataframe).
    """
    # Positions of object-dtype columns, treated here as categorical
    cat_ix = [ii for ii, col in enumerate(df.columns.values) if df[col].dtypes == "object"]
    # Every other column is treated as numeric
    num_ix = [ii for ii, col in enumerate(df.columns.values) if ii not in cat_ix]
    new_order = cat_ix + num_ix if first_order == "categorical" else num_ix + cat_ix
    return [df.columns.values[ii] for ii in new_order]
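Since the function returns a list of column names rather than a reordered frame, the result still has to be used to index the DataFrame. Below is a minimal usage sketch with made-up data. The column names and values are assumptions, and the string column is given an explicit `object` dtype so that the `dtypes == "object"` test in the function applies.

```python
import pandas as pd


def rearrange_columns(df, first_order="categorical"):
    # Same logic as the function above, repeated so this example runs on its own.
    cat_ix = [ii for ii, col in enumerate(df.columns.values) if df[col].dtypes == "object"]
    num_ix = [ii for ii, col in enumerate(df.columns.values) if ii not in cat_ix]
    new_order = cat_ix + num_ix if first_order == "categorical" else num_ix + cat_ix
    return [df.columns.values[ii] for ii in new_order]


# Illustrative frame: two numeric columns and one object-dtype column.
df = pd.DataFrame({
    "age": [31, 45],
    "city": pd.Series(["Oslo", "Lima"], dtype=object),
    "income": [52000.0, 61000.0],
})

# Numeric columns first, matching a ColumnTransformer that lists 'num' before 'cat'.
numeric_first = rearrange_columns(df, first_order="numeric")
df_sorted = df[numeric_first]

# Default behaviour: categorical columns first.
categorical_first = rearrange_columns(df)

print(list(df_sorted.columns))
print(list(categorical_first))
```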
Source: https://stackoverflow.com/questions/61001934/python-valueerror-columntransformer-column-ordering-is-not-equal