Shape mismatch: if categories is an array, it has to be of shape (n_features,)

Question

Here is the code I'm trying to execute to encode the values of the first column of my data set using dummy values.

import numpy as py
import matplotlib.pyplot as plt
import pandas as pd
 

DataSet = pd.read_csv('Data.csv')
x=DataSet.iloc[:, :-1].values
y=DataSet.iloc[:,3].values

from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=py.nan,strategy='mean')
imputer=imputer.fit(x[:, 1:3])
x[:, 1:3]=imputer.transform(x[:, 1:3])


from sklearn.preprocessing import OneHotEncoder
onehotencoder=OneHotEncoder(categories=[0])
x=onehotencoder.fit_transform(x).toarray()

Here's the data I'm working on

France  44.0    72000.0
Spain   27.0    48000.0
Germany 30.0    54000.0
Spain   38.0    61000.0
Germany 40.0    63777.7
France  35.0    58000.0
Spain   38.777  52000.0
France  48.0    79000.0
Germany 50.0    83000.0
France  37.0    67000.0

I'm getting an error stating

Shape mismatch: if categories is an array, it has to be of shape (n_features,).

Can anyone help me fix this?

Answer 1:

Your second column doesn't seem to be a categorical feature. You should only one-hot encode features that can take a finite number of discrete values, like the first column, which can only take a limited number of values ('Spain', 'Germany', 'France'). Note that categories expects one list of values per encoded column, not a column index. If you only encode the first column, you can do:

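import numpy as np
from sklearn.preprocessing import OneHotEncoder
# categories takes one list of values per column being encoded
onehotencoder=OneHotEncoder(categories=[['France','Germany','Spain']])
# encode only the first column, reshaped to (n_samples, 1)
x_1=onehotencoder.fit_transform(x[:,0].reshape(-1, 1)).toarray()
x = np.concatenate([x_1,x[:,1:]], axis=1)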

and then your data will be in the form:

France Germany Spain score
1      0       0     44.0
0      0       1     27.0
...
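
As an alternative to concatenating by hand, the same result can probably be reached with scikit-learn's ColumnTransformer, which encodes only the chosen column and passes the others through. This is a minimal sketch, not the asker's code: the small array below is a hypothetical stand-in for the first three rows of x, and the name "country" is only a label chosen for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the first rows of x (object dtype, like DataSet.values).
x = np.array(
    [
        ["France", 44.0, 72000.0],
        ["Spain", 27.0, 48000.0],
        ["Germany", 30.0, 54000.0],
    ],
    dtype=object,
)

# categories holds one inner list per encoded column; only column 0 is encoded here.
encoder = OneHotEncoder(categories=[["France", "Germany", "Spain"]])
transformer = ColumnTransformer(
    [("country", encoder, [0])],  # "country" is an arbitrary step name
    remainder="passthrough",      # keep the numeric columns unchanged
    sparse_threshold=0,           # always return a dense array
)
x_encoded = transformer.fit_transform(x)
print(x_encoded)
```

The encoded columns come first (France, Germany, Spain), followed by the untouched numeric columns.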

Also, you only have 3 columns in your data, but you're selecting a fourth column with y=DataSet.iloc[:,3].values. Column positions start at 0, so .iloc[:,3] refers to the 4th column.
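
To illustrate the indexing point, here is a small sketch with a three-column frame. The column names (Country, Age, Salary) are assumptions made for the example, since the posted data has no header; the values are the first three rows from the question.

```python
import pandas as pd

# Hypothetical column names; the values are the first rows of the posted data.
df = pd.DataFrame(
    {
        "Country": ["France", "Spain", "Germany"],
        "Age": [44.0, 27.0, 30.0],
        "Salary": [72000.0, 48000.0, 54000.0],
    }
)

# Three columns means the valid positions are 0, 1 and 2.
n_columns = df.shape[1]

# Position -1 always selects the last column, whatever the width of the frame.
y = df.iloc[:, -1].values

# Position 3 would be a fourth column, which does not exist here.
try:
    df.iloc[:, 3]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(n_columns, y, out_of_bounds)
```

Using .iloc[:, -1] for the target avoids hard-coding a position that depends on how many columns the file has.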

Source: https://stackoverflow.com/questions/62633492/shape-mismatch-if-categories-is-an-array-it-has-to-be-of-shape-n-features

Tags

python

machine-learning

Anaconda

data-science