Question
Here is the code I'm trying to execute to encode the values of the first column of my data set using dummy values.
import numpy as py
import matplotlib.pyplot as plt
import pandas as pd
DataSet = pd.read_csv('Data.csv')
x=DataSet.iloc[:, :-1].values
y=DataSet.iloc[:,3].values
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=py.nan,strategy='mean')
imputer=imputer.fit(x[:, 1:3])
x[:, 1:3]=imputer.transform(x[:, 1:3])
from sklearn.preprocessing import OneHotEncoder
onehotencoder=OneHotEncoder(categories=[0])
x=onehotencoder.fit_transform(x).toarray()
Here's the data I'm working on:
France 44.0 72000.0
Spain 27.0 48000.0
Germany 30.0 54000.0
Spain 38.0 61000.0
Germany 40.0 63777.7
France 35.0 58000.0
Spain 38.777 52000.0
France 48.0 79000.0
Germany 50.0 83000.0
France 37.0 67000.0
I'm getting an error stating:
Shape mismatch: if categories is an array, it has to be of shape (n_features,).
Can anyone help me fix this?
Answer 1:
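Your second column doesn't seem to be a categorical feature. You should only one-hot encode features that can take a finite number of discrete values, like the first column, which can only take a limited number of values ('Spain', 'Germany', 'France'). The categories argument needs one list of values per column you pass to the encoder, which is why passing all three columns with a single entry raises the shape mismatch. If you only encode the first column, you can do: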
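import numpy as np
from sklearn.preprocessing import OneHotEncoder
# One category list per encoded column; only the first column is encoded here
onehotencoder=OneHotEncoder(categories=[['France','Germany','Spain']])
# The encoder expects 2-D input, so reshape the single column to (n_samples, 1)
x_1=onehotencoder.fit_transform(x[:,0].reshape(-1, 1)).toarray()
# Put the three dummy columns in front of the remaining numeric columns
x = np.concatenate([x_1,x[:,1:]], axis=1)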
and then your data will be in the form:
France Germany Spain score
1 0 0 44.0
0 0 1 27.0
...
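To see both the error and the fix in one place, here is a minimal sketch that uses three rows copied from the question instead of reading Data.csv. The variable names (countries, x_country, x_encoded, mismatch_raised) are only illustrative, and I'm assuming a reasonably recent scikit-learn where a length mismatch in categories is reported as a ValueError.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Three rows copied from the question: country plus two numeric columns.
x = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

# One inner list describes the categories of exactly one column.
countries = [["France", "Germany", "Spain"]]

# Fitting on all three columns with a single category list is expected
# to fail, because categories must have one entry per input column.
try:
    OneHotEncoder(categories=countries).fit(x)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Fitting on the first column only (kept 2-D via x[:, [0]]) matches the
# single category list, so the encoder accepts it.
encoder = OneHotEncoder(categories=countries)
x_country = encoder.fit_transform(x[:, [0]]).toarray()

# Prepend the dummy columns to the untouched numeric columns.
x_encoded = np.concatenate([x_country, x[:, 1:]], axis=1)
print(mismatch_raised)
print(x_encoded)
```

Indexing with x[:, [0]] does the same job as x[:,0].reshape(-1, 1): both hand the encoder a 2-D array with a single column.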
Also, the data you show has only 3 columns, but you're selecting a fourth column with y=DataSet.iloc[:,3].values (column positions start at 0, so .iloc[:,3] refers to the 4th column).
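The positional indexing point can be checked with a small sketch. The frame below is hypothetical: it only mirrors the three-column layout shown in the question, and the column names (country, col_1, col_2) are made up since the real header of Data.csv isn't shown.

```python
import pandas as pd

# Hypothetical three-column frame; names are invented for illustration.
df = pd.DataFrame({
    "country": ["France", "Spain", "Germany"],
    "col_1": [44.0, 27.0, 30.0],
    "col_2": [72000.0, 48000.0, 54000.0],
})

# Valid positions are 0, 1 and 2; -1 selects the last column regardless
# of how many columns the frame has.
last_column = df.iloc[:, -1].values

# Position 3 would be a fourth column, which this frame does not have,
# so pandas raises an IndexError.
try:
    df.iloc[:, 3]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(last_column, out_of_bounds)
```

If your actual file does have a fourth column that isn't shown in the question, .iloc[:,3] is fine as it is.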
Source: https://stackoverflow.com/questions/62633492/shape-mismatch-if-categories-is-an-array-it-has-to-be-of-shape-n-features