Question
I have just started learning machine learning. While practicing one of the tasks, I get a ValueError even though I followed the same steps as the instructor. Please help.
dff
Country Name
0 AUS Sri
1 USA Vignesh
2 IND Pechi
3 USA Raj
First, I performed label encoding:
X=dff.values
label_encoder=LabelEncoder()
X[:,0]=label_encoder.fit_transform(X[:,0])
out:
X
array([[0, 'Sri'],
[2, 'Vignesh'],
[1, 'Pechi'],
[2, 'Raj']], dtype=object)
Then I performed one-hot encoding on the same X:
onehotencoder=OneHotEncoder( categorical_features=[0])
X=onehotencoder.fit_transform(X).toarray()
I am getting the below error:
ValueError Traceback (most recent call last)
<ipython-input-472-be8c3472db63> in <module>()
----> 1 X=onehotencoder.fit_transform(X).toarray()
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in fit_transform(self, X, y)
1900 """
1901 return _transform_selected(X, self._fit_transform,
-> 1902 self.categorical_features, copy=True)
1903
1904 def _transform(self, X):
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in _transform_selected(X, transform, selected, copy)
1695 X : array or sparse matrix, shape=(n_samples, n_features_new)
1696 """
-> 1697 X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
1698
1699 if isinstance(selected, six.string_types) and selected == "all":
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
380 force_all_finite)
381 else:
--> 382 array = np.array(array, dtype=dtype, order=order, copy=copy)
383
384 if ensure_2d:
ValueError: could not convert string to float: 'Raj'
Please edit my question if anything is wrong. Thanks in advance!
Answer 1:
You can now apply OneHotEncoder directly to string columns without running LabelEncoder first. As scikit-learn moves toward version 0.22, many will want to do it this way to avoid warnings and potential errors (see the OneHotEncoder docs and examples).
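For context, the traceback in the question shows check_array converting the whole array with np.array(array, dtype=dtype), where dtype is a float type, so the untouched Name column appears to be what triggers the error. The sketch below reproduces that conversion with NumPy alone; the array literal is an assumption that mirrors the X printed in the question after label encoding.

```python
import numpy as np

# Mirrors X from the question after label encoding: integer codes plus name strings.
X = np.array([[0, "Sri"], [2, "Vignesh"], [1, "Pechi"], [2, "Raj"]], dtype=object)

try:
    # The same kind of whole-array float conversion that appears in the traceback.
    np.array(X, dtype=float)
    error = None
except ValueError as exc:
    error = exc

print(error)
```

The conversion fails because the name strings cannot be parsed as floats, which is why the examples in this answer either encode every column or pass only the country column to the encoder.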
Example code 1 where ALL columns are encoded and where the categories are explicitly specified:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values
countries = np.unique(X[:,0])
names = np.unique(X[:,1])
ohe = OneHotEncoder(categories=[countries, names])
X = ohe.fit_transform(X).toarray()
print (X)
Output for code example 1:
[[1. 0. 0. 0. 0. 1. 0.]
[0. 0. 1. 0. 0. 0. 1.]
[0. 1. 0. 1. 0. 0. 0.]
[0. 0. 1. 0. 1. 0. 0.]]
Example code 2 showing the 'auto' option for specification of categories:
The first 3 columns encode the country names, the last four the personal names.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values
ohe = OneHotEncoder(categories='auto')
X = ohe.fit_transform(X).toarray()
print (X)
Output for code example 2 (same as for 1):
[[1. 0. 0. 0. 0. 1. 0.]
[0. 0. 1. 0. 0. 0. 1.]
[0. 1. 0. 1. 0. 0. 0.]
[0. 0. 1. 0. 1. 0. 0.]]
Example code 3 where only the first column is one hot encoded:
Now, here's the unique part. What if you only need to One Hot Encode a specific column for your data?
(Note: I've left the last column as strings for easier illustration. In reality it makes more sense to do this WHEN the last column was already numerical).
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values
countries = np.unique(X[:,0])
ohe = OneHotEncoder(categories=[countries]) # specify ONLY unique country names
tmp = ohe.fit_transform(X[:,0].reshape(-1, 1)).toarray()
# append the original name column (not the sorted unique names) so each name stays on its own row
X = np.append(tmp, X[:,1].reshape(-1,1), axis=1)
print (X)
Output for code example 3:
[[1.0 0.0 0.0 'Sri']
[0.0 0.0 1.0 'Vignesh']
[0.0 1.0 0.0 'Pechi']
[0.0 0.0 1.0 'Raj']]
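As an alternative to joining the columns by hand, ColumnTransformer from sklearn.compose can one-hot encode a single column and pass the rest through unchanged. This is a minimal sketch, not part of the original answer, and it assumes scikit-learn 0.20 or later. The transformer name "country" and the choice of sparse_threshold=0 (to force a dense result) are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]
df = pd.DataFrame(data, columns=["Country", "Name"])
X = df.values

# One-hot encode column 0 only. remainder="passthrough" keeps the other
# columns unchanged and appends them after the encoded columns.
# sparse_threshold=0 requests a dense array (a choice made for this sketch).
ct = ColumnTransformer(
    [("country", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = ct.fit_transform(X)
print(X_encoded)
```

The first three columns of X_encoded hold the country indicators (categories in sorted order: AUS, IND, USA) and the last column holds each row's original name.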
Answer 2:
The implementation below should work. Note that the input to OneHotEncoder's fit_transform must not be a rank-1 (1-D) array, so it is reshaped into a single column first. The output is a sparse matrix, which toarray() expands into a dense array.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values
le = LabelEncoder()
X_num = le.fit_transform(X[:,0]).reshape(-1,1)
ohe = OneHotEncoder()
X_num = ohe.fit_transform(X_num)
print (X_num.toarray())
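# X_num is a sparse matrix with one column per category, so it cannot replace the single column X[:,0];
# stack the dense one-hot columns next to the remaining column instead
X = np.hstack([X_num.toarray(), X[:,1:]])
print (X)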
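The two points made in this answer (the 2-D input requirement and the sparse output) can be checked in isolation. This is a small sketch assuming a recent scikit-learn release, where a 1-D input raises a ValueError; the variable names are illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

countries = np.array(["AUS", "USA", "IND", "USA"])  # 1-D array, shape (4,)
ohe = OneHotEncoder()

try:
    ohe.fit_transform(countries)  # 1-D input is rejected
    one_d_error = None
except ValueError as exc:
    one_d_error = exc

# Reshape into a single column, shape (4, 1), before encoding.
encoded = ohe.fit_transform(countries.reshape(-1, 1))
is_sparse = sparse.issparse(encoded)  # the default output is a SciPy sparse matrix
dense = encoded.toarray()             # expand into a regular NumPy array
print(dense)
```

reshape(-1, 1) turns the four values into four rows of one feature each, which is the layout the encoder expects.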
Answer 3:
An alternative, if you want to encode multiple categorical features, is to use a Pipeline with a FeatureUnion and a couple of custom transformers.
First, you need two transformers: one for selecting a single column and one for making LabelEncoder usable in a Pipeline. (LabelEncoder's fit_transform method takes only a single argument; it needs to accept an optional y to work in a Pipeline.)
from sklearn.base import BaseEstimator, TransformerMixin
class SingleColumnSelector(TransformerMixin, BaseEstimator):
    def __init__(self, column):
        self.column = column
    def transform(self, X, y=None):
        return X[:, self.column].reshape(-1, 1)
    def fit(self, X, y=None):
        return self
class PipelineAwareLabelEncoder(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return LabelEncoder().fit_transform(X).reshape(-1, 1)
Next, create a Pipeline (or just a FeatureUnion) with two branches, one for each of the categorical columns. Within each branch, select one column, encode the labels, and then one-hot encode.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
pipeline = Pipeline([(
    'encoded_features',
    FeatureUnion([
        ('countries', make_pipeline(
            SingleColumnSelector(0),
            PipelineAwareLabelEncoder(),
            OneHotEncoder()
        )),
        ('names', make_pipeline(
            SingleColumnSelector(1),
            PipelineAwareLabelEncoder(),
            OneHotEncoder()
        ))
    ])
)])
Finally, run your full dataframe through the Pipeline; it one-hot encodes each column separately and concatenates the results at the end.
df = pd.DataFrame([["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]], columns=['Country', 'Name'])
X = df.values
transformed_X = pipeline.fit_transform(X)
print(transformed_X.toarray())
This returns (the first 3 columns are the countries, the last 4 are the names):
[[ 1. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 1. 0. 0. 0. 1.]
[ 0. 1. 0. 1. 0. 0. 0.]
[ 0. 0. 1. 0. 1. 0. 0.]]
Source: https://stackoverflow.com/questions/47790854/how-to-perform-onehotencoding-in-sklearn-getting-value-error