Given is a simple CSV file:
A,B,C
Hello,Hi,0
Hola,Bueno,1
Obviously the real dataset is far more complex than this, but this one reproduces
Well, there are important differences between how OneHot Encoding and Label Encoding work :
int
. In this case, the 1st class found will be coded as 1
, the 2nd as 2
, ...
But this encoding creates an issue.Let's take the example of a variable Animal = ["Dog", "Cat", "Turtle"]
.
If you use Label Encoder on it, Animal
will be [1, 2, 3]
. If you parse it to your machine learning model, it will interpret Dog
is closer than Cat
, and farther than Turtle
(because distance between 1
and 2
is lower than distance between 1
and 3
).
Label encoding is actually excellent when you have ordinal variable.
For example, if you have a value Age = ["Child", "Teenager", "Young Adult", "Adult", "Old"]
,
then using Label Encoding is perfect. Child
is closer than Teenager
than it is from Young Adult
. You have a natural order on your variables
Let's take back the previous example of Animal = ["Dog", "Cat", "Turtle"]
.
It will create as much variable as classes you encounter. In my example, it will create 3 binary variables : Dog, Cat and Turtle
. Then if you have Animal = "Dog"
, encoding will make it Dog = 1, Cat = 0, Turtle = 0
.
Then you can give this to your model, and he will never interpret that Dog
is closer from Cat
than from Turtle
.
But there are also cons to OneHotEncoding. If you have a categorical variable encountering 50 kind of classes
eg : Dog, Cat, Turtle, Fish, Monkey, ...
then it will create 50 binary variables, which can cause complexity issues. In this case, you can create your own classes and manually change variable
eg : regroup Turtle, Fish, Dolphin, Shark
in a same class called Sea Animals
and then appy a OneHotEncoding.
I had a similar issue and found that pandas.get_dummies() solved the problem. Specifically, it splits out columns of categorical data into sets of boolean columns, one new column for each unique value in each input column. In your case, you would replace train_x = test[cols]
with:
train_x = pandas.get_dummies(test[cols])
This transforms the train_x Dataframe into the following form, which RandomForestClassifier can accept:
C A_Hello A_Hola B_Bueno B_Hi
0 0 1 0 0 1
1 1 0 1 1 0
Indeed a one-hot encoder will work just fine here, convert any string and numerical categorical variables you want into 1's and 0's this way and random forest should not complain.
You may not pass str
to fit this kind of classifier.
For example, if you have a feature column named 'grade' which has 3 different grades:
A,B and C.
you have to transfer those str
"A","B","C" to matrix by encoder like the following:
A = [1,0,0]
B = [0,1,0]
C = [0,0,1]
because the str
does not have numerical meaning for the classifier.
In scikit-learn, OneHotEncoder
and LabelEncoder
are available in inpreprocessing
module.
However OneHotEncoder
does not support to fit_transform()
of string.
"ValueError: could not convert string to float" may happen during transform.
You may use LabelEncoder
to transfer from str
to continuous numerical values. Then you are able to transfer by OneHotEncoder
as you wish.
In the Pandas dataframe, I have to encode all the data which are categorized to dtype:object
. The following code works for me and I hope this will help you.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for column_name in train_data.columns:
if train_data[column_name].dtype == object:
train_data[column_name] = le.fit_transform(train_data[column_name])
else:
pass
As your input is in string you are getting value error message use countvectorizer it will convert data set in to sparse matrix and train your ml algorithm you will get the result
You have to do some encoding before using fit. As it was told fit() does not accept Strings but you solve this.
There are several classes that can be used :
Personally I have post almost the same question on StackOverflow some time ago. I wanted to have a scalable solution but didn't get any answer. I selected OneHotEncoder that binarize all the strings. It is quite effective but if you have a lot different strings the matrix will grow very quickly and memory will be required.