To discretize categorical features I'm using a LabelEncoder and OneHotEncoder. I know that LabelEncoder maps data alphabetically, but how does OneHotEncoder map data?
One hot encoding means that you create vectors of ones and zeros, so no ordering is implied among the categories.
In sklearn, you first need to encode the categorical data to numerical data and then feed it to the OneHotEncoder, for example:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import numpy as np

S = np.array(['b', 'a', 'c'])
le = LabelEncoder()
S = le.fit_transform(S)        # map each label to an integer (alphabetical order)
print(S)
ohe = OneHotEncoder()
one_hot = ohe.fit_transform(S.reshape(-1, 1)).toarray()  # one column per integer label
print(one_hot)
which results in:
[1 0 2]
[[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]]
But pandas directly converts the categorical data:
import pandas as pd
S = pd.Series( {'A': ['b', 'a', 'c']})
print(S)
one_hot = pd.get_dummies(S['A'])
print(one_hot)
which outputs:
A [b, a, c]
dtype: object
a b c
0 0 1 0
1 1 0 0
2 0 0 1
As you can see from the mapping, a vector is created for each unique category. The elements of the vector are one at the locations where that category appears and zero everywhere else. Here is an example where there are only two unique categories in the series:
S = pd.Series( {'A': ['a', 'a', 'c']})
print(S)
one_hot = pd.get_dummies(S['A'])
print(one_hot)
results in:
A [a, a, c]
dtype: object
a c
0 1 0
1 1 0
2 0 1
EDITS TO ANSWER THE NEW QUESTION
Let's start with this question: why do we perform one hot encoding? If you encode categorical data like ['a','b','c'] to integers [1,2,3] (e.g. with LabelEncoder), in addition to encoding your categorical data you would also give them weights, since 1 < 2 < 3. This way of encoding is fine for some machine learning techniques, like RandomForest. But many machine learning techniques would assume 'a' < 'b' < 'c' if you encode them with 1, 2, 3 respectively. To avoid this issue, you can create a column for each unique category in your data. In other words, you create a new feature for each unique category (here one column for 'a', one for 'b' and one for 'c'). The values in these new columns are one in the rows where that category appears and zero everywhere else.
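To make this concrete, here is a minimal sketch (plain numpy, not taken from the question) showing that integer codes impose unequal distances between categories, while one-hot vectors keep every pair of categories equidistant:

import numpy as np

codes = np.array([0, 1, 2])   # LabelEncoder-style integer codes for 'a', 'b', 'c'
one_hot = np.eye(3)           # one-hot vectors for 'a', 'b', 'c'

# With integer codes, 'a' and 'c' look twice as far apart as 'a' and 'b':
print(abs(codes[0] - codes[1]), abs(codes[0] - codes[2]))   # 1 2

# With one-hot vectors, every pair of categories is equally far apart:
print(np.linalg.norm(one_hot[0] - one_hot[1]),
      np.linalg.norm(one_hot[0] - one_hot[2]))              # 1.414... 1.414...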
For the array in your example, the one hot encoder would be:
features -> A B C D
[[ 1. 0. 0. 0.]
[ 0. 1. 0. 0.]
[ 0. 0. 1. 0.]
[ 0. 0. 0. 1.]]
You have 4 unique categories: "A", "B", "C", "D". Therefore, OneHotEncoder expands your (4,) array into a (4, 4) matrix, with one vector (or column) per unique category (these become your new features). Since "A" is the 0th element of your array, index 0 of the first column is set to 1 and the rest are set to 0. Similarly, the second vector (column) belongs to feature "B", and since "B" is at index 1 of your array, index 1 of the "B" vector is set to 1 and the rest are set to zero. The same applies to the remaining features.
Let me change your array; maybe it will help you better understand how the label encoding works:
S = np.array(['D', 'B', 'C', 'A'])
le = LabelEncoder()
S = le.fit_transform(S)        # 'A'->0, 'B'->1, 'C'->2, 'D'->3, so S becomes [3 1 2 0]
enc = OneHotEncoder()
encModel = enc.fit_transform(S.reshape(-1, 1)).toarray()
print(encModel)
Now the result is the following. Here the first column corresponds to 'A', and since 'A' is the last element of your array (index 3), the last element of the first column is 1.
features -> A B C D
[[ 0. 0. 0. 1.]
[ 0. 1. 0. 0.]
[ 0. 0. 1. 0.]
[ 1. 0. 0. 0.]]
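If you want to verify this ordering yourself, the fitted LabelEncoder stores it in its standard classes_ attribute; column i of the one-hot matrix corresponds to classes_[i]:

print(S)            # [3 1 2 0]
print(le.classes_)  # ['A' 'B' 'C' 'D'] -- column i of the one-hot matrix is classes_[i]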
Regarding your pandas dataframe dataFeat, you are wrong even in the first step, about how LabelEncoder works. When you apply LabelEncoder, it fits to one column at a time and encodes it; then it moves to the next column and makes a new fit for that column. Here is what you should get:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.DataFrame({'Feat1': ['A','B','D','C'], 'Feat2': ['B','B','D','C'], 'Feat3': ['A','C','A','A'],
                   'Feat4': ['A','C','A','A'], 'Feat5': ['A','C','B','A']})
print('my data frame:')
print(df)
le = LabelEncoder()
intIndexed = df.apply(le.fit_transform)   # fit and transform each column independently
print('Encoded data frame')
print(intIndexed)
results:
my data frame:
Feat1 Feat2 Feat3 Feat4 Feat5
0 A B A A A
1 B B C C C
2 D D A A B
3 C C A A A
Encoded data frame
Feat1 Feat2 Feat3 Feat4 Feat5
0 0 0 0 0 0
1 1 0 1 1 2
2 3 2 0 0 1
3 2 1 0 0 0
Note that in the first column, Feat1, 'A' is encoded to 0, but in the second column, Feat2, the element 'B' is encoded to 0. This happens because LabelEncoder fits to each column and transforms it separately. Note that in your second column, among ('B', 'C', 'D'), the variable 'B' is alphabetically first.
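You can check this per-column behaviour directly by fitting a fresh LabelEncoder on Feat2 alone and inspecting its classes_ attribute:

le2 = LabelEncoder()
print(le2.fit_transform(df['Feat2']))  # [0 0 2 1]
print(le2.classes_)                    # ['B' 'C' 'D'] -- 'B' is encoded as 0 here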
And finally, here is what you are looking for with sklearn:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
label_encoder = LabelEncoder()
data_label_encoded = df.apply(label_encoder.fit_transform).values  # .values replaces the removed as_matrix()
data_feature_onehot = encoder.fit_transform(data_label_encoded).toarray()
print(data_feature_onehot)
which gives you:
[[ 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0. 0.]
[ 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1.]
[ 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 1. 0.]
[ 0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0.]]
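If you want to see how these 14 columns split across the five features, the fitted encoder exposes them; a small sketch, assuming scikit-learn 0.20+ where OneHotEncoder stores a categories_ attribute:

print(encoder.categories_)
# [array([0, 1, 2, 3]), array([0, 1, 2]), array([0, 1]), array([0, 1]), array([0, 1, 2])]
# 4 + 3 + 2 + 2 + 3 = 14 columns in total, one block per original feature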
If you use pandas, you can compare the results; hopefully this gives you a better intuition:
encoded = pd.get_dummies(df)
print(encoded)
result:
Feat1_A Feat1_B Feat1_C Feat1_D Feat2_B Feat2_C Feat2_D Feat3_A \
0 1 0 0 0 1 0 0 1
1 0 1 0 0 1 0 0 0
2 0 0 0 1 0 0 1 1
3 0 0 1 0 0 1 0 1
Feat3_C Feat4_A Feat4_C Feat5_A Feat5_B Feat5_C
0 0 1 0 1 0 0
1 1 0 1 0 0 1
2 0 1 0 0 1 0
3 0 1 0 1 0 0
which is exactly the same!
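As a closing note: in newer versions of scikit-learn (0.20 and later), OneHotEncoder accepts string columns directly, so the LabelEncoder detour above is no longer necessary. A minimal sketch, assuming scikit-learn 1.0+ for get_feature_names_out:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
one_hot = enc.fit_transform(df).toarray()   # df may contain strings directly
print(enc.get_feature_names_out())          # ['Feat1_A' 'Feat1_B' ... 'Feat5_C']
print(one_hot)                              # the same matrix as pd.get_dummies(df)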