OneHotEncoding Mapping

渐次进展 2021-01-07 06:12

To discretize categorical features I'm using a LabelEncoder and OneHotEncoder. I know that LabelEncoder maps data alphabetically, but how does OneHotEncoder map data?

1 Answer
  • 2021-01-07 06:57

    One-hot encoding turns each category into a vector of ones and zeros, so no ordering is implied between the categories; the columns simply follow the sorted order of the category values. In the sklearn workflow used here, you first encode the categorical data as integers and then feed them to the OneHotEncoder, for example:

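    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder
    
    S = np.array(['b','a','c'])
    le = LabelEncoder()
    S = le.fit_transform(S)  # sorted classes: 'a' -> 0, 'b' -> 1, 'c' -> 2
    print(S)
    ohe = OneHotEncoder()
    # OneHotEncoder expects a 2-D array of shape (n_samples, n_features)
    one_hot = ohe.fit_transform(S.reshape(-1,1)).toarray()
    print(one_hot)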
    

    which results in:

    [1 0 2]
    
    [[ 0.  1.  0.]
     [ 1.  0.  0.]
     [ 0.  0.  1.]]
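
    As a side note, and assuming scikit-learn 0.20 or later (an assumption, since the question does not state a version), OneHotEncoder accepts string categories directly, and its fitted categories_ attribute shows which category each output column stands for. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same toy data as above, shaped as one feature column
S = np.array(['b', 'a', 'c']).reshape(-1, 1)

ohe = OneHotEncoder()
one_hot = ohe.fit_transform(S).toarray()

# categories_ holds one array per input feature; column i of the
# output corresponds to categories_[0][i] (sorted category values)
categories = ohe.categories_[0]
print(categories)
print(one_hot)
```

    Column 0 corresponds to 'a', column 1 to 'b' and column 2 to 'c', which matches the result of the two-step LabelEncoder approach.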
    

    pandas, by contrast, converts the categorical data directly:

    import pandas as pd
    S = pd.Series( {'A': ['b', 'a', 'c']})
    print(S)
    one_hot = pd.get_dummies(S['A'])
    print(one_hot)
    

    which outputs:

    A    [b, a, c]
    dtype: object
    
       a  b  c
    0  0  1  0
    1  1  0  0
    2  0  0  1
    

    As you can see, the mapping creates one vector (column) for each category. Each vector is 1 at the positions where that category occurs and 0 everywhere else. Here is an example where the series contains only two distinct categories:

    S = pd.Series( {'A': ['a', 'a', 'c']})
    print(S)
    one_hot = pd.get_dummies(S['A'])
    print(one_hot)
    

    results in:

    A    [a, a, c]
    dtype: object
    
       a  c
    0  1  0
    1  1  0
    2  0  1
    

    EDITS TO ANSWER THE NEW QUESTION

    Let's start with this question: why do we perform one-hot encoding? If you encode categorical data like ['a','b','c'] as integers [0,1,2] (e.g. with LabelEncoder), then in addition to encoding the categories you also impose an order on them, since 0 < 1 < 2. This is fine for some machine learning techniques, such as RandomForest, but many others would treat the codes as magnitudes and assume 'a' < 'b' < 'c'. To avoid this, you create a column for each unique category in your data. In other words, each category becomes a new feature (here one column for 'a', one for 'b' and one for 'c'). A value in one of these columns is 1 if the sample in that row has that category and 0 otherwise.
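
    To make the ordering issue concrete, here is a small NumPy-only sketch. It is an illustration rather than how sklearn works internally: np.unique stands in for LabelEncoder, and the distance variables are names I made up for the comparison.

```python
import numpy as np

labels = np.array(['a', 'b', 'c'])

# Integer codes over the sorted categories, similar in spirit to LabelEncoder
categories, codes = np.unique(labels, return_inverse=True)

# One-hot rows: pick row `code` of the identity matrix for every sample
one_hot = np.eye(len(categories))[codes]

# With integer codes, 'a' looks closer to 'b' than to 'c'
int_dist_ab = abs(int(codes[0]) - int(codes[1]))
int_dist_ac = abs(int(codes[0]) - int(codes[2]))

# With one-hot vectors, every pair of distinct categories is equally far apart
oh_dist_ab = np.linalg.norm(one_hot[0] - one_hot[1])
oh_dist_ac = np.linalg.norm(one_hot[0] - one_hot[2])

print(int_dist_ab, int_dist_ac)
print(oh_dist_ab, oh_dist_ac)
```

    A model that relies on distances or magnitudes sees an artificial structure in the integer codes, while the one-hot rows treat all categories symmetrically.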

    For the array in your example, the one-hot encoding would be:

    features ->  A   B   C   D 
    
              [[ 1.  0.  0.  0.]
               [ 0.  1.  0.  0.]
               [ 0.  0.  1.  0.]
               [ 0.  0.  0.  1.]]
    

    You have 4 categories: "A", "B", "C" and "D". Therefore, OneHotEncoder expands your (4,) array to (4,4), with one vector (column) for each category; these columns are your new features. Since "A" is element 0 of your array, index 0 of the first column is set to 1 and the rest are set to 0. Similarly, the second column belongs to "B", and since "B" is at index 1 of your array, index 1 of the "B" column is set to 1 and the rest are set to 0. The same applies to the remaining features.

    Let me change your array; it may help you better understand how the label encoder works:

    S = np.array(['D', 'B','C','A'])
    S = le.fit_transform(S)
    enc = OneHotEncoder()
    encModel = enc.fit_transform(S.reshape(-1,1)).toarray()
    print(encModel)
    

    Now the result is the following. Here the first column is 'A', and since it was the last element of your array (index = 3), the last element of the first column is 1.

    features ->  A   B   C   D
              [[ 0.  0.  0.  1.]
               [ 0.  1.  0.  0.]
               [ 0.  0.  1.  0.]
               [ 1.  0.  0.  0.]]
    

    Regarding your pandas dataframe, dataFeat, your understanding of how LabelEncoder works is wrong even at the first step. When you apply LabelEncoder, it fits to one column at a time and encodes it; then it moves on to the next column and makes a new fit to that column. Here is what you should get:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    df =  pd.DataFrame({'Feat1': ['A','B','D','C'],'Feat2':['B','B','D','C'],'Feat3':['A','C','A','A'],
                        'Feat4':['A','C','A','A'],'Feat5':['A','C','B','A']})
    print('my data frame:')
    print(df)
    
    le = LabelEncoder()
    intIndexed = df.apply(le.fit_transform)
    print('Encoded data frame')
    print(intIndexed)
    

    results:

    my data frame:
      Feat1 Feat2 Feat3 Feat4 Feat5
    0     A     B     A     A     A
    1     B     B     C     C     C
    2     D     D     A     A     B
    3     C     C     A     A     A
    
    Encoded data frame
       Feat1  Feat2  Feat3  Feat4  Feat5
    0      0      0      0      0      0
    1      1      0      1      1      2
    2      3      2      0      0      1
    3      2      1      0      0      0
    

    Note that in the first column, Feat1, 'A' is encoded as 0, but in the second column, Feat2, 'B' is encoded as 0. This happens because LabelEncoder is fit to each column and transforms it separately. In your second column, 'B' comes first alphabetically among ('B', 'C', 'D').
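
    If you want to inspect the per-column mapping yourself, one option is to keep a separate encoder for each column and look at its classes_ attribute. This is only a sketch: the encoders and mapping dictionaries are hypothetical helpers of mine, and I use just the first two columns to keep it short.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Feat1': ['A', 'B', 'D', 'C'],
                   'Feat2': ['B', 'B', 'D', 'C']})

encoders = {}                          # one fitted LabelEncoder per column
encoded = pd.DataFrame(index=df.index)
for col in df.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(df[col])
    encoders[col] = enc

# classes_[i] is the original value that was mapped to the integer i
mapping = {col: list(enc.classes_) for col, enc in encoders.items()}
print(mapping)
print(encoded)
```

    For Feat1 the classes are A, B, C, D, while for Feat2 they are only B, C, D, so the integer 0 means 'A' in one column and 'B' in the other.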

    And finally, here is what you are looking for with sklearn:

    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder
    
    encoder = OneHotEncoder()
    label_encoder = LabelEncoder()
    # DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    data_label_encoded = df.apply(label_encoder.fit_transform).to_numpy()
    data_feature_onehot = encoder.fit_transform(data_label_encoded).toarray()
    print(data_feature_onehot)
    

    which gives you:

    [[ 1.  0.  0.  0.  1.  0.  0.  1.  0.  1.  0.  1.  0.  0.]
     [ 0.  1.  0.  0.  1.  0.  0.  0.  1.  0.  1.  0.  0.  1.]
     [ 0.  0.  0.  1.  0.  0.  1.  1.  0.  1.  0.  0.  1.  0.]
     [ 0.  0.  1.  0.  0.  1.  0.  1.  0.  1.  0.  1.  0.  0.]]
    

    If you use pandas, you can compare the results, which hopefully gives you a better intuition:

    encoded = pd.get_dummies(df)
    print(encoded)
    

    result:

         Feat1_A  Feat1_B  Feat1_C  Feat1_D  Feat2_B  Feat2_C  Feat2_D  Feat3_A  \
    0        1        0        0        0        1        0        0        1   
    1        0        1        0        0        1        0        0        0   
    2        0        0        0        1        0        0        1        1   
    3        0        0        1        0        0        1        0        1   
    
         Feat3_C  Feat4_A  Feat4_C  Feat5_A  Feat5_B  Feat5_C  
    0        0        1        0        1        0        0  
    1        1        0        1        0        0        1  
    2        0        1        0        0        1        0  
    3        0        1        0        1        0        0  
    

    which is exactly the same!
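
    You can also check the equality programmatically. The sketch below assumes scikit-learn 0.20 or later, so that OneHotEncoder can take the string DataFrame directly, and it passes dtype=float to get_dummies so that both results are numeric; neither assumption comes from the original question.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Feat1': ['A', 'B', 'D', 'C'],
                   'Feat2': ['B', 'B', 'D', 'C'],
                   'Feat3': ['A', 'C', 'A', 'A'],
                   'Feat4': ['A', 'C', 'A', 'A'],
                   'Feat5': ['A', 'C', 'B', 'A']})

# sklearn: one block of columns per feature, categories sorted within each block
sk_onehot = OneHotEncoder().fit_transform(df).toarray()

# pandas: same layout, with column names of the form <feature>_<category>
pd_onehot = pd.get_dummies(df, dtype=float)

same = np.array_equal(sk_onehot, pd_onehot.to_numpy())
print(sk_onehot.shape, same)
```

    Both approaches produce a (4, 14) matrix here: 4 + 3 + 2 + 2 + 3 columns for the five features.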
