One hot encoding of string categorical features

front-end · unresolved · 3 answers · 789 views
盖世英雄少女心 · 2020-12-04 17:51

I'm trying to perform a one-hot encoding of a trivial dataset.

data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green']]

What's the best way to one-hot encode these string features?

3 Answers
  • 2020-12-04 18:32

    If you are on sklearn>0.20.dev0

    In [11]: import numpy as np
        ...: from sklearn.preprocessing import OneHotEncoder
        ...: cat = OneHotEncoder()
        ...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
        ...: cat.fit_transform(X).toarray()
        ...: 
    Out[11]: array([[1., 0., 0., 1., 0.],
                    [0., 1., 0., 0., 1.],
                    [1., 0., 0., 1., 0.],
                    [0., 0., 1., 0., 1.]])
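The same release-line `OneHotEncoder` handles the question's own dataset directly. A minimal sketch (within each input column, the output categories are ordered alphabetically):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = np.array([['a', 'dog', 'red'],
                 ['b', 'cat', 'green']])

enc = OneHotEncoder()
one_hot = enc.fit_transform(data).toarray()  # 2 categories per column -> 6 output columns
print(one_hot)
# [[1. 0. 0. 1. 0. 1.]
#  [0. 1. 1. 0. 1. 0.]]
```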
    

    If you are on sklearn==0.20.dev0

    In [30]: import numpy as np
        ...: from sklearn.preprocessing import CategoricalEncoder  # existed only in the 0.20 dev builds
        ...: cat = CategoricalEncoder()
    
    In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
    
    In [32]: cat.fit_transform(X).toarray()
    Out[32]:
    array([[1., 0., 0., 1., 0.],
           [0., 1., 0., 0., 1.],
           [1., 0., 0., 1., 0.],
           [0., 0., 1., 0., 1.]])
    

    Another way to do it is to use category_encoders.

    Here is an example:

    % pip install category_encoders

    import numpy as np
    import category_encoders as ce

    le = ce.OneHotEncoder(return_df=False, impute_missing=False, handle_unknown="ignore")
    X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
    le.fit_transform(X)
    array([[1, 0, 1, 0, 1, 0],
           [0, 1, 0, 1, 0, 1]])
    
  • 2020-12-04 18:51

    Very nice question.

    However, in some sense it is a special case of something that comes up (at least for me) rather often: given sklearn stages applicable to subsets of the columns of X, I'd like to apply them (possibly several) across the entire matrix. Here, for example, you have a stage that knows how to run on a single column, and you'd like to apply it three times, once per column.

    This is a classic case for using the Composite Design Pattern.

    Here is a (sketch of a) reusable stage that accepts a dictionary mapping a column index into the transformation to apply to it:

    class ColumnApplier(object):
        """Applies a per-column transformation, keyed by column index."""
        def __init__(self, column_stages):
            self._column_stages = column_stages

        def fit(self, X, y=None):
            # Fit each stage on its own column.
            for i, k in self._column_stages.items():
                k.fit(X[:, i])
            return self

        def transform(self, X):
            # Replace each column with its transformed values.
            X = X.copy()
            for i, k in self._column_stages.items():
                X[:, i] = k.transform(X[:, i])
            return X
    

    Now, to use it in this context, starting with

    import numpy as np
    from sklearn import preprocessing

    X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
    y = np.array([1, 2])
    X
    

    you would just use it to map each column index to the transformation you want:

    multi_encoder = ColumnApplier({i: preprocessing.LabelEncoder() for i in range(3)})
    multi_encoder.fit(X, None).transform(X)
    

    Once you develop such a stage (I can't post the one I use), you can use it over and over for various settings.
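As an aside (not part of the original answer): since sklearn 0.20, `sklearn.compose.ColumnTransformer` fills exactly this composite role, mapping columns to transformers. A minimal sketch on the question's data:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.array([['a', 'dog', 'red'],
              ['b', 'cat', 'green']])

# One named transformer per column group; sparse_threshold=0 forces a dense result.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0, 1, 2])],
                       sparse_threshold=0)
encoded = ct.fit_transform(X)
print(encoded)  # shape (2, 6): two categories per input column
```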

  • 2020-12-04 18:58

    I've faced this problem many times, and I found a solution in this book, on page 100:

    We can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:

    The sample code is:

    from sklearn.preprocessing import LabelBinarizer

    encoder = LabelBinarizer()
    housing_cat_1hot = encoder.fit_transform(data)  # "data" is the categorical column to encode
    housing_cat_1hot
    

    As a result, quoting the book: "Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor."

    You can find more about LabelBinarizer in the official sklearn documentation.
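To illustrate the sparse option mentioned above (note that LabelBinarizer expects a single 1-D column of labels, so a hypothetical three-color column is used here):

```python
from sklearn.preprocessing import LabelBinarizer

colors = ['red', 'green', 'blue', 'red']  # one categorical column
encoder = LabelBinarizer(sparse_output=True)
one_hot_sparse = encoder.fit_transform(colors)  # returns a SciPy sparse matrix
print(encoder.classes_)          # classes are ordered alphabetically
print(one_hot_sparse.toarray())  # densify just for display
```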
