How to one-hot encode variable-length features?

时光取名叫无心 2020-11-27 07:42

Given a list of variable-length features:

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]
2 answers
  • 2020-11-27 08:36

    Here's one approach using NumPy, with the result returned as a pandas DataFrame:

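    import numpy as np
    import pandas as pd

    # Number of features in each sample, and the number of samples
    lens = list(map(len, features))
    N = len(lens)
    # unq: sorted unique feature names; col: column index of every flattened feature
    unq, col = np.unique(np.concatenate(features), return_inverse=True)
    # row: sample index of every flattened feature
    row = np.repeat(np.arange(N), lens)
    # Start from zeros and set each (row, col) pair to 1
    out = np.zeros((N, len(unq)), dtype=int)
    out[row, col] = 1

    # Label rows s1..sN and columns with the feature names
    indx = ['s' + str(i + 1) for i in range(N)]
    df_out = pd.DataFrame(out, columns=unq, index=indx)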
    

    Sample input and output:

    In [80]: features
    Out[80]: [['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2']]
    
    In [81]: df_out
    Out[81]: 
        f1  f2  f3  f4  f5  f6
    s1   1   1   1   0   0   0
    s2   0   1   0   1   1   1
    s3   1   1   0   0   0   0
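To see why the indexed assignment in the approach above fills the right cells, it may help to print the intermediate index arrays. The following is only a minimal sketch on the sample data from the question; the variable name `flat` is introduced here for illustration and is not part of the original answer.

```python
import numpy as np

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

# Flatten all samples into one 1-D array of feature names
flat = np.concatenate(features)

# unq holds the sorted vocabulary; col maps each flattened entry to its column
unq, col = np.unique(flat, return_inverse=True)

# row repeats each sample index once per feature in that sample
lens = [len(f) for f in features]
row = np.repeat(np.arange(len(features)), lens)

# Each (row[i], col[i]) pair marks one cell that should be set to 1
out = np.zeros((len(features), len(unq)), dtype=int)
out[row, col] = 1

print(row)  # sample index per flattened feature
print(col)  # column index per flattened feature
print(out)
```

Because `row` and `col` have the same length as the flattened list, each pair addresses exactly one cell, so no Python-level loop over samples is needed.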
    
  • 2020-11-27 08:41

    You can use MultiLabelBinarizer from scikit-learn, which is designed for exactly this.

    Code for your example:

    from sklearn.preprocessing import MultiLabelBinarizer

    features = [
        ['f1', 'f2', 'f3'],
        ['f2', 'f4', 'f5', 'f6'],
        ['f1', 'f2']
    ]

    # Learn the set of feature names and build the 0/1 indicator matrix
    mlb = MultiLabelBinarizer()
    new_features = mlb.fit_transform(features)
    

    Output:

    array([[1, 1, 1, 0, 0, 0],
           [0, 1, 0, 1, 1, 1],
           [1, 1, 0, 0, 0, 0]])
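The array above has no column labels. As a small sketch of one way to recover them, the fitted binarizer exposes the learned vocabulary in its `classes_` attribute, and `transform` can encode further samples with the same columns. The extra sample `['f1', 'f6']` and the names `encoded`, `df`, and `new_row` are made up here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(features)

# classes_ holds the sorted feature names learned during fit,
# in the same order as the columns of the encoded matrix
df = pd.DataFrame(encoded, columns=mlb.classes_)
print(df)

# Encode a new (hypothetical) sample with the already-fitted vocabulary
new_row = mlb.transform([['f1', 'f6']])
print(new_row)
```

Fitting once and calling `transform` afterwards keeps the column order identical between the original data and any later samples.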
    

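    The result can also be combined with other scikit-learn utilities, such as those in feature_selection. Note that MultiLabelBinarizer is written for label targets, so using it directly as a Pipeline step may require a small wrapper.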
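As one possible illustration of combining the binarized matrix with a feature_selection utility, the sketch below passes it to `VarianceThreshold`, which by default drops columns that have the same value in every sample. The choice of `VarianceThreshold` and the names `selector`, `reduced`, and `kept` are assumptions made for this example, not something the answer specifies.

```python
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

# Step 1: one-hot encode the variable-length feature lists
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(features)

# Step 2: drop zero-variance columns (here 'f2', which appears in every sample)
selector = VarianceThreshold()
reduced = selector.fit_transform(encoded)

# get_support() gives a boolean mask over the original columns
kept = list(mlb.classes_[selector.get_support()])
print(kept)
print(reduced)
```

Here the two steps are called one after the other rather than chained in a Pipeline, which sidesteps the question of how the binarizer's label-oriented interface fits into a pipeline step.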
