Create dummies from column with multiple values in pandas

说谎 2020-12-04 10:35

I am looking for a Pythonic way to handle the following problem.

The pandas.get_dummies() method is great to create dummies from a categorical column

4 Answers
  • 2020-12-04 11:17

    After coming across the MultiLabelBinarizer from sklearn, I believe this question needs an updated answer.

    The usage of this is as simple as...

    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer
    
    # Instantiate the binarizer
    mlb = MultiLabelBinarizer()
    
    # Using OP's original data frame
    df = pd.DataFrame(data=['A', 'B', 'C', 'D', 'A*C', 'C*D'], columns=["label"])
    
    print(df)
      label
    0     A
    1     B
    2     C
    3     D
    4   A*C
    5   C*D
    
    # Convert to a list of labels
    df = df.apply(lambda x: x["label"].split("*"), axis=1)
    
    print(df)
    0       [A]
    1       [B]
    2       [C]
    3       [D]
    4    [A, C]
    5    [C, D]
    dtype: object
    
    # Transform to a binary array
    array_out = mlb.fit_transform(df)
    
    print(array_out)
    [[1 0 0 0]
     [0 1 0 0]
     [0 0 1 0]
     [0 0 0 1]
     [1 0 1 0]
     [0 0 1 1]]
    
    # Convert back to a dataframe (unnecessary step in many cases)
    df_out = pd.DataFrame(data=array_out, columns=mlb.classes_)
    
    print(df_out)
       A  B  C  D
    0  1  0  0  0
    1  0  1  0  0
    2  0  0  1  0
    3  0  0  0  1
    4  1  0  1  0
    5  0  0  1  1
    

    This is also very fast: it took about 0.03 seconds on 1000 rows and 50K classes.
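
To make clear what the binarizer computes, here is a minimal pure-Python sketch of the same idea: collect the sorted set of labels, then emit one 0/1 row per sample. The helper name `binarize_labels` is hypothetical, and this is an illustration only, not how scikit-learn implements `MultiLabelBinarizer`.

```python
def binarize_labels(samples):
    """Return (classes, rows) where rows[i][j] is 1 if classes[j] is in samples[i]."""
    # Sorted list of every distinct label seen in any sample
    classes = sorted({label for sample in samples for label in sample})
    # Map each label to its column position
    position = {label: j for j, label in enumerate(classes)}
    rows = []
    for sample in samples:
        row = [0] * len(classes)
        for label in sample:
            row[position[label]] = 1
        rows.append(row)
    return classes, rows


# Same data as above, split on the '*' separator
raw = ["A", "B", "C", "D", "A*C", "C*D"]
classes, rows = binarize_labels([value.split("*") for value in raw])
print(classes)
for row in rows:
    print(row)
```

The column order follows the sorted labels, which matches the `A B C D` layout shown in the output above.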

  • 2020-12-04 11:26

    I know it's been a while since this question was asked, but there is (at least now there is) a one-liner that is supported by the documentation:

    In [4]: df
    Out[4]:
          label
    0  (a, c, e)
    1     (a, d)
    2       (b,)
    3     (d, e)
    
    In [5]: df['label'].str.join(sep='*').str.get_dummies(sep='*')
    Out[5]:
       a  b  c  d  e
    0  1  0  1  0  1
    1  1  0  0  1  0
    2  0  1  0  0  0
    3  0  0  0  1  1
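
When the column already holds delimited strings, as in the question's `A*C` data, the `str.join` step should not be needed. A small sketch, assuming pandas is installed and `*` is the only separator in the data:

```python
import pandas as pd

df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

# Split each string on '*' and build one indicator column per distinct label
dummies = df["label"].str.get_dummies(sep="*")

# Attach the indicator columns to the original frame (both share the same index)
result = df.join(dummies)
print(result)
```

Joining on the index keeps the original `label` column next to its indicators, which is handy for checking rows by eye.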
    
  • 2020-12-04 11:32

    You can generate the dummies dataframe from your raw data, select the columns that contain a given atom, and then store the summed matches in that atom's column.

    df
    Out[28]: 
      label
    0     A
    1     B
    2     C
    3     D
    4   A*C
    5   C*D
    
    dummies = pd.get_dummies(df['label'])
    
    atom_col = [c for c in dummies.columns if '*' not in c]
    
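    for col in atom_col:
        # Match on the '*'-separated parts, so that an atom such as 'A'
        # is not also counted for a longer label such as 'AB'
        df[col] = dummies[[c for c in dummies.columns if col in c.split('*')]].sum(axis=1)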
    
    df
    Out[32]: 
      label  A  B  C  D
    0     A  1  0  0  0
    1     B  0  1  0  0
    2     C  0  0  1  0
    3     D  0  0  0  1
    4   A*C  1  0  1  0
    5   C*D  0  0  1  1
    
  • 2020-12-04 11:37

    I have a somewhat cleaner solution. Assume we want to transform the following dataframe

       pageid category
    0       0        a
    1       0        b
    2       1        a
    3       1        c
    

    into

            a  b  c
    pageid         
    0       1  1  0
    1       1  0  1
    

    One way to do it is to make use of scikit-learn's DictVectorizer. I would, however, be interested in learning about other methods.

    import pandas as pd
    import sklearn.feature_extraction
    
    df = pd.DataFrame(dict(pageid=[0, 0, 1, 1], category=['a', 'b', 'a', 'c']))
    
    grouped = df.groupby('pageid').category.apply(lambda lst: tuple((k, 1) for k in lst))
    category_dicts = [dict(tuples) for tuples in grouped]
    v = sklearn.feature_extraction.DictVectorizer(sparse=False)
    X = v.fit_transform(category_dicts)
    
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    pd.DataFrame(X, columns=v.get_feature_names_out(), index=grouped.index)
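
As one other method for the long-format `pageid`/`category` frame, `pd.crosstab` can build the same table without scikit-learn. This is a sketch of a swapped-in technique, not the DictVectorizer approach above; `clip(upper=1)` is added on the assumption that a repeated (pageid, category) pair should still yield 1 and not a count.

```python
import pandas as pd

df = pd.DataFrame(dict(pageid=[0, 0, 1, 1], category=["a", "b", "a", "c"]))

# Count occurrences of each category per pageid, then cap the counts at 1
indicators = pd.crosstab(df["pageid"], df["category"]).clip(upper=1)
print(indicators)
```

The result is indexed by `pageid` with one column per category, the same shape as the target table shown earlier in this answer.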
    