I have a Pandas column containing lists; how do I pivot the unique list elements to columns?

Asked by 旧时难觅i on 2021-02-07 22:39

I wrote a web scraper to pull information from a table of products and build a dataframe. The data table has a Description column which contains a comma-separated string of attributes. How can I pivot the unique attributes out into their own columns?

5 Answers
  • 2021-02-07 23:13

    The answers posted by @piRSquared and @MaxU work very well, but only when the data has no NaN values.

    The data I was working with was very sparse: it had around 1M rows, which got reduced to only about 100 rows after applying the method above, because every row with a NaN in any of the index columns was dropped. It took me more than a day to figure out the fixes, so I am sharing the slightly modified code to save others the time.

    Assuming you have the df DataFrame mentioned above:

    • First replace all NaN occurrences with a placeholder string that cannot appear in any of the other columns, since you will have to replace it back with NaN later.

      cols = ['PRODUCTS', 'DATE']
      col = "DESCRIPTION"
      df.loc[:, cols] = df.loc[:, cols].fillna("SOME_UNIQ_NAN_REPLACEMENT")
      

      This is needed because groupby drops all rows that have NaN in any of the grouping columns.

    • Then run what is suggested in the other answers, with one minor modification: stack(dropna=False). By default dropna=True, which would silently drop the all-NaN rows produced by splitting a NaN DESCRIPTION.

      df = pd.get_dummies(df.set_index(cols)[col]
              .str.split(r",\s*", expand=True)
              .stack(dropna=False), prefix=col) \
              .groupby(cols, sort=False).sum().astype(int).reset_index()
      
    • Then put the NaN values back into df so the data in the other columns is unaltered.

      import numpy as np

      df.replace("SOME_UNIQ_NAN_REPLACEMENT", np.nan, inplace=True)
      

    Hope this saves hours of frustration for someone.
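
    For reference, here is a minimal end-to-end sketch combining the three steps (column names taken from the question; the placeholder string is arbitrary and only assumed not to occur in the real data):

      import numpy as np
      import pandas as pd

      cols = ['PRODUCTS', 'DATE']
      col = 'DESCRIPTION'
      SENTINEL = 'SOME_UNIQ_NAN_REPLACEMENT'  # must not occur in the data

      # 1) protect the index columns from groupby's NaN-dropping
      df.loc[:, cols] = df.loc[:, cols].fillna(SENTINEL)

      # 2) split, stack (keeping NaN descriptions), one-hot encode, re-aggregate
      df = pd.get_dummies(df.set_index(cols)[col]
              .str.split(r',\s*', expand=True)
              .stack(dropna=False), prefix=col) \
              .groupby(cols, sort=False).sum().astype(int).reset_index()

      # 3) restore the original NaNs
      df.replace(SENTINEL, np.nan, inplace=True)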

  • 2021-02-07 23:18

    Use pd.get_dummies:

    cols = ['PRODUCTS', 'DATE']
    pd.get_dummies(
        df.set_index(cols).DESCRIPTION
          .str.split(r',\s*', expand=True).stack()
    ).groupby(level=cols).sum().astype(int)
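
    To make the mechanics visible, here is the same pipeline spelled out on a small sample frame (sample data reconstructed from the other answers on this page):

      import pandas as pd

      df = pd.DataFrame({
          'PRODUCTS': ['Product A', 'Product B', 'Product C'],
          'DATE': ['2016-9-12', '2016-9-11', '2016-9-12'],
          'DESCRIPTION': ['Steel, Red, High Hardness',
                          'Blue, Lightweight, Steel',
                          'Red'],
      })

      cols = ['PRODUCTS', 'DATE']

      # one attribute per row, keyed by (PRODUCTS, DATE)
      stacked = (df.set_index(cols).DESCRIPTION
                   .str.split(r',\s*', expand=True).stack())

      # one-hot encode each attribute, then collapse back to one row per key
      result = pd.get_dummies(stacked).groupby(level=cols).sum().astype(int)
      print(result)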
    

  • 2021-02-07 23:25

    How about something that places an 'X' in the feature column if the product has that feature?

    The code below builds a list of unique features ('Steel', 'Red', etc.), then creates a column for each feature in the original df. It then iterates through each row and, for each of the product's features, places an 'X' in the corresponding cell. Note that this assumes DESCRIPTION already holds Python lists rather than comma-separated strings.

    # unique features list (assumes DESCRIPTION holds lists of attributes)
    unique_list_of_attributes = list(set(item for l in df.DESCRIPTION for item in l))

    # place empty columns in the original df for each feature
    df = pd.concat([df, pd.DataFrame(columns=unique_list_of_attributes)]).fillna(value='')

    # add 'X' in the column if the product has that feature
    for idx, row in df.iterrows():
        for attribute in row['DESCRIPTION']:
            df.loc[idx, attribute] = 'X'
    

    Updated with example output:

        PRODUCTS       DATE                 DESCRIPTION Blue HighHardness  \
    0  Product A  2016-9-12  [Steel, Red, HighHardness]                 X   
    1  Product B  2016-9-11  [Blue, Lightweight, Steel]    X                
    2  Product C  2016-9-12                       [Red]                     
    
      Lightweight Red Steel  
    0               X     X  
    1           X         X  
    2               X       
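
    For larger frames the same 'X' matrix can be built without iterating row by row; a vectorized sketch, again assuming DESCRIPTION holds Python lists:

      import pandas as pd

      # one row per (original row, attribute)
      exploded = df['DESCRIPTION'].explode()

      # 0/1 indicators per attribute, collapsed back to one row per original row
      indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(int)

      # 'X' where the product has the attribute, '' otherwise
      df = df.join(indicators.replace({1: 'X', 0: ''}))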
    
  • 2021-02-07 23:33

    You can build up a sparse matrix:

    In [27]: df
    Out[27]:
        PRODUCTS       DATE                DESCRIPTION
    0  Product A  2016-9-12  Steel, Red, High Hardness
    1  Product B  2016-9-11   Blue, Lightweight, Steel
    2  Product C  2016-9-12                        Red
    
    In [28]: (df.set_index(['PRODUCTS','DATE'])
       ....:    .DESCRIPTION.str.split(r',\s*', expand=True)
       ....:    .stack()
       ....:    .reset_index()
       ....:    .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value=0, aggfunc='size')
       ....: )
    Out[28]:
    0                    Blue  High Hardness  Lightweight  Red  Steel
    PRODUCTS  DATE
    Product A 2016-9-12     0              1            0    1      1
    Product B 2016-9-11     1              0            1    0      1
    Product C 2016-9-12     0              0            0    1      0
    
    In [29]: (df.set_index(['PRODUCTS','DATE'])
       ....:    .DESCRIPTION.str.split(r',\s*', expand=True)
       ....:    .stack()
       ....:    .reset_index()
       ....:    .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value='', aggfunc='size')
       ....: )
    Out[29]:
    0                   Blue High Hardness Lightweight Red Steel
    PRODUCTS  DATE
    Product A 2016-9-12                  1               1     1
    Product B 2016-9-11    1                         1         1
    Product C 2016-9-12                                  1
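
    An equivalent way to get the same count matrix is pd.crosstab, which some may find more direct (a sketch using the df above):

      import pandas as pd

      # one attribute per row, keyed by (PRODUCTS, DATE)
      s = (df.set_index(['PRODUCTS', 'DATE'])
             .DESCRIPTION.str.split(r',\s*', expand=True)
             .stack())

      # cross-tabulate (PRODUCTS, DATE) pairs against attribute values
      out = pd.crosstab(
          index=[s.index.get_level_values('PRODUCTS'),
                 s.index.get_level_values('DATE')],
          columns=s.values,
      )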
    
  • 2021-02-07 23:34

    Here is my crack at a solution, extended from a problem I was already working on.

    def group_agg_pivot_df(df, group_cols, agg_func='count', agg_col=None):
    
        if agg_col is None:
            agg_col = group_cols[0]
    
        grouped = df.groupby(group_cols).agg({agg_col: agg_func}) \
            .unstack().fillna(0)
        # drop aggregation column name from hierarchical column names
        grouped.columns = grouped.columns.droplevel()
    
        # promote the remaining group columns back from the index
        pivot_df = grouped.reset_index()
        pivot_df.columns = [s.replace(' ', '_').lower() for s in pivot_df.columns]
        return pivot_df
    
    def split_stack_df(df, id_cols, split_col, new_col_name):
        # id_cols are the columns we want to pair with the values
        # from the split column
    
        stacked = df.set_index(id_cols)[split_col].str.split(',', expand=True) \
            .stack().reset_index(level=id_cols)
        stacked.columns = id_cols + [new_col_name]
        return stacked
    
    stacked = split_stack_df(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION', 'desc')
    final_df = group_agg_pivot_df(stacked, ['PRODUCTS', 'DATE', 'desc'])
    

    I also benchmarked @MaxU's, @piRSquared's, and my solutions on a pandas DataFrame with 11,592 rows and a description column whose lists contain 2,681 unique values. Obviously the column names are different in the testing DataFrame, but I have kept them the same as in the question here.

    Here are the benchmarks for each method:

    In [277]: %timeit pd.get_dummies(df.set_index(['PRODUCTS', 'DATE']) \
     ...:                        .DESCRIPTION.str.split(',', expand=True) \
     ...:                        .stack()) \
     ...:     .groupby(['PRODUCTS', 'DATE']).sum()
     ...: 
    

    1 loop, best of 3: 1.14 s per loop

    In [278]: %timeit df.set_index(['PRODUCTS', 'DATE']) \
     ...:     .DESCRIPTION.str.split(',', expand=True) \
     ...:     .stack() \
     ...:     .reset_index() \
     ...:     .pivot_table(index=['PRODUCTS', 'DATE'], columns=0, fill_value=0, aggfunc='size')
    

    1 loop, best of 3: 612 ms per loop

    In [286]: %timeit stacked = split_stack_df(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION', 'desc'); \
     ...:     final_df = group_agg_pivot_df(stacked, ['PRODUCTS', 'DATE', 'desc'])
    

    1 loop, best of 3: 62.7 ms per loop

    My guess is that aggregating and then unstacking is faster than either pivot_table() or pd.get_dummies().
