I wrote a web scraper to pull information from a table of products and build a dataframe. The data table has a Description column which contains a comma-separated string of attributes.
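For context, the sample frame the answers below work with (column names and rows taken from their example output) can be reconstructed like this:
import pandas as pd

df = pd.DataFrame({
    'PRODUCTS': ['Product A', 'Product B', 'Product C'],
    'DATE': ['2016-9-12', '2016-9-11', '2016-9-12'],
    'DESCRIPTION': ['Steel, Red, High Hardness',
                    'Blue, Lightweight, Steel',
                    'Red'],
})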
The answers posted by @piRSquared and @MaxU work very well, but only when the data doesn't have any NaN values. The data I was working with was very sparse: it had around 1M rows, which got reduced to only some 100 rows after applying the above method, because it dropped every row with a NaN in any of the columns. It took me more than a day to figure out the fixes, so I'm sharing the slightly modified code to save time for others.
Supposing you have the df DataFrame as mentioned above, first replace all NaN occurrences with something that is not expected in any of the other columns, as you will have to replace it back with NaN later.
cols = ['PRODUCTS', 'DATE']
col = "DESCRIPTION"
df.loc[:, cols] = df.loc[:, cols].fillna("SOME_UNIQ_NAN_REPLACEMENT")
This is needed as groupby drops all rows with NaN values. :/
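To see why the placeholder is needed: groupby silently drops any group whose key contains NaN. A tiny illustration (note that pandas 1.1+ also offers groupby(..., dropna=False) as an alternative to the placeholder trick):
import numpy as np
import pandas as pd

demo = pd.DataFrame({'PRODUCTS': ['Product A', np.nan],
                     'DESCRIPTION': ['Steel', 'Red']})
print(demo.groupby('PRODUCTS').count())                # the NaN-keyed row disappears
print(demo.groupby('PRODUCTS', dropna=False).count())  # keeps the NaN group (pandas >= 1.1)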
Then we run what is suggested in the other answers, with one minor modification: stack(dropna=False). By default, dropna=True.
df = (pd.get_dummies(df.set_index(cols)[col]
                       .str.split(r",\s*", expand=True)
                       .stack(dropna=False),
                     prefix=col)
        .groupby(cols, sort=False).sum().astype(int).reset_index())
And then you put the NaN values back in df so as not to alter the data in the other columns.
df.replace("SOME_UNIQ_NAN_REPLACEMENT", np.nan, inplace=True)
Hope this saves hours of frustration for someone.
Use pd.get_dummies
cols = ['PRODUCTS', 'DATE']

pd.get_dummies(
    df.set_index(cols).DESCRIPTION
      .str.split(r',\s*', expand=True).stack()
).groupby(level=cols).sum().astype(int)
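If it helps to see why get_dummies works here: the split/stack step first produces a long Series with one attribute per row, keyed by the (PRODUCTS, DATE) index; get_dummies then one-hot encodes those single attributes, and the groupby sums the rows back to one per product. A quick sketch of that intermediate, assuming the example df from the question:
stacked = (df.set_index(cols).DESCRIPTION
             .str.split(r',\s*', expand=True)
             .stack())
# stacked is roughly:
# PRODUCTS   DATE
# Product A  2016-9-12  0            Steel
#                       1              Red
#                       2    High Hardness
# Product B  2016-9-11  0             Blue
# ...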
How about something that places an 'X' in the feature column if the product has that feature?
The below creates a list of unique features ('Steel', 'Red', etc.), then creates a column for each feature in the original df. Then we iterate through each row and for each product feature, we place an 'X' in the cell.
# note: this approach assumes DESCRIPTION already holds lists of attributes,
# e.g. ['Steel', 'Red', 'HighHardness']
unique_list_of_attributes = list({item for l in df.DESCRIPTION for item in l})  # unique features list
# place empty columns in original df for each feature
df = pd.concat([df, pd.DataFrame(columns=unique_list_of_attributes)]).fillna(value='')
# add 'X' in column if product has feature
for row in df.iterrows():
    for attribute in row[1]['DESCRIPTION']:
        df.loc[row[0], attribute] = 'X'
updated with example output:
    PRODUCTS       DATE                 DESCRIPTION Blue HighHardness  \
0  Product A  2016-9-12  [Steel, Red, HighHardness]                 X
1  Product B  2016-9-11  [Blue, Lightweight, Steel]    X
2  Product C  2016-9-12                       [Red]

  Lightweight Red Steel
0               X     X
1           X         X
2               X
You can build up a sparse matrix:
In [27]: df
Out[27]:
PRODUCTS DATE DESCRIPTION
0 Product A 2016-9-12 Steel, Red, High Hardness
1 Product B 2016-9-11 Blue, Lightweight, Steel
2 Product C 2016-9-12 Red
In [28]: (df.set_index(['PRODUCTS','DATE'])
....: .DESCRIPTION.str.split(',\s*', expand=True)
....: .stack()
....: .reset_index()
....: .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value=0, aggfunc='size')
....: )
Out[28]:
0 Blue High Hardness Lightweight Red Steel
PRODUCTS DATE
Product A 2016-9-12 0 1 0 1 1
Product B 2016-9-11 1 0 1 0 1
Product C 2016-9-12 0 0 0 1 0
In [29]: (df.set_index(['PRODUCTS','DATE'])
....: .DESCRIPTION.str.split(',\s*', expand=True)
....: .stack()
....: .reset_index()
....: .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value='', aggfunc='size')
....: )
Out[29]:
0 Blue High Hardness Lightweight Red Steel
PRODUCTS DATE
Product A 2016-9-12 1 1 1
Product B 2016-9-11 1 1 1
Product C 2016-9-12 1
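A side note on the "sparse" claim: the pivot_table result above is an ordinary dense frame of 0/1 counts. If memory actually matters, one option (not part of the original answer, just a sketch) is to convert the result to pandas' sparse dtype afterwards:
result = (df.set_index(['PRODUCTS', 'DATE'])
            .DESCRIPTION.str.split(r',\s*', expand=True)
            .stack()
            .reset_index()
            .pivot_table(index=['PRODUCTS', 'DATE'], columns=0,
                         fill_value=0, aggfunc='size'))
# store only the non-zero cells; 0 becomes the implicit fill value
sparse_result = result.astype(pd.SparseDtype('int', fill_value=0))
print(sparse_result.sparse.density)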
Here is my crack at a solution extended from a problem I was already working on.
def group_agg_pivot_df(df, group_cols, agg_func='count', agg_col=None):
    if agg_col is None:
        agg_col = group_cols[0]
    grouped = df.groupby(group_cols).agg({agg_col: agg_func}) \
                .unstack().fillna(0)
    # drop aggregation column name from hierarchical column names
    grouped.columns = grouped.columns.droplevel()
    # promote index to column (the first element of group_cols)
    pivot_df = grouped.reset_index()
    pivot_df.columns = [s.replace(' ', '_').lower() for s in pivot_df.columns]
    return pivot_df

def split_stack_df(df, id_cols, split_col, new_col_name):
    # id_cols are the columns we want to pair with the values
    # from the split column
    stacked = df.set_index(id_cols)[split_col].str.split(',', expand=True) \
                .stack().reset_index(level=id_cols)
    stacked.columns = id_cols + [new_col_name]
    return stacked
stacked = split_stack_df(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION', 'desc')
final_df = group_agg_pivot_df(stacked, ['PRODUCTS', 'DATE', 'desc'])
I also benchmarked @MaxU's, @piRSquared's, and my solutions on a pandas DataFrame with 11592 rows and a column containing lists with 2681 unique values. Obviously the column names are different in my testing data frame, but I have kept them the same as in the question here.
Here are the benchmarks for each method:
In [277]: %timeit pd.get_dummies(df.set_index(['PRODUCTS', 'DATE']) \
...: .DESCRIPTION.str.split(',', expand=True) \
...: .stack()) \
...: .groupby(['PRODUCTS', 'DATE']).sum()
...:
1 loop, best of 3: 1.14 s per loop
In [278]: %timeit df.set_index(['PRODUCTS', 'DATE']) \
...: .DESCRIPTION.str.split(',', expand=True) \
...: .stack() \
...: .reset_index() \
...: .pivot_table(index=['PRODUCTS', 'DATE'], columns=0, fill_value=0, aggfunc='size')
1 loop, best of 3: 612 ms per loop
In [286]: %timeit stacked = split_stack_df(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION', 'desc'); \
...: final_df = group_agg_pivot_df(stacked, ['PRODUCTS', 'DATE', 'desc'])
1 loop, best of 3: 62.7 ms per loop
My guess is that aggregation and unstacking are faster than either pivot_table() or pd.get_dummies().
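For anyone who wants to rerun a comparison like this, here is a hypothetical way to generate a test frame of roughly similar shape (the sizes below are made up, not the 11592-row / 2681-attribute frame used for the timings above):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
attributes = ['attr_%d' % i for i in range(500)]   # pool of fake feature names
df = pd.DataFrame({
    'PRODUCTS': ['Product %d' % i for i in range(10000)],
    'DATE': '2016-9-12',
    'DESCRIPTION': [', '.join(rng.choice(attributes, size=3, replace=False))
                    for _ in range(10000)],
})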