I wrote a web scraper to pull information from a table of products and build a dataframe. The data table has a Description column which contains a comma-separated string of attributes.
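For context, the sample frame the answers below work with (column names and rows taken from their example output) can be reconstructed like this:
import pandas as pd

df = pd.DataFrame({
    'PRODUCTS': ['Product A', 'Product B', 'Product C'],
    'DATE': ['2016-9-12', '2016-9-11', '2016-9-12'],
    'DESCRIPTION': ['Steel, Red, High Hardness',
                    'Blue, Lightweight, Steel',
                    'Red'],
})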
The answers posted by @piRSquared and @MaxU work very well, but only when the data doesn't have any NaN values. The data I was working with was very sparse: it had around 1M rows, which got reduced to only some 100 rows after applying the above method, because it dropped every row with a NaN in any of the columns. It took me more than a day to figure out the fixes, so I'm sharing the slightly modified code to save time for others.
Supposing you have the df DataFrame as mentioned above, first replace all NaN occurrences with something that is not expected in any of the other columns, as you will have to replace it back with NaN later.
cols = ['PRODUCTS', 'DATE']
col = "DESCRIPTION"
df.loc[:, cols] = df.loc[:, cols].fillna("SOME_UNIQ_NAN_REPLACEMENT")
This is needed as groupby drops all rows with NaN values. :/
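To see why the placeholder is needed: groupby silently drops any group whose key contains NaN. A tiny illustration (note that pandas 1.1+ also offers groupby(..., dropna=False) as an alternative to the placeholder trick):
import numpy as np
import pandas as pd

demo = pd.DataFrame({'PRODUCTS': ['Product A', np.nan],
                     'DESCRIPTION': ['Steel', 'Red']})
print(demo.groupby('PRODUCTS').count())                # the NaN-keyed row disappears
print(demo.groupby('PRODUCTS', dropna=False).count())  # keeps the NaN group (pandas >= 1.1)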
Then we run what is suggested in the other answers, with one minor modification: stack(dropna=False). By default, dropna=True.
df = (pd.get_dummies(df.set_index(cols)[col]
                       .str.split(r",\s*", expand=True)
                       .stack(dropna=False),
                     prefix=col)
        .groupby(cols, sort=False).sum().astype(int).reset_index())
And then you put the NaN values back in df so as not to alter the data in the other columns.
df.replace("SOME_UNIQ_NAN_REPLACEMENT", np.nan, inplace=True)
Hope this saves hours of frustration for someone.
Use pd.get_dummies
cols = ['PRODUCTS', 'DATE']

pd.get_dummies(
    df.set_index(cols).DESCRIPTION
      .str.split(r',\s*', expand=True).stack()
).groupby(level=cols).sum().astype(int)
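If it helps to see why get_dummies works here: the split/stack step first produces a long Series with one attribute per row, keyed by the (PRODUCTS, DATE) index; get_dummies then one-hot encodes those single attributes, and the groupby sums the rows back to one per product. A quick sketch of that intermediate, assuming the example df from the question:
stacked = (df.set_index(cols).DESCRIPTION
             .str.split(r',\s*', expand=True)
             .stack())
# stacked is roughly:
# PRODUCTS   DATE
# Product A  2016-9-12  0            Steel
#                       1              Red
#                       2    High Hardness
# Product B  2016-9-11  0             Blue
# ...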
How about something that places an 'X' in the feature column if the product has that feature?
The below creates a list of unique features ('Steel', 'Red', etc.), then creates a column for each feature in the original df. Then we iterate through each row and for each product feature, we place an 'X' in the cell.
# note: this approach assumes DESCRIPTION already holds lists of attributes,
# e.g. ['Steel', 'Red', 'HighHardness']
unique_list_of_attributes = list({item for l in df.DESCRIPTION for item in l})  # unique features list
# place empty columns in original df for each feature
df = pd.concat([df, pd.DataFrame(columns=unique_list_of_attributes)]).fillna(value='')
# add 'X' in column if product has feature
for row in df.iterrows():
    for attribute in row[1]['DESCRIPTION']:
        df.loc[row[0], attribute] = 'X'
updated with example output:
    PRODUCTS       DATE                 DESCRIPTION Blue HighHardness  \
0  Product A  2016-9-12  [Steel, Red, HighHardness]                 X
1  Product B  2016-9-11  [Blue, Lightweight, Steel]    X
2  Product C  2016-9-12                       [Red]

  Lightweight Red Steel
0               X     X
1           X         X
2               X
You can build up a sparse matrix:
In [27]: df
Out[27]:
PRODUCTS DATE DESCRIPTION
0 Product A 2016-9-12 Steel, Red, High Hardness
1 Product B 2016-9-11 Blue, Lightweight, Steel
2 Product C 2016-9-12 Red
In [28]: (df.set_index(['PRODUCTS','DATE'])
....: .DESCRIPTION.str.split(',\s*', expand=True)
....: .stack()
....: .reset_index()
....: .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value=0, aggfunc='size')
....: )
Out[28]:
0 Blue High Hardness Lightweight Red Steel
PRODUCTS DATE
Product A 2016-9-12 0 1 0 1 1
Product B 2016-9-11 1 0 1 0 1
Product C 2016-9-12 0 0 0 1 0
In [29]: (df.set_index(['PRODUCTS','DATE'])
....: .DESCRIPTION.str.split(',\s*', expand=True)
....: .stack()
....: .reset_index()
....: .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value='', aggfunc='size')
....: )
Out[29]:
0 Blue High Hardness Lightweight Red Steel
PRODUCTS DATE
Product A 2016-9-12 1 1 1
Product B 2016-9-11 1 1 1
Product C 2016-9-12 1
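A side note on the "sparse" claim: the pivot_table result above is an ordinary dense frame of 0/1 counts. If memory actually matters, one option (not part of the original answer, just a sketch) is to convert the result to pandas' sparse dtype afterwards:
result = (df.set_index(['PRODUCTS', 'DATE'])
            .DESCRIPTION.str.split(r',\s*', expand=True)
            .stack()
            .reset_index()
            .pivot_table(index=['PRODUCTS', 'DATE'], columns=0,
                         fill_value=0, aggfunc='size'))
# store only the non-zero cells; 0 becomes the implicit fill value
sparse_result = result.astype(pd.SparseDtype('int', fill_value=0))
print(sparse_result.sparse.density)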
Here is my crack at a solution extended from a problem I was already working on.
def group_agg_pivot_df(df, group_cols, agg_func='count', agg_col=None):
    if agg_col is None:
        agg_col = group_cols[0]
    grouped = df.groupby(group_cols).agg({agg_col: agg_func}) \
                .unstack().fillna(0)
    # drop aggregation column name from hierarchical column names
    grouped.columns = grouped.columns.droplevel()
    # promote index to column (the first element of group_cols)
    pivot_df = grouped.reset_index()
    pivot_df.columns = [s.replace(' ', '_').lower() for s in pivot_df.columns]
    return pivot_df

def split_stack_df(df, id_cols, split_col, new_col_name):
    # id_cols are the columns we want to pair with the values
    # from the split column
    stacked = df.set_index(id_cols)[split_col].str.split(',', expand=True) \
                .stack().reset_index(level=id_cols)
    stacked.columns = id_cols + [new_col_name]
    return stacked
stacked = split_stack_df(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION', 'desc')
final_df = group_agg_pivot_df(stacked, ['PRODUCTS', 'DATE', 'desc'])
I also benchmarked @MaxU's, @piRSquared's, and my solutions on a pandas DataFrame with 11592 rows and a column containing lists with 2681 unique values. Obviously the column names are different in my testing data frame, but I have kept them the same as in the question here.
Here are the benchmarks for each method:
In [277]: %timeit pd.get_dummies(df.set_index(['PRODUCTS', 'DATE']) \
...: .DESCRIPTION.str.split(',', expand=True) \
...: .stack()) \
...: .groupby(['PRODUCTS', 'DATE']).sum()
...:
1 loop, best of 3: 1.14 s per loop
In [278]: %timeit df.set_index(['PRODUCTS', 'DATE']) \
...: .DESCRIPTION.str.split(',', expand=True) \
...: .stack() \
...: .reset_index() \
...: .pivot_table(index=['PRODUCTS', 'DATE'], columns=0, fill_value=0, aggfunc='size')
1 loop, best of 3: 612 ms per loop
In [286]: %timeit stacked = split_stack_df(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION', 'desc'); \
...: final_df = group_agg_pivot_df(stacked, ['PRODUCTS', 'DATE', 'desc'])
1 loop, best of 3: 62.7 ms per loop
My guess is that aggregation and unstacking are faster than either pivot_table() or pd.get_dummies().
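For anyone who wants to rerun a comparison like this, here is a hypothetical way to generate a test frame of roughly similar shape (the sizes below are made up, not the 11592-row / 2681-attribute frame used for the timings above):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
attributes = ['attr_%d' % i for i in range(500)]   # pool of fake feature names
df = pd.DataFrame({
    'PRODUCTS': ['Product %d' % i for i in range(10000)],
    'DATE': '2016-9-12',
    'DESCRIPTION': [', '.join(rng.choice(attributes, size=3, replace=False))
                    for _ in range(10000)],
})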