The answers posted by @piRSquared and @MaxU work very well, but only when the data doesn't have any NaN values. The data I was working with was very sparse: it had around 1M rows, which got reduced to only about 100 rows after applying the above method, because it dropped every row with a NaN in any of the columns. It took me more than a day to figure out the fix, so I'm sharing the slightly modified code to save others the time.
Supposing you have the df DataFrame as mentioned above, first replace every NaN occurrence with a placeholder value that is not expected in any of the other columns, since you will have to replace it back with NaN later:
import numpy as np
import pandas as pd

cols = ['PRODUCTS', 'DATE']
col = "DESCRIPTION"
df.loc[:, cols] = df.loc[:, cols].fillna("SOME_UNIQ_NAN_REPLACEMENT")
This is needed because groupby drops every row that has a NaN in any of its grouping columns. :/
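You can see the problem on a two-row toy frame (hypothetical data, just to illustrate):

toy = pd.DataFrame({"PRODUCTS": ["a", np.nan], "N": [1, 2]})
print(toy.groupby("PRODUCTS").sum())  # only the "a" group comes back; the NaN-keyed row is silently discarded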
Then we run what is suggested in the other answers, with one minor modification: stack(dropna=False). By default, dropna=True, which is what silently throws the NaN entries away.
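A quick check of the difference, on a hypothetical single-column frame:

tmp = pd.DataFrame({"A": ["x", np.nan]})
print(tmp.stack())              # 1 row: the NaN entry is dropped
print(tmp.stack(dropna=False))  # 2 rows: the NaN entry survives

With that modification in place, the one-liner becomes: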
df = pd.get_dummies(
        df.set_index(cols)[col].str.split(r",\s*", expand=True).stack(dropna=False),
        prefix=col
     ).groupby(cols, sort=False).sum().astype(int).reset_index()
And then you put the NaN values back into df, so the data in the other columns isn't altered:
df.replace("SOME_UNIQ_NAN_REPLACEMENT", np.nan, inplace=True)
Hope this saves hours of frustration for someone.