Have Pandas column containing lists, how to pivot unique list elements to columns?

旧时难觅i 2021-02-07 22:39

I wrote a web scraper to pull information from a table of products and build a dataframe. The data table has a Description column which contains a comma separated string of attr

5 answers
  •  傲寒 (original poster)
    2021-02-07 23:13

    The answers posted by @piRSquared and @MaxU work very well, but only when the data contains no NaN values.

    The data I was working with was very sparse. It had around 1M rows, and the method above reduced it to only about 100 rows, because it dropped every row with a NaN in any of the columns. It took me more than a day to figure out the fixes, so I am sharing the slightly modified code to save others the time.
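To see the row loss on a small scale, here is a minimal sketch. The toy frame, its column values, and the variable names are invented for illustration, and the `dropna=False` argument to `groupby` assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one of the three rows has NaN in the grouping column.
toy = pd.DataFrame({
    "PRODUCTS": ["p1", np.nan, "p1"],
    "N": [1, 1, 1],
})

# Default behaviour: the row whose group key is NaN silently disappears.
default_sum = toy.groupby("PRODUCTS").sum()

# Assumes pandas >= 1.1: dropna=False keeps NaN as a group of its own.
kept_sum = toy.groupby("PRODUCTS", dropna=False).sum()

print(len(default_sum), len(kept_sum))
```

On older pandas versions the `dropna` argument is not available, which is one reason to use a placeholder value instead.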

    Suppose you have the DataFrame df described above.

    • First, replace every NaN in the index columns with a placeholder value that cannot occur anywhere else in the data, because you will have to turn it back into NaN later.

      cols = ['PRODUCTS', 'DATE']   # index columns, later used as group keys
      col = "DESCRIPTION"           # column holding the comma-separated strings
      df.loc[:, cols] = df.loc[:, cols].fillna("SOME_UNIQ_NAN_REPLACEMENT")
      

      This is needed because groupby, by default, drops every row whose group keys contain NaN. :/

    • Then run what the other answers suggest, with one small change: call stack(dropna=False), because the default is dropna=True.

      # cols and col are the index-column list and the list column defined in the first step
      df = pd.get_dummies(df.set_index(cols)[col]\
              .str.split(r",\s*", expand=True).stack(dropna=False), prefix=col)\
              .groupby(cols, sort=False).sum().astype(int).reset_index()
      
    • Finally, put NaN back into df so that the data in the other columns is left unchanged.

      df.replace("SOME_UNIQ_NAN_REPLACEMENT", np.nan, inplace=True)
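The placeholder round trip above can also be sidestepped. The sketch below is not the method from the answers: it swaps in `Series.explode` (assumes pandas 0.25 or later) and groups on the original row index instead of on the index columns, so NaN keys never reach `groupby`. The helper name `pivot_list_column` and the toy data are hypothetical, and the frame is assumed to have a unique index.

```python
import numpy as np
import pandas as pd


def pivot_list_column(df, col, sep=r",\s*"):
    """Turn a comma-separated string column into 0/1 indicator columns.

    Assumes df has a unique index, because rows are regrouped by index label.
    """
    # One row per list element; a NaN in `col` stays as a single NaN row.
    exploded = df[col].str.split(sep).explode()
    # NaN elements yield an all-zero indicator row (dummy_na defaults to False).
    dummies = pd.get_dummies(exploded, prefix=col)
    # Collapse back to one row per original index label.
    dummies = dummies.groupby(level=0).sum().astype(int)
    return df.drop(columns=col).join(dummies)


# Hypothetical sparse data with NaN in every column.
toy = pd.DataFrame({
    "PRODUCTS": ["p1", np.nan, "p3", "p4"],
    "DATE": ["2021-01-01", "2021-01-02", np.nan, "2021-01-04"],
    "DESCRIPTION": ["red, large", "red", np.nan, "small,large"],
})

result = pivot_list_column(toy, "DESCRIPTION")
print(result)
```

All four rows survive here, including the ones with NaN in PRODUCTS, DATE, or DESCRIPTION. Whether this is faster or slower than the stack-based version on 1M rows is not something this sketch measures.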
      

    Hope this saves hours of frustration for someone.
