How to explode a list inside a DataFrame cell into separate rows

天命终不由人 2020-11-22 10:20

I'm looking to turn a pandas cell containing a list into rows for each of those values.

So, take this:

If I'd like to unpack and stack the value

11 Answers
  • 2020-11-22 11:14

    Nicer alternative solution with apply(pd.Series):

    import pandas as pd

    df = pd.DataFrame({'listcol':[[1,2,3],[4,5,6]]})
    
    # expand df.listcol into its own dataframe (one column per list position)
    tags = df['listcol'].apply(pd.Series)
    
    # rename each new column with a listcol_ prefix
    tags = tags.rename(columns = lambda x : 'listcol_' + str(x))
    
    # join the tags dataframe back to the original dataframe
    df = pd.concat([df, tags], axis=1)
    
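    Note that apply(pd.Series) on its own gives one column per list position, not one row per value. A minimal sketch of going from that wide form to rows, assuming a hypothetical extra column `id` (not in the original answer) to show that other columns repeat per value:

```python
import pandas as pd

# 'id' is a hypothetical column added only for illustration.
df = pd.DataFrame({'id': ['x', 'y'], 'listcol': [[1, 2, 3], [4, 5, 6]]})

# Wide form: one column per list position, as in the answer above.
tags = df['listcol'].apply(pd.Series)

# Long form: stack the positions into rows, then drop the position level
# so only the original row label remains in the index.
long_values = tags.stack().reset_index(level=1, drop=True).rename('listcol')

# Join back on the original row index so the other columns repeat per value.
long_df = df.drop(columns='listcol').join(long_values)
print(long_df)
```

    The join relies on the original row labels, so it assumes the index of df is unique.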
  • 2020-11-22 11:16

    Instead of using apply(pd.Series) you can build the DataFrame straight from the column's lists with .values.tolist(). This was faster in the timings below.

    df = (pd.DataFrame({'name': ['A.J. Price'] * 3,
                        'opponent': ['76ers', 'blazers', 'bobcats'],
                        'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
            .set_index(['name', 'opponent']))
    
    %timeit (pd.DataFrame(df['nearest_neighbors'].values.tolist(), index = df.index)
               .stack()
               .reset_index(level = 2, drop=True).to_frame('nearest_neighbors'))
    
    1.87 ms ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    
    %timeit (df.nearest_neighbors.apply(pd.Series)
              .stack()
              .reset_index(level=2, drop=True)
              .to_frame('nearest_neighbors'))
    
    2.73 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
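    The %timeit lines only run in IPython. For a plain script, here is a rough sketch that wraps both variants in functions (the names via_tolist and via_apply are made up for this example) and checks that they produce the same rows. Timing is left to the reader, since numbers depend on the machine and the pandas version:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A.J. Price'] * 3,
    'opponent': ['76ers', 'blazers', 'bobcats'],
    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3,
}).set_index(['name', 'opponent'])

def via_tolist(frame):
    # Build the wide frame directly from the Python lists.
    return (pd.DataFrame(frame['nearest_neighbors'].values.tolist(), index=frame.index)
              .stack()
              .reset_index(level=2, drop=True)
              .to_frame('nearest_neighbors'))

def via_apply(frame):
    # Build the wide frame by converting each list to a Series.
    return (frame['nearest_neighbors'].apply(pd.Series)
              .stack()
              .reset_index(level=2, drop=True)
              .to_frame('nearest_neighbors'))

fast = via_tolist(df)
slow = via_apply(df)

# Both approaches should give the same values under the same index labels.
same_values = fast['nearest_neighbors'].tolist() == slow['nearest_neighbors'].tolist()
same_index = fast.index.tolist() == slow.index.tolist()
print(same_values, same_index)
```

    The example uses lists of equal length, as in the answer. Ragged lists are padded with missing values in the wide frame, and how stack() treats those depends on the pandas version, so check that case separately.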
  • 2020-11-22 11:18

    Use apply(pd.Series) and stack, then reset_index and to_frame:

    In [1803]: (df.nearest_neighbors.apply(pd.Series)
                  .stack()
                  .reset_index(level=2, drop=True)
                  .to_frame('nearest_neighbors'))
    Out[1803]:
                        nearest_neighbors
    name       opponent
    A.J. Price 76ers          Zach LaVine
               76ers           Jeremy Lin
               76ers        Nate Robinson
               76ers                Isaia
               blazers        Zach LaVine
               blazers         Jeremy Lin
               blazers      Nate Robinson
               blazers              Isaia
               bobcats        Zach LaVine
               bobcats         Jeremy Lin
               bobcats      Nate Robinson
               bobcats              Isaia
    

    Details (the input df):

    In [1804]: df
    Out[1804]:
                                                       nearest_neighbors
    name       opponent
    A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
               blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
               bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
    
  • 2020-11-22 11:21

    So all of these answers are good, but I wanted something really simple, so here's my contribution:

    def explode(series):
        return pd.Series([x for _list in series for x in _list])                               
    

    That's it. Use this whenever you want a new Series in which the lists are 'exploded' (note that the original index is not kept). Here's an example where we do value_counts() on taco choices :)

    In [1]: my_df = pd.DataFrame(pd.Series([['a','b','c'],['b','c'],['c']]), columns=['tacos'])      
    In [2]: my_df.head()                                                                               
    Out[2]: 
       tacos
    0  [a, b, c]
    1     [b, c]
    2        [c]
    
    In [3]: explode(my_df['tacos']).value_counts()                                                     
    Out[3]: 
    c    3
    b    2
    a    1
    
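    The helper above returns a Series with a fresh 0..n-1 index, which is fine for value_counts() but loses track of which row each value came from. If that matters, one possible variant (explode_keep_index is a made-up name, not part of the answer) repeats each original index label once per list element:

```python
import pandas as pd

def explode_keep_index(series):
    # Number of elements in each list, as a plain NumPy array.
    lengths = series.map(len).to_numpy()
    # Flatten the lists in row order, exactly as the helper above does.
    values = [x for _list in series for x in _list]
    # Repeat each index label once per element of its list.
    return pd.Series(values, index=series.index.repeat(lengths), name=series.name)

tacos = pd.Series([['a', 'b', 'c'], ['b', 'c'], ['c']], name='tacos')
exploded = explode_keep_index(tacos)
print(exploded)
```

    This sketch assumes every cell holds a list (or another sized iterable). Scalars or missing values would need handling first.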
  • 2020-11-22 11:22

    I think this is a really good question. In Hive you would use EXPLODE, and I think there is a case to be made that pandas should include this functionality by default. I would probably explode the list column with a nested generator expression like this:

    pd.DataFrame({
        "name": idx[0],        # first level of the (name, opponent) index
        "opponent": idx[1],    # second level of the index
        "nearest_neighbor": neighbour
        }
        for idx, row in df.iterrows() for neighbour in row.nearest_neighbors
        ).set_index(["name", "opponent"])
    
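    For readers on a newer pandas: a built-in now exists. DataFrame.explode (and Series.explode) was added in pandas 0.25.0 and does what Hive's EXPLODE does for a list column. A short sketch on the frame from the question, assuming pandas 0.25 or later is installed:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A.J. Price'] * 3,
    'opponent': ['76ers', 'blazers', 'bobcats'],
    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3,
}).set_index(['name', 'opponent'])

# One row per list element; the (name, opponent) index labels are repeated.
exploded = df.explode('nearest_neighbors')
print(exploded)
```

    The exploded column usually comes back with object dtype, so a cast may be needed afterwards if the lists held numbers.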