How to explode a list inside a Dataframe cell into separate rows

后端 未结 11 2159
天命终不由人
天命终不由人 2020-11-22 10:20

I\'m looking to turn a pandas cell containing a list into rows for each of those values.

So, take this:

If I\'d like to unpack and stack the value

相关标签:
11条回答
  • 2020-11-22 10:57
    • Exploding a list-like column has been simplified significantly in pandas 0.25 with the addition of the explode() method:
    df = (pd.DataFrame({'name': ['A.J. Price'] * 3, 
                        'opponent': ['76ers', 'blazers', 'bobcats'], 
                        'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
          .set_index(['name', 'opponent']))
    
    df.explode('nearest_neighbors')
    

    Out:

                        nearest_neighbors
    name       opponent                  
    A.J. Price 76ers          Zach LaVine
               76ers           Jeremy Lin
               76ers        Nate Robinson
               76ers                Isaia
               blazers        Zach LaVine
               blazers         Jeremy Lin
               blazers      Nate Robinson
               blazers              Isaia
               bobcats        Zach LaVine
               bobcats         Jeremy Lin
               bobcats      Nate Robinson
               bobcats              Isaia
    
    0 讨论(0)
  • 2020-11-22 11:03

    In the code below, I first reset the index to make the row iteration easier.

    I create a list of lists where each element of the outer list is a row of the target DataFrame and each element of the inner list is one of the columns. This nested list will ultimately be concatenated to create the desired DataFrame.

    I use a lambda function together with list iteration to create a row for each element of the nearest_neighbors paired with the relevant name and opponent.

    Finally, I create a new DataFrame from this list (using the original column names and setting the index back to name and opponent).

    df = (pd.DataFrame({'name': ['A.J. Price'] * 3, 
                        'opponent': ['76ers', 'blazers', 'bobcats'], 
                        'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
          .set_index(['name', 'opponent']))
    
    >>> df
                                                        nearest_neighbors
    name       opponent                                                  
    A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
               blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
               bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
    
    df.reset_index(inplace=True)
    rows = []
    _ = df.apply(lambda row: [rows.append([row['name'], row['opponent'], nn]) 
                             for nn in row.nearest_neighbors], axis=1)
    df_new = pd.DataFrame(rows, columns=df.columns).set_index(['name', 'opponent'])
    
    >>> df_new
                        nearest_neighbors
    name       opponent                  
    A.J. Price 76ers          Zach LaVine
               76ers           Jeremy Lin
               76ers        Nate Robinson
               76ers                Isaia
               blazers        Zach LaVine
               blazers         Jeremy Lin
               blazers      Nate Robinson
               blazers              Isaia
               bobcats        Zach LaVine
               bobcats         Jeremy Lin
               bobcats      Nate Robinson
               bobcats              Isaia
    

    EDIT JUNE 2017

    An alternative method is as follows:

    >>> (pd.melt(df.nearest_neighbors.apply(pd.Series).reset_index(), 
                 id_vars=['name', 'opponent'],
                 value_name='nearest_neighbors')
         .set_index(['name', 'opponent'])
         .drop('variable', axis=1)
         .dropna()
         .sort_index()
         )
    
    0 讨论(0)
  • 2020-11-22 11:06

    Similar to Hive's EXPLODE functionality:

    import copy
    
    def pandas_explode(df, column_to_explode):
        """
        Similar to Hive's EXPLODE function, take a column with iterable elements, and flatten the iterable to one element 
        per observation in the output table
    
        :param df: A dataframe to explod
        :type df: pandas.DataFrame
        :param column_to_explode: 
        :type column_to_explode: str
        :return: An exploded data frame
        :rtype: pandas.DataFrame
        """
    
        # Create a list of new observations
        new_observations = list()
    
        # Iterate through existing observations
        for row in df.to_dict(orient='records'):
    
            # Take out the exploding iterable
            explode_values = row[column_to_explode]
            del row[column_to_explode]
    
            # Create a new observation for every entry in the exploding iterable & add all of the other columns
            for explode_value in explode_values:
    
                # Deep copy existing observation
                new_observation = copy.deepcopy(row)
    
                # Add one (newly flattened) value from exploding iterable
                new_observation[column_to_explode] = explode_value
    
                # Add to the list of new observations
                new_observations.append(new_observation)
    
        # Create a DataFrame
        return_df = pandas.DataFrame(new_observations)
    
        # Return
        return return_df
    
    0 讨论(0)
  • 2020-11-22 11:08

    Extending Oleg's .iloc answer to automatically flatten all list-columns:

    def extend_iloc(df):
        cols_to_flatten = [colname for colname in df.columns if 
        isinstance(df.iloc[0][colname], list)]
        # Row numbers to repeat 
        lens = df[cols_to_flatten[0]].apply(len)
        vals = range(df.shape[0])
        ilocations = np.repeat(vals, lens)
        # Replicate rows and add flattened column of lists
        with_idxs = [(i, c) for (i, c) in enumerate(df.columns) if c not in cols_to_flatten]
        col_idxs = list(zip(*with_idxs)[0])
        new_df = df.iloc[ilocations, col_idxs].copy()
    
        # Flatten columns of lists
        for col_target in cols_to_flatten:
            col_flat = [item for sublist in df[col_target] for item in sublist]
            new_df[col_target] = col_flat
    
        return new_df
    

    This assumes that each list-column has equal list length.

    0 讨论(0)
  • 2020-11-22 11:10

    The fastest method I found so far is extending the DataFrame with .iloc and assigning back the flattened target column.

    Given the usual input (replicated a bit):

    df = (pd.DataFrame({'name': ['A.J. Price'] * 3, 
                        'opponent': ['76ers', 'blazers', 'bobcats'], 
                        'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
          .set_index(['name', 'opponent']))
    df = pd.concat([df]*10)
    
    df
    Out[3]: 
                                                       nearest_neighbors
    name       opponent                                                 
    A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
               blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
               bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
               76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
               blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
    ...
    

    Given the following suggested alternatives:

    col_target = 'nearest_neighbors'
    
    def extend_iloc():
        # Flatten columns of lists
        col_flat = [item for sublist in df[col_target] for item in sublist] 
        # Row numbers to repeat 
        lens = df[col_target].apply(len)
        vals = range(df.shape[0])
        ilocations = np.repeat(vals, lens)
        # Replicate rows and add flattened column of lists
        cols = [i for i,c in enumerate(df.columns) if c != col_target]
        new_df = df.iloc[ilocations, cols].copy()
        new_df[col_target] = col_flat
        return new_df
    
    def melt():
        return (pd.melt(df[col_target].apply(pd.Series).reset_index(), 
                 id_vars=['name', 'opponent'],
                 value_name=col_target)
                .set_index(['name', 'opponent'])
                .drop('variable', axis=1)
                .dropna()
                .sort_index())
    
    def stack_unstack():
        return (df[col_target].apply(pd.Series)
                .stack()
                .reset_index(level=2, drop=True)
                .to_frame(col_target))
    

    I find that extend_iloc() is the fastest:

    %timeit extend_iloc()
    3.11 ms ± 544 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit melt()
    22.5 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit stack_unstack()
    11.5 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    0 讨论(0)
  • 2020-11-22 11:11

    Here is a potential optimization for larger dataframes. This runs faster when there are several equal values in the "exploding" field. (The larger the dataframe is compared to the unique value count in the field, the better this code will perform.)

    def lateral_explode(dataframe, fieldname): 
        temp_fieldname = fieldname + '_made_tuple_' 
        dataframe[temp_fieldname] = dataframe[fieldname].apply(tuple)       
        list_of_dataframes = []
        for values in dataframe[temp_fieldname].unique().tolist(): 
            list_of_dataframes.append(pd.DataFrame({
                temp_fieldname: [values] * len(values), 
                fieldname: list(values), 
            }))
        dataframe = dataframe[list(set(dataframe.columns) - set([fieldname]))]\ 
            .merge(pd.concat(list_of_dataframes), how='left', on=temp_fieldname) 
        del dataframe[temp_fieldname]
    
        return dataframe
    
    0 讨论(0)
提交回复
热议问题