Split (explode) pandas dataframe string entry to separate rows

一向 2020-11-21 05:03

I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (as

22 Answers
  • 2020-11-21 05:23

    The string function split can take an optional boolean argument 'expand'.

    Here is a solution using this argument:

    (a.var1
      .str.split(",",expand=True)
      .set_index(a.var2)
      .stack()
      .reset_index(level=1, drop=True)
      .reset_index()
      .rename(columns={0:"var1"}))
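To make the chain above concrete, here is a minimal sketch run on a small frame. The sample data is an assumption, modelled on the question (a comma-separated `var1` column and an id-like `var2` column); the comments describe what each step is expected to do.

```python
import pandas as pd

# Assumed sample data: one comma-separated column and one id column
a = pd.DataFrame({"var1": ["a,b,c", "d,e,f"], "var2": [1, 2]})

result = (
    a.var1
    .str.split(",", expand=True)      # one column per token: 0, 1, 2
    .set_index(a.var2)                # carry var2 along as the index
    .stack()                          # long format: (var2, token position) -> token
    .reset_index(level=1, drop=True)  # drop the token-position level
    .reset_index()                    # turn var2 back into a column
    .rename(columns={0: "var1"})      # the stacked values are in column 0
)

print(result)
```

Rows with fewer tokens are padded with missing values by `expand=True`. Depending on the pandas version, `stack` may or may not drop them, so a `dropna()` could be needed for ragged input.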
    
  • 2020-11-21 05:26

    Another solution, using Python's copy module:

    import copy
    import pandas as pd

    def pandas_explode(df, column_to_explode):
        new_observations = []
        for row in df.to_dict(orient='records'):
            # Take the value out of the record so the rest can be copied as-is
            explode_values = row.pop(column_to_explode)
            if isinstance(explode_values, (list, tuple)):
                # One new record per element of the list or tuple
                for explode_value in explode_values:
                    new_observation = copy.deepcopy(row)
                    new_observation[column_to_explode] = explode_value
                    new_observations.append(new_observation)
            else:
                # Scalars (including unsplit strings) are kept as a single row
                new_observation = copy.deepcopy(row)
                new_observation[column_to_explode] = explode_values
                new_observations.append(new_observation)
        return pd.DataFrame(new_observations)

    df = pandas_explode(df, column_name)
    
  • 2020-11-21 05:29

    I used jiln's excellent answer from above, but needed to extend it to split multiple columns, so I thought I would share.

    import pandas as pd

    def splitDataFrameList(df, target_columns, separator):
        ''' df = dataframe to split,
        target_columns = list of the columns containing the values to split
        separator = the symbol used to perform the split

        returns: a dataframe with each entry for the target columns separated, with each element moved into a new row.
        The values in the other columns are duplicated across the newly divided rows.
        '''
        def splitListToRows(row, row_accumulator, target_columns, separator):
            split_rows = []
            for target_column in target_columns:
                split_rows.append(row[target_column].split(separator))
            # Separate for multiple columns; every target column is assumed to
            # split into the same number of pieces as the first one
            for i in range(len(split_rows[0])):
                new_row = row.to_dict()
                for j in range(len(split_rows)):
                    new_row[target_columns[j]] = split_rows[j][i]
                row_accumulator.append(new_row)
        new_rows = []
        df.apply(splitListToRows, axis=1, args=(new_rows, target_columns, separator))
        new_df = pd.DataFrame(new_rows)
        return new_df
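For comparison, the same lockstep split of several columns can be sketched with the built-in `DataFrame.explode`. This is not the function above but an alternative under stated assumptions: a pandas version that accepts a list of columns in `explode` (1.3 or later, as far as I know), and every target column splitting into the same number of pieces in each row. The column names and data are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two delimited columns whose pieces line up row by row
df = pd.DataFrame({
    "id": [1, 2],
    "names": ["a,b", "c"],
    "codes": ["x,y", "z"],
})

# Turn each delimited string into a list first
as_lists = df.assign(
    names=df["names"].str.split(","),
    codes=df["codes"].str.split(","),
)

# Explode both list columns together; the other columns are repeated
out = as_lists.explode(["names", "codes"], ignore_index=True)

print(out)
```

If the lists in a row differ in length, `explode` raises an error rather than guessing how to pair the pieces, which is arguably safer than silently truncating.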
    
  • 2020-11-21 05:31

    I was running into out-of-memory errors with various ways of exploding my lists, so I prepared some benchmarks to help me decide which answers to upvote. I tested five scenarios with varying ratios of list length to number of lists. The results are below:

    Time (less is better): [chart not reproduced]

    Peak memory usage (less is better): [chart not reproduced]

    Conclusions:

    • @MaxU's answer (update 2), codename concatenate, offers the best speed in almost every case while keeping peak memory usage low,
    • see @DMulligan's answer (codename stack) if you need to process lots of rows with relatively small lists and can afford increased peak memory,
    • the accepted @Chang's answer works well for data frames that have a few rows but very large lists.

    Full details (functions and benchmarking code) are in this GitHub gist. Please note that the benchmark problem was simplified and did not include splitting of strings into the list - which most solutions performed in a similar fashion.
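As a rough illustration of how such a measurement could be taken, here is a small sketch. It is not the code from the gist: `explode_concatenate` and `measure` are hypothetical names, the data is synthetic, and the explode function is a simplified take on the repeat-and-concatenate idea that assumes no empty lists. `tracemalloc` only sees allocations it can trace and slows the code down, so the numbers are indicative at best.

```python
import time
import tracemalloc

import numpy as np
import pandas as pd


def explode_concatenate(df, list_col):
    """Repeat the scalar columns and flatten the list column (no empty lists assumed)."""
    lens = df[list_col].map(len).to_numpy()
    data = {
        col: np.repeat(df[col].to_numpy(), lens)
        for col in df.columns if col != list_col
    }
    data[list_col] = np.concatenate(df[list_col].to_numpy())
    return pd.DataFrame(data)


def measure(func, *args):
    """Return (result, elapsed seconds, peak traced bytes) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


# Synthetic scenario: many rows, short lists
n_rows = 1000
df = pd.DataFrame({
    "id": range(n_rows),
    "vals": [[i, i + 1, i + 2] for i in range(n_rows)],
})

out, elapsed, peak = measure(explode_concatenate, df, "vals")
print(f"rows out: {len(out)}, time: {elapsed:.4f}s, peak traced memory: {peak} bytes")
```

Varying `n_rows` and the list length covers the different row-count to list-length ratios the benchmark describes.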

  • 2020-11-21 05:31

    My version of the solution to add to this collection! :-)

    # Original problem
    from pandas import DataFrame
    import numpy as np
    a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
                   {'var1': 'd,e,f', 'var2': 2}])
    b = DataFrame([{'var1': 'a', 'var2': 1},
                   {'var1': 'b', 'var2': 1},
                   {'var1': 'c', 'var2': 1},
                   {'var1': 'd', 'var2': 2},
                   {'var1': 'e', 'var2': 2},
                   {'var1': 'f', 'var2': 2}])
    ### My solution
    import pandas as pd
    import functools
    def expand_on_cols(df, fuse_cols, delim=","):
        def expand_on_col(df, fuse_col):
            col_order = df.columns
            df_expanded = pd.DataFrame(
                df.set_index([x for x in df.columns if x != fuse_col])[fuse_col]
                .apply(lambda x: x.split(delim))
                .explode()
            ).reset_index()
            return df_expanded[col_order]
        all_expanded = functools.reduce(expand_on_col, fuse_cols, df)
        return all_expanded
    
    assert(b.equals(expand_on_cols(a, ["var1"], delim=",")))
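When only one column needs splitting, the same idea can be sketched without the helper: split the strings into lists, then call `explode`. This assumes a pandas version that provides `DataFrame.explode` (0.25 or later) and reuses the sample frame from the original problem.

```python
import pandas as pd

a = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
                  {'var1': 'd,e,f', 'var2': 2}])

exploded = (
    a.assign(var1=a['var1'].str.split(','))  # strings -> lists
     .explode('var1')                        # one row per list element
     .reset_index(drop=True)                 # drop the repeated original index
)

print(exploded)
```

Without `reset_index(drop=True)`, the exploded rows keep the index label of the row they came from, which can be useful for joining back to the original frame.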
    
  • 2020-11-21 05:32

    Upgraded MaxU's answer with MultiIndex support:

    import numpy as np
    import pandas as pd

    def explode(df, lst_cols, fill_value='', preserve_index=False):
        """
        usage:
            In [134]: df
            Out[134]:
               aaa  myid        num          text
            0   10     1  [1, 2, 3]  [aa, bb, cc]
            1   11     2         []            []
            2   12     3     [1, 2]      [cc, dd]
            3   13     4         []            []
    
            In [135]: explode(df, ['num','text'], fill_value='')
            Out[135]:
               aaa  myid num text
            0   10     1   1   aa
            1   10     1   2   bb
            2   10     1   3   cc
            3   11     2
            4   12     3   1   cc
            5   12     3   2   dd
            6   13     4
        """
        # make sure `lst_cols` is list-alike
        if (lst_cols is not None
            and len(lst_cols) > 0
            and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
            lst_cols = [lst_cols]
        # all columns except `lst_cols`
        idx_cols = df.columns.difference(lst_cols)
        # calculate lengths of lists
        lens = df[lst_cols[0]].str.len()
        # preserve original index values    
        idx = np.repeat(df.index.values, lens)
        res = (pd.DataFrame({
                    col:np.repeat(df[col].values, lens)
                    for col in idx_cols},
                    index=idx)
                 .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                                for col in lst_cols}))
        # append those rows that have empty lists
        if (lens == 0).any():
            # at least one list in cells is empty
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            res = (pd.concat([res, df.loc[lens==0, idx_cols]], sort=False)
                      .fillna(fill_value))
        # revert the original index order
        res = res.sort_index()
        # reset index if requested
        if not preserve_index:        
            res = res.reset_index(drop=True)
    
        # if the original index is a MultiIndex and it was preserved, rebuild it
        # from the index tuples and restore the original level names
        if preserve_index and isinstance(df.index, pd.MultiIndex):
            res.index = pd.MultiIndex.from_tuples(
                res.index,
                names=df.index.names
            )
        return res
    