Split (explode) pandas dataframe string entry to separate rows

一向 2020-11-21 05:03

I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (as

22 Answers
  • 2020-11-21 05:17

    Based on @DMulligan's excellent solution, here is a generic vectorized (no loops) function that splits a column of a dataframe into multiple rows and merges the result back into the original dataframe. It also uses a generic change_column_order helper from another answer.

    import pandas as pd

    def change_column_order(df, col_name, index):
        """Move column `col_name` to position `index`."""
        cols = df.columns.tolist()
        cols.remove(col_name)
        cols.insert(index, col_name)
        return df[cols]

    def split_df(dataframe, col_name, sep):
        orig_col_index = dataframe.columns.tolist().index(col_name)
        orig_index_name = dataframe.index.name
        orig_columns = dataframe.columns
        dataframe = dataframe.reset_index()  # we need a natural 0-based index for proper merge
        # reset_index added exactly one column holding the old index; find its name
        index_col_name = (set(dataframe.columns) - set(orig_columns)).pop()
        # one row per split piece, indexed by the 0-based position of its source row
        df_split = pd.DataFrame(
            pd.DataFrame(dataframe[col_name].str.split(sep).tolist())
            .stack().reset_index(level=1, drop=True), columns=[col_name])
        df = dataframe.drop(col_name, axis=1)
        df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner')
        # restore the original index and its name
        df = df.set_index(index_col_name)
        df.index.name = orig_index_name
        # merge adds the column to the last place, so we need to move it back
        return change_column_order(df, col_name, orig_col_index)
    

    Example:

    df = pd.DataFrame([['a:b', 1, 4], ['c:d', 2, 5], ['e:f:g:h', 3, 6]], 
                      columns=['Name', 'A', 'B'], index=[10, 12, 13])
    df
           Name  A  B
    10      a:b  1  4
    12      c:d  2  5
    13  e:f:g:h  3  6
    
    split_df(df, 'Name', ':')
       Name  A  B
    10    a  1  4
    10    b  1  4
    12    c  2  5
    12    d  2  5
    13    e  3  6
    13    f  3  6
    13    g  3  6
    13    h  3  6
    

    Note that it preserves the original index and the column order. It also works with dataframes that have a non-sequential index.
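
The split, stack, and merge sequence described above can be traced one step at a time on the same example. This is only a sketch of the idea, not the answer's code: it assumes `str.split(expand=True)` and `DataFrame.join` in place of `.tolist()` and `pd.merge`, drops the padding with an explicit `dropna()`, and leaves the split column last instead of moving it back.

```python
import pandas as pd

df = pd.DataFrame([['a:b', 1, 4], ['c:d', 2, 5], ['e:f:g:h', 3, 6]],
                  columns=['Name', 'A', 'B'], index=[10, 12, 13])

# Step 1: split into a wide frame, one column per piece.
# Rows with fewer pieces are padded with missing values.
wide = df['Name'].str.split(':', expand=True)

# Step 2: stack the pieces into one long Series and drop the padding.
pieces = wide.stack().dropna()

# Step 3: discard the inner index level (the piece position) so that
# only the original row label remains, then name the Series.
pieces = pieces.reset_index(level=1, drop=True).rename('Name')

# Step 4: attach the pieces to the remaining columns. Index alignment
# repeats each original row once per piece.
result = df.drop(columns='Name').join(pieces)
print(result)
```

Because every step aligns on the original index labels, no temporary 0-based index is needed here.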

  • 2020-11-21 05:17

    I have come up with the following solution to this problem:

    import pandas as pd

    def iter_var1(d):
        # yield one (value, var2) pair per comma-separated value in var1
        for _, row in d.iterrows():
            for v in row["var1"].split(","):
                yield (v, row["var2"])

    new_a = pd.DataFrame.from_records(list(iter_var1(a)),
                                      columns=["var1", "var2"])
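
For a quick check, the generator can be run on a small frame. The input `a` below is hypothetical, since the question's data is not shown here; it assumes two columns named `var1` and `var2`.

```python
import pandas as pd

# Hypothetical input frame (assumed, not taken from the question).
a = pd.DataFrame({"var1": ["a,b,c", "d,e,f"], "var2": [1, 2]})

def iter_var1(d):
    # Yield one (piece, var2) pair per comma-separated piece of var1.
    for _, row in d.iterrows():
        for v in row["var1"].split(","):
            yield (v, row["var2"])

new_a = pd.DataFrame.from_records(list(iter_var1(a)), columns=["var1", "var2"])
print(new_a)
```

The result gets a fresh 0-based index, so the link back to the original row labels is lost.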
    
  • 2020-11-21 05:18

    Pandas >= 0.25

    Both Series and DataFrame define an .explode() method that expands list-likes into separate rows. See the docs section on exploding a list-like column.

    Since you have a column of comma-separated strings, split each string on the comma to get a list of elements, then call explode on that column.

    df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})
    df
        var1  var2
    0  a,b,c     1
    1  d,e,f     2
    
    df.assign(var1=df['var1'].str.split(',')).explode('var1')
    
      var1  var2
    0    a     1
    0    b     1
    0    c     1
    1    d     2
    1    e     2
    1    f     2
    

    Note that as of pandas 0.25, explode only works on one column at a time.
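
Later releases relax this: from pandas 1.3 on, `DataFrame.explode` also accepts a list of columns, provided the lists in each row have matching lengths. A minimal sketch, assuming pandas >= 1.3 and a hypothetical second string column `var3`:

```python
import pandas as pd

# Hypothetical frame with two comma-separated columns of matching lengths.
df = pd.DataFrame({'var1': ['a,b', 'c,d,e'],
                   'var3': ['x,y', 'z,v,w'],
                   'var2': [1, 2]})

# Split both columns into lists, then explode them together so that
# the i-th piece of var1 stays paired with the i-th piece of var3.
out = (df.assign(var1=df['var1'].str.split(','),
                 var3=df['var3'].str.split(','))
         .explode(['var1', 'var3']))
print(out)
```

If the lists in a row differ in length, pandas raises an error rather than padding, so this fits only columns that are paired element by element.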


    NaNs and empty lists get the treatment they deserve without you having to jump through hoops to get it right.

    df = pd.DataFrame({'var1': ['d,e,f', '', np.nan], 'var2': [1, 2, 3]})
    df
        var1  var2
    0  d,e,f     1
    1            2
    2    NaN     3
    
    df['var1'].str.split(',')
    
    0    [d, e, f]
    1           []
    2          NaN
    
    df.assign(var1=df['var1'].str.split(',')).explode('var1')
    
      var1  var2
    0    d     1
    0    e     1
    0    f     1
    1          2  # '' splits to [''], which explodes to a single empty string
    2  NaN     3  # NaN left untouched
    

    This is a serious advantage over ravel- and repeat-based solutions (which ignore empty lists completely and choke on NaNs).
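
The edge cases are easy to verify directly. The sketch below separates three cases: an empty string (which `str.split` turns into a list holding one empty string), a truly empty list, and NaN. The final filter is an optional extra, not part of the answer above.

```python
import numpy as np
import pandas as pd

s = pd.Series(['d,e,f', '', np.nan])
split = s.str.split(',')

# The empty string splits to [''], a one-element list.
print(split.iloc[1])

# Exploding keeps that empty string as a row and leaves NaN as NaN.
exploded = split.explode()
print(exploded)

# A genuinely empty list, by contrast, explodes to NaN.
empty = pd.Series([[], ['a', 'b']]).explode()
print(empty)

# Optional cleanup: drop both missing values and empty strings.
cleaned = exploded[exploded.notna() & (exploded != '')]
print(cleaned)
```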

  • 2020-11-21 05:22

    How about something like this:

    In [55]: pd.concat([pd.Series(row['var2'], row['var1'].split(','))
                        for _, row in a.iterrows()]).reset_index()
    Out[55]: 
      index  0
    0     a  1
    1     b  1
    2     c  1
    3     d  2
    4     e  2
    5     f  2
    

    Then you just have to rename the columns.
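
The rename step might look like the sketch below. The input frame `a` is an assumption (two columns, `var1` and `var2`); the column labels `'index'` and `0` are the ones `reset_index()` produced in the output above.

```python
import pandas as pd

# Hypothetical input frame (assumed, not taken from the question).
a = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})

# One Series per row: the split pieces become the index and the
# var2 value is broadcast across them.
pieces = [pd.Series(row['var2'], index=row['var1'].split(','))
          for _, row in a.iterrows()]

# reset_index() yields columns named 'index' and 0; map them back.
out = (pd.concat(pieces)
         .reset_index()
         .rename(columns={'index': 'var1', 0: 'var2'}))
print(out)
```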

  • 2020-11-21 05:22

    I came up with a solution for dataframes with arbitrary numbers of columns (while still only separating one column's entries at a time).

    import pandas

    def splitDataFrameList(df, target_column, separator):
        '''Split the entries of one column into separate rows.

        df = dataframe to split
        target_column = the column containing the values to split
        separator = the symbol used to perform the split

        returns: a dataframe with each entry of the target column separated, with each element moved into a new row.
        The values in the other columns are duplicated across the newly divided rows.
        '''
        def splitListToRows(row,row_accumulator,target_column,separator):
            split_row = row[target_column].split(separator)
            for s in split_row:
                new_row = row.to_dict()
                new_row[target_column] = s
                row_accumulator.append(new_row)
        new_rows = []
        df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
        new_df = pandas.DataFrame(new_rows)
        return new_df
    
  • 2020-11-21 05:23

    Here's a function I wrote for this common task. It's more efficient than the Series/stack methods. Column order and names are retained.

    def tidy_split(df, column, sep='|', keep=False):
        """
        Split the values of a column and expand so the new DataFrame has one split
        value per row. Rows where the column is missing are dropped.
    
        Params
        ------
        df : pandas.DataFrame
            dataframe with the column to split and expand
        column : str
            the column to split and expand
        sep : str
            the string used to split the column's values
        keep : bool
            whether to retain the presplit value as its own row
    
        Returns
        -------
        pandas.DataFrame
            Returns a dataframe with the same columns as `df`.
        """
        indexes = list()
        new_values = list()
        df = df.dropna(subset=[column])
        for i, presplit in enumerate(df[column].astype(str)):
            values = presplit.split(sep)
            if keep and len(values) > 1:
                indexes.append(i)
                new_values.append(presplit)
            for value in values:
                indexes.append(i)
                new_values.append(value)
        new_df = df.iloc[indexes, :].copy()
        new_df[column] = new_values
        return new_df
    

    With this function, the original question is as simple as:

    tidy_split(a, 'var1', sep=',')
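
The core trick in tidy_split is to repeat each row's position once per piece, select those positions with `iloc`, and then overwrite the column. That idea can be sketched without the Python-level loop. This is an illustrative variant, not the function above: it assumes the column has no missing values and uses a hypothetical input frame `a`.

```python
import numpy as np
import pandas as pd

# Hypothetical input frame (assumed); var1 must not contain NaN here.
a = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})

# Lists of pieces, one list per row.
parts = a['var1'].str.split(',')

# Repeat each row position once per piece: [0, 0, 0, 1, 1, 1].
positions = np.repeat(np.arange(len(a)), parts.str.len())

# Duplicate the rows, then overwrite the column with the flattened pieces.
out = a.iloc[positions].copy()
out['var1'] = [piece for pieces in parts for piece in pieces]
print(out)
```

Column order and names are retained, and the original index labels are repeated rather than reset.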
    