Split (explode) pandas dataframe string entry to separate rows

一向 2020-11-21 05:03

I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (as

22 answers
  •  梦毁少年i
    2020-11-21 05:17

    Based on @DMulligan's excellent solution, here is a generic, vectorized (no explicit loops) function that splits one column of a dataframe into multiple rows and merges the result back into the original dataframe. It also uses a handy generic change_column_order function borrowed from another answer.

    import pandas as pd

    def change_column_order(df, col_name, index):
        # move col_name to position `index`, keeping the other columns in order
        cols = df.columns.tolist()
        cols.remove(col_name)
        cols.insert(index, col_name)
        return df[cols]

    def split_df(dataframe, col_name, sep):
        orig_col_index = dataframe.columns.tolist().index(col_name)
        orig_index_name = dataframe.index.name
        orig_columns = dataframe.columns
        dataframe = dataframe.reset_index()  # we need a natural 0-based index for proper merge
        # the only column added by reset_index holds the original index values
        index_col_name = (set(dataframe.columns) - set(orig_columns)).pop()
        # one column per split part, stacked into a single long column;
        # dropna() discards the padding of shorter rows in case stack() keeps it
        df_split = pd.DataFrame(
            pd.DataFrame(dataframe[col_name].str.split(sep).tolist())
            .stack().dropna().reset_index(level=1, drop=True), columns=[col_name])
        df = dataframe.drop(col_name, axis=1)
        df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner')
        df = df.set_index(index_col_name)
        df.index.name = orig_index_name
        # merge adds the column to the last place, so we need to move it back
        return change_column_order(df, col_name, orig_col_index)
    

    Example:

    df = pd.DataFrame([['a:b', 1, 4], ['c:d', 2, 5], ['e:f:g:h', 3, 6]], 
                      columns=['Name', 'A', 'B'], index=[10, 12, 13])
    df
           Name  A  B
    10      a:b  1  4
    12      c:d  2  5
    13  e:f:g:h  3  6

    split_df(df, 'Name', ':')
       Name  A  B
    10    a  1  4
    10    b  1  4
    12    c  2  5
    12    d  2  5
    13    e  3  6
    13    f  3  6
    13    g  3  6
    13    h  3  6
    

    Note that it preserves the original index and the order of the columns. It also works with dataframes that have a non-sequential index.
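
For comparison, the same split-and-expand result can probably be reached more directly with DataFrame.explode, which pandas has provided since version 0.25. The sketch below is not part of the answer above: split_df_explode is a hypothetical name chosen for illustration, and it assumes every entry in the column is a string.

```python
import pandas as pd

def split_df_explode(dataframe, col_name, sep):
    # Hypothetical helper, not from the answer above.
    # Replace each string with a list of its parts (assign overwrites the
    # column in place, so the column order is unchanged), then let explode
    # emit one row per list element, repeating the original index label.
    return dataframe.assign(
        **{col_name: dataframe[col_name].str.split(sep)}
    ).explode(col_name)

df = pd.DataFrame([['a:b', 1, 4], ['c:d', 2, 5], ['e:f:g:h', 3, 6]],
                  columns=['Name', 'A', 'B'], index=[10, 12, 13])
result = split_df_explode(df, 'Name', ':')
print(result)
```

On the example dataframe this should give the same eight rows as split_df, with the index labels 10, 12 and 13 repeated and Name still the first column, without needing the reset_index/merge round trip.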
