Splitting multiple columns into rows in pandas dataframe

后端 未结 5 1671
独厮守ぢ
独厮守ぢ 2020-12-03 09:16

I have a pandas dataframe as follows:

ticker    account      value         date
aa       assets       100,200       20121231, 20131231
bb       liabilities           


        
相关标签:
5条回答
  • 2020-12-03 09:29

    Pandas >= 0.25

    df.value = df.value.str.split(',')
    df.date = df.date.str.split(',')
    df = df.explode('value').explode("date").reset_index(drop=True)
    

    df:

        ticker  account      value  date
    0   aa      assets       100    20121231
    1   aa      assets       100    20131231
    2   aa      assets       200    20121231
    3   aa      assets       200    20131231
    4   bb      liabilities  50     20141231
    5   bb      liabilities  50     20131231
    6   bb      liabilities  50     20141231
    7   bb      liabilities  50     20131231
    
    0 讨论(0)
  • 2020-12-03 09:35

    You can first split columns, create Series by stack and remove whitespaces by strip:

    s1 = df.value.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
    s2 = df.date.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
    

    Then concat both Series to df1:

    df1 = pd.concat([s1,s2], axis=1, keys=['value','date'])
    

    Remove old columns value and date and join:

    print (df.drop(['value','date'], axis=1).join(df1).reset_index(drop=True))
      ticker      account value      date
    0     aa       assets   100  20121231
    1     aa       assets   200  20131231
    2     bb  liabilities    50  20141231
    3     bb  liabilities   150  20131231
    
    0 讨论(0)
  • 2020-12-03 09:35

    Because I'm too new, I'm not allowed to write a comment, so I write an "answer".

    @titipata your answer worked really good, but in my opinion there is a small "mistake" in your code I'm not able to find for my self.

    I work with the example from this question and changed just the values.

    df = pd.DataFrame([['title1', 'publisher1', '1.1,1.2', '1'],
                   ['title2', 'publisher2', '2', '2.1,2.2']],
                  columns=['titel', 'publisher', 'print', 'electronic'])
    
    explode(df, ['print', 'electronic'])
    
        publisher   titel   print   electronic
    0   publisher1  title1  1.1     1
    1   publisher1  title1  1.2     2.1
    2   publisher2  title2  2       2.2
    

    As you see, in the column 'electronic' should be in row '1' the value '1' and not '2.1'.

    Because of that, the hole DataSet would change. I hope someone could help me to find a solution for this.

    0 讨论(0)
  • 2020-12-03 09:37

    I wrote explode function based on previous answers. It might be useful for anyone who want to grab and use it quickly.

    def explode(df, cols, split_on=','):
        """
        Explode dataframe on the given column, split on given delimeter
        """
        cols_sep = list(set(df.columns) - set(cols))
        df_cols = df[cols_sep]
        explode_len = df[cols[0]].str.split(split_on).map(len)
        repeat_list = []
        for r, e in zip(df_cols.as_matrix(), explode_len):
            repeat_list.extend([list(r)]*e)
        df_repeat = pd.DataFrame(repeat_list, columns=cols_sep)
        df_explode = pd.concat([df[col].str.split(split_on, expand=True).stack().str.strip().reset_index(drop=True)
                                for col in cols], axis=1)
        df_explode.columns = cols
        return pd.concat((df_repeat, df_explode), axis=1)
    

    example given from @piRSquared:

    df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
                       ['bb', 'liabilities', '50,50', '20141231,20131231']],
                      columns=['ticker', 'account', 'value', 'date'])
    explode(df, ['value', 'date'])
    

    output

    +-----------+------+-----+--------+
    |    account|ticker|value|    date|
    +-----------+------+-----+--------+
    |     assets|    aa|  100|20121231|
    |     assets|    aa|  200|20131231|
    |liabilities|    bb|   50|20141231|
    |liabilities|    bb|   50|20131231|
    +-----------+------+-----+--------+
    
    0 讨论(0)
  • 2020-12-03 09:41

    I'm noticing this question a lot. That is, how do I split this column that has a list into multiple rows? I've seen it called exploding. Here are some links:

    • https://stackoverflow.com/a/38432346/2336654
    • https://stackoverflow.com/a/38499036/2336654

    So I wrote a function that will do it.

    def explode(df, columns):
        idx = np.repeat(df.index, df[columns[0]].str.len())
        a = df.T.reindex_axis(columns).values
        concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
        p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
        return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)
    

    But before we can use it, we need lists (or iterable) in a column.

    Setup

    df = pd.DataFrame([['aa', 'assets',      '100,200', '20121231,20131231'],
                       ['bb', 'liabilities', '50,50',   '20141231,20131231']],
                      columns=['ticker', 'account', 'value', 'date'])
    
    df
    

    split value and date columns:

    df.value = df.value.str.split(',')
    df.date = df.date.str.split(',')
    
    df
    

    Now we could explode on either column or both, one after the other.

    Solution

    explode(df, ['value','date'])
    


    Timing

    I removed strip from @jezrael's timing because I could not effectively add it to mine. This is a necessary step for this question as OP has spaces in strings after commas. I was aiming at providing a generic way to explode a column given it already has iterables in it and I think I've accomplished that.

    code

    def get_df(n=1):
        return pd.DataFrame([['aa', 'assets',      '100,200,200', '20121231,20131231,20131231'],
                             ['bb', 'liabilities', '50,50',   '20141231,20131231']] * n,
                            columns=['ticker', 'account', 'value', 'date'])
    

    small 2 row sample

    medium 200 row sample

    large 2,000,000 row sample

    0 讨论(0)
提交回复
热议问题