Convert cells in dataframe with multiple values to multiple rows

前端 未结 3 1472
无人及你
无人及你 2021-01-14 15:17

My data is like this:

Name    test1     test2      Count
Emp1    X,Y        A           1
Emp2    X          A,B,C       2
Emp3    Z          C           3
<         


        
相关标签:
3条回答
  • 2021-01-14 15:25

    I don't believe it is that straightforward to adapt this answer highlighted by @wen to this question, so I'll propose a solution.

    You might create a function that takes a df, a column to be expanded and a separator for that column, and chain calls as many times as needed.

    def expand(df, col, sep=','):
        r = df[col].str.split(sep)
        d = {c: df[c].values.repeat(r.str.len(), axis=0) for c in df.columns}
        d[col] = [i for sub in r for i in sub]
        return pd.DataFrame(d)
    
    expand(expand(df, 'test1'), 'test2')
    
        Name    test1   test2   Count
    0   Emp1    X       A       1
    1   Emp1    Y       A       1
    2   Emp2    X       A       2
    3   Emp2    X       B       2
    4   Emp2    X       C       2
    5   Emp3    Z       C       3
    

    Suppose you have a

    df['test3'] = ['X1|X2|X3', 'X4', 'X5']
    

    such that

    >>> print(df)
    
        Name    test1   test2   Count   test3
    0   Emp1    X,Y     A       1       X1|X2|X3
    1   Emp2    X       A,B,C   2       X4
    2   Emp3    Z       C       3       X5
    

    Then,

    >>> expand(df,'test3', '|')
    
        Name    test1   test2   Count   test3
    0   Emp1    X,Y     A       1       X1
    1   Emp1    X,Y     A       1       X2
    2   Emp1    X,Y     A       1       X3
    3   Emp2    X       A,B,C   2       X4
    4   Emp3    Z       C       3       X5
    

    If you think columns size may increase substantially, you can define a function expand_all to avoid having something like expand(expand(expand(expand(........)))))). For example:

    def expand_all(df, cols, seps):
        ret = df
        for c,s in zip(cols,seps): ret = expand(ret,c,s)
        return ret
    
    >>> expand_all(df, ['test1', 'test2', 'test3'], [',', ',', '|'])
    
        Name    test1   test2   Count   test3
    0   Emp1    X       A       1       X1
    1   Emp1    X       A       1       X2
    2   Emp1    X       A       1       X3
    3   Emp1    Y       A       1       X1
    4   Emp1    Y       A       1       X2
    5   Emp1    Y       A       1       X3
    6   Emp2    X       A       2       X4
    7   Emp2    X       B       2       X4
    8   Emp2    X       C       2       X4
    9   Emp3    Z       C       3       X5
    

    Or however suitable ;)


    Detail:

    >>> expand(df, 'test1')
    
        Name    test1   test2   Count
    0   Emp1    X       A       1
    1   Emp1    Y       A       1
    2   Emp2    X       A,B,C   2
    3   Emp3    Z       C       3
    
    >>> expand(df, 'test2')
    
        Name    test1   test2   Count
    0   Emp1    X,Y     A       1
    1   Emp2    X       A       2
    2   Emp2    X       B       2
    3   Emp2    X       C       2
    4   Emp3    Z       C       3
    
    >>> expand(expand(df, 'test2'), 'test1') 
    
        Name    test1   test2   Count
    0   Emp1    X       A       1
    1   Emp1    Y       A       1
    2   Emp2    X       A       2
    3   Emp2    X       B       2
    4   Emp2    X       C       2
    5   Emp3    Z       C       3
    
    
    >>> expand(expand(df, 'test2'), 'test1').eq(expand(expand(df, 'test1'), 'test2')).all()
    
    Name     True
    test1    True
    test2    True
    Count    True
    dtype: bool
    
    0 讨论(0)
  • Comprehension

    pd.DataFrame(
        [(n, a, b, c)
         for n, A, B, c in zip(*map(df.get, df))
         for a in A.split(',') for b in B.split(',')],
        columns=df.columns
    )
    
       Name test1 test2  Count
    0  Emp1     X     A      1
    1  Emp1     Y     A      1
    2  Emp2     X     A      2
    3  Emp2     X     B      2
    4  Emp2     X     C      2
    5  Emp3     Z     C      3
    
    0 讨论(0)
  • 2021-01-14 15:41

    I am just fix your code , since I do not recommend the method you unnesting the dataframe , you can check the answer here, there are multiple nice way.

    df2 = df.test1.str.split(',').apply(pd.Series)
    df2.index = df.set_index(['Name', 'Count']).index
    df2=df2.stack().reset_index(['Name', 'Count'])
    df3 = df.test2.str.split(',').apply(pd.Series)
    df3.index = df.set_index(['Name', 'Count']).index
    df3=df3.stack().reset_index(['Name', 'Count'])
    

    Just do merge here

    df2.merge(df3,on=['Name', 'Count'],how='outer')
    Out[132]: 
       Name  Count 0_x 0_y
    0  Emp1      1   X   A
    1  Emp1      1   Y   A
    2  Emp2      2   X   A
    3  Emp2      2   X   B
    4  Emp2      2   X   C
    5  Emp3      3   Z   C
    
    0 讨论(0)
提交回复
热议问题