pandas: records with lists to separate rows

花落未央 2021-01-16 07:05

I have a Python Pandas DataFrame like this (UCSC schema for NCBI RefSeq):

chrom   exonStart     exonEnds      name
chr1    100,200,300   110,210,310   gen1
c         


        
3 Answers
  • 2021-01-16 07:47

    This is one way using numpy and itertools.chain.

    The idea is to first split your comma-separated fields into lists, then construct a result dataframe, repeating or chaining values where necessary.

    import numpy as np
    import pandas as pd
    from itertools import chain
    
    df['exonStart'] = df['exonStart'].str.split(',')
    df['exonEnds'] = df['exonEnds'].str.split(',')
    
    lens = list(map(len, df['exonStart']))
    
    res = pd.DataFrame({'chrom': np.repeat(df['chrom'], lens),
                        'exonStart': list(chain.from_iterable(df['exonStart'])),
                        'exonEnds': list(chain.from_iterable(df['exonEnds'])),
                        'name': np.repeat(df['name'], lens)})
    
    print(res)
    
    #   chrom exonEnds exonStart  name
    # 0  chr1      110       100  gen1
    # 0  chr1      210       200  gen1
    # 0  chr1      310       300  gen1
    # 1  chr1      600       500  gen2
    # 1  chr1      800       700  gen2
    # 2  chr2       55        50  gen3
    # 2  chr2       65        60  gen3
    # 2  chr2       75        70  gen3
    # 2  chr2       85        80  gen3
    

    Note you may wish to convert your numeric columns to int at the end of this process.
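
    As a rough sketch of that last step: after `str.split`, the exon values are still strings, so one option is `astype` on the two numeric columns. The small `res` frame below is a hypothetical stand-in for the result above (same column names, repeated index), not the full output.

```python
import pandas as pd

# Hypothetical stand-in for `res`: values are strings, index repeats per source row.
res = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1", "chr1"],
        "exonStart": ["100", "200", "300"],
        "exonEnds": ["110", "210", "310"],
        "name": ["gen1", "gen1", "gen1"],
    },
    index=[0, 0, 0],
)

# Convert the two coordinate columns to integers, leaving the others untouched.
res = res.astype({"exonStart": int, "exonEnds": int})

# Optionally replace the repeated index with a fresh 0..n-1 range.
res = res.reset_index(drop=True)

print(res.dtypes)
```

    Resetting the index is optional; it only matters if later code relies on unique row labels.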

  • 2021-01-16 08:06

    Use zip and split within a comprehension:

    pd.DataFrame([
        [c, s, e, n]
        for c, S, E, n in df.itertuples(index=False)
        for s, e in zip(S.split(','), E.split(','))
    ], columns=df.columns)
    
      chrom exonStart exonEnds  name
    0  chr1       100      110  gen1
    1  chr1       200      210  gen1
    2  chr1       300      310  gen1
    3  chr1       500      600  gen2
    4  chr1       700      800  gen2
    5  chr2        50       55  gen3
    6  chr2        60       65  gen3
    7  chr2        70       75  gen3
    8  chr2        80       85  gen3
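
    One thing to keep in mind is that `zip` stops at the shorter of its inputs, so a row whose start and end lists differ in length would silently lose values. A minimal sketch of the same loop with an explicit check follows; the function name `split_exons` and the sample `df` (reconstructed from the output shown in the answers) are illustrative assumptions.

```python
import pandas as pd

# Sample data reconstructed from the outputs shown in the answers (an assumption).
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "exonStart": ["100,200,300", "500,700", "50,60,70,80"],
    "exonEnds": ["110,210,310", "600,800", "55,65,75,85"],
    "name": ["gen1", "gen2", "gen3"],
})

def split_exons(frame):
    """Expand comma-separated exonStart/exonEnds into one row per exon."""
    rows = []
    for c, S, E, n in frame.itertuples(index=False):
        starts, ends = S.split(","), E.split(",")
        # zip would truncate to the shorter list, so fail loudly instead.
        if len(starts) != len(ends):
            raise ValueError(f"{n}: {len(starts)} starts but {len(ends)} ends")
        rows.extend([c, s, e, n] for s, e in zip(starts, ends))
    return pd.DataFrame(rows, columns=frame.columns)

result = split_exons(df)
print(result)
```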
    
  • 2021-01-16 08:07

    I came up with this, using stack and unstack:

    df.set_index(['chrom','name']).apply(lambda x : x.str.split(','), axis=1).\
       stack().apply(pd.Series).stack().unstack(-2).\
           reset_index().drop(columns='level_2')
    Out[1201]: 
      chrom  name exonStart exonEnds
    0  chr1  gen1       100      110
    1  chr1  gen1       200      210
    2  chr1  gen1       300      310
    3  chr1  gen2       500      600
    4  chr1  gen2       700      800
    5  chr2  gen3        50       55
    6  chr2  gen3        60       65
    7  chr2  gen3        70       75
    8  chr2  gen3        80       85
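
    On pandas 1.3 or later, a similar reshaping can probably be written more directly with `DataFrame.explode`, which accepts a list of columns. This is a different technique from the stack/unstack chain above, offered as a sketch; it assumes each row has the same number of starts and ends, and the sample `df` is reconstructed from the output shown in the answers.

```python
import pandas as pd

# Sample data reconstructed from the outputs shown in the answers (an assumption).
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "exonStart": ["100,200,300", "500,700", "50,60,70,80"],
    "exonEnds": ["110,210,310", "600,800", "55,65,75,85"],
    "name": ["gen1", "gen2", "gen3"],
})

list_cols = ["exonStart", "exonEnds"]

# Turn each comma-separated string into a list of strings.
out = df.assign(**{c: df[c].str.split(",") for c in list_cols})

# Explode both list columns together; requires pandas >= 1.3 and
# equal list lengths within each row.
out = out.explode(list_cols, ignore_index=True)

# The exploded values are still strings, so cast them to integers.
out = out.astype({c: int for c in list_cols})

print(out)
```

    Unlike the chain above, this keeps the original column order and does not need the intermediate index level to be dropped.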
    