I have a Python Pandas DataFrame like this (UCSC schema for NCBI RefSeq):
chrom exonStart exonEnds name
chr1 100,200,300 110,210,310 gen1
chr1 500,700 600,800 gen2
chr2 50,60,70,80 55,65,75,85 gen3
This is one way, using numpy and itertools.chain.
The idea is to first split your comma-separated fields into lists, then construct a result DataFrame, repeating or chaining values where necessary.
import numpy as np
import pandas as pd
from itertools import chain
df['exonStart'] = df['exonStart'].str.split(',')
df['exonEnds'] = df['exonEnds'].str.split(',')
lens = list(map(len, df['exonStart']))
res = pd.DataFrame({'chrom': np.repeat(df['chrom'], lens),
'exonStart': list(chain.from_iterable(df['exonStart'])),
'exonEnds': list(chain.from_iterable(df['exonEnds'])),
'name': np.repeat(df['name'], lens)})
print(res)
# chrom exonEnds exonStart name
# 0 chr1 110 100 gen1
# 0 chr1 210 200 gen1
# 0 chr1 310 300 gen1
# 1 chr1 600 500 gen2
# 1 chr1 800 700 gen2
# 2 chr2 55 50 gen3
# 2 chr2 65 60 gen3
# 2 chr2 75 70 gen3
# 2 chr2 85 80 gen3
Note that you may wish to convert your numeric columns to int at the end of this process.
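As a rough sketch of that last step: the split pieces are still strings, so something like the following could cast them. The small `res` frame here is only a stand-in for the result built above, not the full data.

```python
import pandas as pd

# Stand-in for the `res` frame built above; after str.split the
# coordinate columns hold strings, not numbers.
res = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr1'],
    'exonStart': ['100', '200', '300'],
    'exonEnds': ['110', '210', '310'],
    'name': ['gen1', 'gen1', 'gen1'],
})

# Cast only the coordinate columns; chrom and name stay as they are.
res = res.astype({'exonStart': int, 'exonEnds': int})

# Arithmetic on the coordinates now works, e.g. a per-exon length.
exon_len = res['exonEnds'] - res['exonStart']
```

If some fields could be empty (for example because of a trailing comma), `astype(int)` will raise; `pd.to_numeric(..., errors='coerce')` may be a better fit in that case.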
Use zip and split within a comprehension:
pd.DataFrame([
[c, s, e, n]
for c, S, E, n in df.itertuples(index=False)
for s, e in zip(S.split(','), E.split(','))
], columns=df.columns)
chrom exonStart exonEnds name
0 chr1 100 110 gen1
1 chr1 200 210 gen1
2 chr1 300 310 gen1
3 chr1 500 600 gen2
4 chr1 700 800 gen2
5 chr2 50 55 gen3
6 chr2 60 65 gen3
7 chr2 70 75 gen3
8 chr2 80 85 gen3
I came up with this, using stack and unstack:
df.set_index(['chrom','name']).apply(lambda x : x.str.split(','), axis=1).\
stack().apply(pd.Series).stack().unstack(-2).\
reset_index().drop('level_2', axis=1)
Out[1201]:
chrom name exonStart exonEnds
0 chr1 gen1 100 110
1 chr1 gen1 200 210
2 chr1 gen1 300 310
3 chr1 gen2 500 600
4 chr1 gen2 700 800
5 chr2 gen3 50 55
6 chr2 gen3 60 65
7 chr2 gen3 70 75
8 chr2 gen3 80 85
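For what it's worth, on newer pandas the same reshaping might also be sketched with DataFrame.explode, which none of the approaches above use. This assumes pandas 1.3 or later (where explode accepts a list of columns) and that both fields in a row always split into the same number of pieces. The sample frame is reconstructed from the outputs shown in this thread.

```python
import pandas as pd

# Sample frame assumed from the outputs shown in this thread.
df = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr2'],
    'exonStart': ['100,200,300', '500,700', '50,60,70,80'],
    'exonEnds': ['110,210,310', '600,800', '55,65,75,85'],
    'name': ['gen1', 'gen2', 'gen3'],
})

out = (
    # Turn each comma-separated string into a list of strings.
    df.assign(
        exonStart=df['exonStart'].str.split(','),
        exonEnds=df['exonEnds'].str.split(','),
    )
    # Explode both list columns together, one output row per exon.
    .explode(['exonStart', 'exonEnds'], ignore_index=True)
    # The exploded values are still strings; cast them to int.
    .astype({'exonStart': int, 'exonEnds': int})
)
print(out)
```

The column order of the input is kept, and ignore_index=True gives a fresh 0..n-1 index instead of repeating the original row labels.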