I have a Python Pandas DataFrame like this (UCSC schema for NCBI RefSeq):
chrom exonStart exonEnds name
chr1 100,200,300 110,210,310 gen1
chr1 500,700 600,800 gen2
chr2 50,60,70,80 55,65,75,85 gen3
This is one way using numpy and itertools.chain.
The idea is to first split your comma-separated fields into lists. Then construct a results dataframe, repeating or chaining values where necessary.
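As a quick illustration of those two building blocks in isolation, here is a minimal sketch on toy values (the names `values` and `nested` are made up for this example and are not the question's data). `np.repeat` duplicates each scalar once per list element, and `chain.from_iterable` flattens the lists:

```python
import numpy as np
from itertools import chain

values = ['a', 'b']            # one scalar per row (toy data)
nested = [[1, 2, 3], [4, 5]]   # one list per row (toy data)

# How many times each scalar must be repeated: once per list element.
lens = [len(x) for x in nested]

repeated = np.repeat(values, lens).tolist()     # scalars aligned with the flat list
flattened = list(chain.from_iterable(nested))   # all list elements in order

print(repeated)
print(flattened)
```

Both results have the same length, which is what lets them sit side by side as columns. The code below applies the same two operations to the DataFrame columns.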
import numpy as np
import pandas as pd
from itertools import chain

# split the comma-separated strings into lists
df['exonStart'] = df['exonStart'].str.split(',')
df['exonEnds'] = df['exonEnds'].str.split(',')

# number of exons in each row
lens = list(map(len, df['exonStart']))

# repeat the scalar columns, flatten the list columns
res = pd.DataFrame({'chrom': np.repeat(df['chrom'], lens),
                    'exonStart': list(chain.from_iterable(df['exonStart'])),
                    'exonEnds': list(chain.from_iterable(df['exonEnds'])),
                    'name': np.repeat(df['name'], lens)})

print(res)
# chrom exonEnds exonStart name
# 0 chr1 110 100 gen1
# 0 chr1 210 200 gen1
# 0 chr1 310 300 gen1
# 1 chr1 600 500 gen2
# 1 chr1 800 700 gen2
# 2 chr2 55 50 gen3
# 2 chr2 65 60 gen3
# 2 chr2 75 70 gen3
# 2 chr2 85 80 gen3
Note that the split values are still strings, so you may wish to convert your numeric columns to int at the end of this process.
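A minimal end-to-end sketch of that conversion, under two assumptions: the sample frame is rebuilt here from the printed output above, and `np.repeat` is applied to plain arrays (via `to_numpy()`), so the result gets a fresh 0..n-1 index rather than the repeated index shown earlier. The conversion itself uses `DataFrame.astype` with a per-column mapping:

```python
import numpy as np
import pandas as pd
from itertools import chain

# Sample data rebuilt from the printed output (an assumption for this sketch).
df = pd.DataFrame({'chrom': ['chr1', 'chr1', 'chr2'],
                   'exonStart': ['100,200,300', '500,700', '50,60,70,80'],
                   'exonEnds': ['110,210,310', '600,800', '55,65,75,85'],
                   'name': ['gen1', 'gen2', 'gen3']})

# Split the comma-separated strings into lists.
starts = df['exonStart'].str.split(',')
ends = df['exonEnds'].str.split(',')
lens = list(map(len, starts))

# Repeat the scalar columns and flatten the list columns.
res = pd.DataFrame({'chrom': np.repeat(df['chrom'].to_numpy(), lens),
                    'exonStart': list(chain.from_iterable(starts)),
                    'exonEnds': list(chain.from_iterable(ends)),
                    'name': np.repeat(df['name'].to_numpy(), lens)})

# The flattened values are still strings; cast the coordinate columns to int.
res = res.astype({'exonStart': int, 'exonEnds': int})

print(res.dtypes)
```

After the cast, arithmetic such as `res['exonEnds'] - res['exonStart']` works on numbers rather than failing on strings.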