I have an RDD in which each element has the following format:
['979500797', ' 979500797,260973244733,2014-05-0402:05:12,645/01/105/9931,78,645/01/105/993
What you need here is a flatMap. flatMap takes a function that returns a sequence and concatenates the results.
df_feat3 = df_feat2.flatMap(lambda (x, y): ((x, v) for v in y.split(';')))
As a side note, I would avoid using tuple parameters. It is a cool feature, but it is no longer available in Python 3. See PEP 3113.
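For Python 3, one way to get the same result is to unpack the pair inside the function body instead of in the parameter list. The sketch below is only illustrative: expand is a hypothetical helper name, the sample records are made up, and itertools.chain stands in for flatMap so the snippet runs without a Spark installation.

```python
from itertools import chain


def expand(pair):
    """Turn (key, 'a;b;c') into [(key, 'a'), (key, 'b'), (key, 'c')]."""
    key, values = pair  # unpack in the body, not in the signature (PEP 3113)
    return [(key, v) for v in values.split(';')]


# Hypothetical sample records, not taken from the question's data.
records = [('k1', 'a;b'), ('k2', 'c')]

# Local stand-in for flatMap: apply expand to each record, then concatenate.
flattened = list(chain.from_iterable(expand(r) for r in records))
print(flattened)

# With Spark, the equivalent call would be:
#   df_feat3 = df_feat2.flatMap(expand)
```

A named function also leaves room for extra cleanup, such as stripping whitespace from each value, without cramming it into a lambda.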