I have a dataframe df:
+------+----------+--------------------+
|SiteID| LastRecID| Col_to_split|
+------+----------+--------------------+
| 2|105
As of Spark 2.1.0, you can use posexplode,
which unnests an array column and also outputs the index of each element (using the data from @Herve):
import pyspark.sql.functions as F

df.select(
    F.col("LastRecID").alias("RecID"),
    F.posexplode(F.col("coltosplit")).alias("index", "value")
).show()
+-----+-----+-----+
|RecID|index|value|
+-----+-----+-----+
|10526| 0| 214|
|10526| 1| 207|
|10526| 2| 206|
|10526| 3| 205|
|10896| 0| 213|
|10896| 1| 208|
+-----+-----+-----+
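To make clear what posexplode produces, here is a minimal plain-Python sketch of the same idea, with no Spark involved. The helper name posexplode_rows is hypothetical and only for illustration; it assumes each row is a dict holding an id and a list, mirroring @Herve's data.

```python
# Plain-Python illustration of what posexplode does (not Spark code).
# `posexplode_rows` is a hypothetical helper, not part of any library.
rows = [
    {"LastRecId": 10526, "coltosplit": [214, 207, 206, 205]},
    {"LastRecId": 10896, "coltosplit": [213, 208]},
]


def posexplode_rows(rows, id_key, array_key):
    """Yield one (id, index, value) tuple per array element of each row."""
    for row in rows:
        # enumerate gives the 0-based position, like the index from posexplode
        for index, value in enumerate(row[array_key]):
            yield (row[id_key], index, value)


result = list(posexplode_rows(rows, "LastRecId", "coltosplit"))
for rec_id, index, value in result:
    print(rec_id, index, value)
```

Each input row becomes as many output rows as its array has elements, and the index always follows the position in the original array.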
I quickly tried this with Spark 2.0. You can change the query slightly if you want a different ordering.
d = [{'SiteID': '2', 'LastRecId': 10526, 'coltosplit': [214,207,206,205]}, {'SiteID': '2', 'LastRecId': 10896, 'coltosplit': [213,208]}]
df = spark.createDataFrame(d)
+---------+------+--------------------+
|LastRecId|SiteID| coltosplit|
+---------+------+--------------------+
| 10526| 2|[214, 207, 206, 205]|
| 10896| 2| [213, 208]|
+---------+------+--------------------+
query = """
select LastRecId as RecID,
(row_number() over (partition by LastRecId order by 1)) - 1 as index,
t as Value
from test
LATERAL VIEW explode(coltosplit) test AS t
"""
df.createTempView("test")
spark.sql(query).show()
+-----+-----+-----+
|RecID|index|Value|
+-----+-----+-----+
|10896| 0| 213|
|10896| 1| 208|
|10526| 0| 214|
|10526| 1| 207|
|10526| 2| 206|
|10526| 3| 205|
+-----+-----+-----+
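So basically I explode the list into a new column and then apply row_number over it, partitioned by LastRecId. Note that ordering the window by a constant does not guarantee that the index follows the original array order, so check this on your data.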
Hope this helps