I need to add an index column to a dataframe with three very simple constraints:
start from 0
be sequential
be deterministic
What I mean is: how can I add a column holding an ordered sequence that increases by exactly 1, running from 0 to df.count() - 1? (from comments)
You can use row_number() here, but for that you'd need to specify an orderBy(). Since you don't have an ordering column, just use monotonically_increasing_id().
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql import Window

# Rank rows by their generated ID, then shift the 1-based rank to start at 0.
df = df.withColumn(
    "index",
    row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
)
Also, row_number() starts at 1, so you have to subtract 1 to make it start from 0. The last value will be df.count() - 1.
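To see why monotonically_increasing_id() is not sequential on its own, and why ranking by it produces 0..n-1, here is a minimal pure-Python sketch that runs without Spark. It assumes the bit layout described in the Spark documentation (partition ID in the upper 31 bits, record number within the partition in the lower 33 bits). The helper simulated_monotonic_id and the partition sizes are hypothetical and exist only for illustration.

```python
# Pure-Python simulation; no Spark required.
# Assumption: IDs follow the layout documented for monotonically_increasing_id().

def simulated_monotonic_id(partition_id: int, record_number: int) -> int:
    """Partition ID in the upper 31 bits, record number in the lower 33 bits."""
    return (partition_id << 33) + record_number

# Hypothetical data: two partitions holding 3 and 2 rows.
partition_sizes = [3, 2]

raw_ids = [
    simulated_monotonic_id(p, r)
    for p, size in enumerate(partition_sizes)
    for r in range(size)
]
# raw_ids == [0, 1, 2, 8589934592, 8589934593]: increasing, but with a gap.

# Mimic row_number() over an ordering by those IDs: rank 1..n, then subtract 1.
rank = {raw_id: pos for pos, raw_id in enumerate(sorted(raw_ids), start=1)}
index = [rank[raw_id] - 1 for raw_id in raw_ids]
# index == [0, 1, 2, 3, 4]
```

The raw IDs jump at each partition boundary, so they satisfy "increasing" but not "sequential". Ranking them closes the gaps, which is what the row_number() window does in the approach above.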
I don't want to zip with index and then have to separate the previously separated columns that are now in a single column
You can use zipWithIndex if you follow it with a call to map, to avoid having all of the separated columns turn into a single column:
cols = df.columns
# zipWithIndex() yields (row, index) pairs; put the index first and flatten the row.
df = df.rdd.zipWithIndex().map(lambda row: (row[1],) + tuple(row[0])).toDF(["index"] + cols)
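The reshaping done by the map call can be checked without Spark. This is a rough pure-Python sketch in which plain lists stand in for the RDD; the column names and rows are hypothetical sample data, and enumerate is used as a stand-in for zipWithIndex, which pairs each element with its position as (element, index).

```python
# Pure-Python stand-in for the RDD pipeline; no Spark required.
# Hypothetical sample data playing the role of df.columns and df.rdd.
cols = ["name", "score"]
rows = [("a", 10), ("b", 20), ("c", 30)]

# zipWithIndex() produces (row, index) pairs.
zipped = [(row, i) for i, row in enumerate(rows)]

# Same lambda as in the answer: index first, then the row's own fields.
flatten = lambda row: (row[1],) + tuple(row[0])
flattened = [flatten(pair) for pair in zipped]

# Column names that would be passed to toDF().
header = ["index"] + cols
```

Each resulting tuple has the index as its first field followed by the original columns, so toDF(["index"] + cols) gets one name per field instead of a single nested column.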