Pyspark add sequential and deterministic index to dataframe

梦毁少年i 2020-11-30 13:16

I need to add an index column to a dataframe with three very simple constraints:

  • start from 0

  • be sequential

  • be deterministic

1 Answer
  • 2020-11-30 13:43

    What I mean is: how can I add a column with an ordered, monotonically increasing by 1 sequence 0:df.count? (from comments)

    You can use row_number() here, but for that you'd need to specify an orderBy(). Since you don't have an ordering column, just use monotonically_increasing_id().

    from pyspark.sql.functions import row_number, monotonically_increasing_id
    from pyspark.sql import Window
    
    df = df.withColumn(
        "index",
        row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
    )
    

    Also, row_number() starts at 1, so you'd have to subtract 1 to have it start from 0. The last value will be df.count() - 1.
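
    As a minimal, self-contained sketch (the sample DataFrame and its column names are hypothetical, just for illustration), the result looks like this:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import row_number, monotonically_increasing_id

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 10), ("b", 20), ("c", 30)], ["letter", "value"])

    # Order a window by monotonically_increasing_id() to get a stable 0-based index
    df = df.withColumn(
        "index",
        row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
    )
    df.show()
    # +------+-----+-----+
    # |letter|value|index|
    # +------+-----+-----+
    # |     a|   10|    0|
    # |     b|   20|    1|
    # |     c|   30|    2|
    # +------+-----+-----+

    One caveat: an orderBy() window with no partitionBy() moves all rows into a single partition, so this works fine for small and medium DataFrames but can become a bottleneck on very large ones.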


    I don't want to zip with index and then have to separate the previously separated columns that are now in a single column

    You can use zipWithIndex if you follow it with a call to map, to avoid having all of the separated columns turn into a single column:

    cols = df.columns
    # zipWithIndex yields (Row, index) pairs; put the index first, then unpack the original columns
    df = df.rdd.zipWithIndex().map(lambda row: (row[1],) + tuple(row[0])).toDF(["index"] + cols)
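
    A quick sketch on the same kind of hypothetical DataFrame (column names are illustrative only):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 10), ("b", 20), ("c", 30)], ["letter", "value"])

    cols = df.columns
    df = df.rdd.zipWithIndex().map(lambda row: (row[1],) + tuple(row[0])).toDF(["index"] + cols)
    df.show()
    # +-----+------+-----+
    # |index|letter|value|
    # +-----+------+-----+
    # |    0|     a|   10|
    # |    1|     b|   20|
    # |    2|     c|   30|
    # +-----+------+-----+

    The index is still sequential and 0-based, and unlike the window approach this does not pull everything into one partition, at the cost of a round trip through the RDD API.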
    