Pyspark add sequential and deterministic index to dataframe

梦毁少年i 2020-11-30 13:16

I need to add an index column to a dataframe with three very simple constraints:

  • start from 0

  • be sequential

  • be deterministic

1 Answer
  • 2020-11-30 13:43

    What I mean is: how can I add a column with an ordered, monotonically increasing by 1 sequence 0:df.count? (from comments)

    You can use row_number() here, but for that you'd need to specify an orderBy(). Since you don't have an ordering column, just use monotonically_increasing_id().

    from pyspark.sql.functions import row_number, monotonically_increasing_id
    from pyspark.sql import Window
    
    df = df.withColumn(
        "index",
        row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
    )
    

    Also, row_number() starts at 1, so you'd have to subtract 1 to have it start from 0. The last value will be df.count() - 1.
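
    As a minimal, self-contained sketch (the sample DataFrame and its column names are hypothetical, just for illustration), the result looks like this:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import row_number, monotonically_increasing_id

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 10), ("b", 20), ("c", 30)], ["letter", "value"])

    # Order a window by monotonically_increasing_id() to get a stable 0-based index
    df = df.withColumn(
        "index",
        row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
    )
    df.show()
    # +------+-----+-----+
    # |letter|value|index|
    # +------+-----+-----+
    # |     a|   10|    0|
    # |     b|   20|    1|
    # |     c|   30|    2|
    # +------+-----+-----+

    One caveat: an orderBy() window with no partitionBy() moves all rows into a single partition, so this works fine for small and medium DataFrames but can become a bottleneck on very large ones.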


    I don't want to zip with index and then have to separate the previously separated columns that are now in a single column

    You can use zipWithIndex if you follow it with a call to map, to avoid having all of the separated columns turn into a single column:

    cols = df.columns
    # zipWithIndex yields (Row, index) pairs; put the index first, then unpack the original columns
    df = df.rdd.zipWithIndex().map(lambda row: (row[1],) + tuple(row[0])).toDF(["index"] + cols)
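
    A quick sketch on the same kind of hypothetical DataFrame (column names are illustrative only):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 10), ("b", 20), ("c", 30)], ["letter", "value"])

    cols = df.columns
    df = df.rdd.zipWithIndex().map(lambda row: (row[1],) + tuple(row[0])).toDF(["index"] + cols)
    df.show()
    # +-----+------+-----+
    # |index|letter|value|
    # +-----+------+-----+
    # |    0|     a|   10|
    # |    1|     b|   20|
    # |    2|     c|   30|
    # +-----+------+-----+

    The index is still sequential and 0-based, and unlike the window approach this does not pull everything into one partition, at the cost of a round trip through the RDD API.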
    