Spark DataFrame: How to add an index column (aka distributed data index)

我寻月下人不归 2020-11-27 18:49

I read data from a CSV file, but it doesn't have an index column.

I want to add a column numbered from 1 to the number of rows.

What should I do? Thanks. (Scala)

7 Answers
  • 2020-11-27 18:55

    With Scala you can use:

    import org.apache.spark.sql.functions._

    df.withColumn("id", monotonically_increasing_id())
    

    You can refer to the Scala API docs for monotonically_increasing_id for details.

    With Pyspark you can use:

    from pyspark.sql.functions import monotonically_increasing_id 
    
    df_index = df.select("*").withColumn("id", monotonically_increasing_id())
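
    Note that monotonically_increasing_id gives unique, increasing values, but they are only consecutive within a partition. A minimal Scala sketch (assuming a SparkSession named spark) to see the behaviour:

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // Force more than one partition so the generated ids jump between partitions.
    val demo = spark.range(6).repartition(3)
    demo.withColumn("row_id", monotonically_increasing_id()).show()
    // The ids increase within each partition but are not consecutive across
    // the whole DataFrame (e.g. 0, 1 in one partition and 8589934592,
    // 8589934593 in another).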
    
  • 2020-11-27 19:04

    How to get a sequential id column (1, 2, 3, 4, ..., n):

    from pyspark.sql.functions import row_number, monotonically_increasing_id
    from pyspark.sql.window import Window
    
    df_with_seq_id = df.withColumn('index_column_name', row_number().over(Window.orderBy(monotonically_increasing_id())) - 1)
    

    Note that row_number() starts at 1, so subtract 1 if you want a 0-indexed column.
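
    A Scala sketch of the same approach (assuming df is your dataframe):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

    // Order by a monotonically increasing id to keep the existing row order,
    // then number the rows 0..n-1.
    val w = Window.orderBy(monotonically_increasing_id())
    val dfWithSeqId = df.withColumn("index_column_name", row_number().over(w) - 1)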

  • 2020-11-27 19:04

    As Ram said, zipWithIndex is better than monotonically_increasing_id if you need consecutive row numbers. Try this (PySpark environment):

    from pyspark.sql import Row
    from pyspark.sql.types import StructType, StructField, LongType
    
    new_schema = StructType(original_dataframe.schema.fields[:] + [StructField("index", LongType(), False)])
    zipped_rdd = original_dataframe.rdd.zipWithIndex()
    indexed = zipped_rdd.map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]])).toDF(new_schema)
    

    where original_dataframe is the dataframe you need to add the index to, and row_with_index is the new row schema including the index column, which you can write as

    row_with_index = Row(
        "calendar_date",
        "year_week_number",
        "year_period_number",
        "realization",
        "index",
    )
    

    Here, calendar_date, year_week_number, year_period_number and realization were the columns of my original dataframe. You can replace these names with the names of your own columns. index is the name of the new column added for the row numbers.
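
    Since the question asks for Scala, a similar zipWithIndex sketch (assuming original_dataframe and a SparkSession named spark):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Append an "index" field to the existing schema.
    val newSchema = StructType(original_dataframe.schema.fields :+ StructField("index", LongType, nullable = false))

    // zipWithIndex assigns consecutive 0-based indices across all partitions.
    val indexedRdd = original_dataframe.rdd
      .zipWithIndex()
      .map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }

    val indexed = spark.createDataFrame(indexedRdd, newSchema)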

  • 2020-11-27 19:04

    If you require a unique sequence number for each row, here is a slightly different approach: add a static column and use it to compute the row number with a window function. Note that a window over a constant column pulls all rows into a single partition, so this will not scale well to very large dataframes.

    val srcData = spark.read.option("header","true").csv("/FileStore/sample.csv")
    srcData.show(5)
    
    +--------+--------------------+
    |     Job|                Name|
    +--------+--------------------+
    |Morpheus|       HR Specialist|
    |   Kayla|              Lawyer|
    |  Trisha|          Bus Driver|
    |  Robert|Elementary School...|
    |    Ober|               Judge|
    +--------+--------------------+
    
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{lit, row_number}

    val srcDataModf = srcData.withColumn("sl_no", lit("1"))
    val windowSpecRowNum = Window.partitionBy("sl_no").orderBy("sl_no")

    srcDataModf.withColumn("row_num", row_number().over(windowSpecRowNum)).drop("sl_no").select("row_num", "Name", "Job").show(5)
    
    +-------+--------------------+--------+
    |row_num|                Name|     Job|
    +-------+--------------------+--------+
    |      1|       HR Specialist|Morpheus|
    |      2|              Lawyer|   Kayla|
    |      3|          Bus Driver|  Trisha|
    |      4|Elementary School...|  Robert|
    |      5|               Judge|    Ober|
    +-------+--------------------+--------+
    
  • 2020-11-27 19:08

    For SparkR:

    (assuming sdf is a Spark DataFrame)

    sdf<- withColumn(sdf, "row_id", SparkR:::monotonically_increasing_id())

  • 2020-11-27 19:22

    monotonically_increasing_id - The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.

    "I want to add a column from 1 to row's number."

    Let's say we have the following DF:

    +--------+-------------+-------+
    | userId | productCode | count |
    +--------+-------------+-------+
    |     25 |        6001 |     2 |
    |     11 |        5001 |     8 |
    |     23 |         123 |     5 |
    +--------+-------------+-------+
    

    To generate the IDs starting from 1:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    val w = Window.orderBy("count")
    val result = df.withColumn("index", row_number().over(w))
    

    This would add an index column ordered by increasing value of count.

    +--------+-------------+-------+-------+
    | userId | productCode | count | index |
    +--------+-------------+-------+-------+
    |     25 |        6001 |     2 |     1 |
    |     23 |         123 |     5 |     2 |
    |     11 |        5001 |     8 |     3 |
    +--------+-------------+-------+-------+
    