Spark DataFrame: How to add an index column (aka distributed data index)

我寻月下人不归 2020-11-27 18:49

I read data from a CSV file, but it doesn't have an index.

I want to add a column numbering the rows from 1 to the row count.

What should I do? Thanks. (Scala)

7 Answers
  • 2020-11-27 19:22

    NOTE: The approaches above don't give a sequential number; they only give an increasing, unique id.

    A simple way to do this while preserving the order of the indexes is `zipWithIndex`, as shown below.

    Sample data.

    +-------------------+
    |               Name|
    +-------------------+
    |     Ram Ghadiyaram|
    |        Ravichandra|
    |              ilker|
    |               nick|
    |             Naveed|
    |      Gobinathan SP|
    |Sreenivas Venigalla|
    |     Jackela Kowski|
    |   Arindam Sengupta|
    |            Liangpi|
    |             Omar14|
    |        anshu kumar|
    +-------------------+
    

    package com.example
    
    import org.apache.spark.internal.Logging
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.{LongType, StructField, StructType}
    import org.apache.spark.sql.{DataFrame, Row}
    
    /**
      * DistributedDataIndex : Program to index a DataFrame with zipWithIndex
      */
    object DistributedDataIndex extends App with Logging {
    
      val spark = SparkSession.builder
        .master("local[*]")
        .appName(this.getClass.getName)
        .getOrCreate()
    
      import spark.implicits._
    
      val df = spark.sparkContext.parallelize(
        Seq("Ram Ghadiyaram", "Ravichandra", "ilker", "nick"
          , "Naveed", "Gobinathan SP", "Sreenivas Venigalla", "Jackela Kowski", "Arindam Sengupta", "Liangpi", "Omar14", "anshu kumar"
        )).toDF("Name")
      df.show
      logInfo("addColumnIndex here")
      // Add index now...
      val df1WithIndex = addColumnIndex(df)
        .withColumn("monotonically_increasing_id", monotonically_increasing_id)
      df1WithIndex.show(false)
    
      /**
        * Add an index column to each row of the dataframe
        */
      def addColumnIndex(df: DataFrame) = {
        spark.sqlContext.createDataFrame(
          df.rdd.zipWithIndex.map {
            case (row, index) => Row.fromSeq(row.toSeq :+ index)
          },
          // Create schema for index column
          StructType(df.schema.fields :+ StructField("index", LongType, false)))
      }
    }
    

    Result :

    +-------------------+-----+---------------------------+
    |Name               |index|monotonically_increasing_id|
    +-------------------+-----+---------------------------+
    |Ram Ghadiyaram     |0    |0                          |
    |Ravichandra        |1    |8589934592                 |
    |ilker              |2    |8589934593                 |
    |nick               |3    |17179869184                |
    |Naveed             |4    |25769803776                |
    |Gobinathan SP      |5    |25769803777                |
    |Sreenivas Venigalla|6    |34359738368                |
    |Jackela Kowski     |7    |42949672960                |
    |Arindam Sengupta   |8    |42949672961                |
    |Liangpi            |9    |51539607552                |
    |Omar14             |10   |60129542144                |
    |anshu kumar        |11   |60129542145                |
    +-------------------+-----+---------------------------+
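
    If you'd rather stay in the DataFrame API instead of going through the RDD, a common alternative (a sketch I'm adding, not part of the original answer) is `row_number` over a `Window`. Note the trade-off: a `Window.orderBy` with no `partitionBy` pulls all rows into a single partition, so it doesn't scale the way `zipWithIndex` does.

    ```scala
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    // Sketch: sequential 1-based index via the DataFrame API.
    // An un-partitioned Window moves every row to one partition,
    // so prefer zipWithIndex for large data.
    val dfWithRowNumber = df.withColumn(
      "index",
      row_number().over(Window.orderBy("Name")))
    ```

    `row_number` starts at 1, which matches the question's "1 to row count"; subtract 1 if you want it 0-based like the `zipWithIndex` result above.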
    