DataFrame-ified zipWithIndex

悲哀的现实 2020-11-27 04:23

I am trying to solve the age-old problem of adding a sequence number to a data set. I am working with DataFrames, and there appears to be no DataFrame equivalent of RDD.zipWithIndex.

8 answers
  • 2020-11-27 04:48

    Since Spark 1.6 there is a function called monotonically_increasing_id().
    It generates a new column with a unique 64-bit monotonic index for each row.
    But the ids are not consecutive: each partition starts a new range, so we must calculate each partition's offset before using them, as illustrated in the sketch below.
    Trying to provide an "rdd-free" solution, I ended up with some collect(), but it only collects the offsets, one value per partition, so it will not cause an OOM.
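
    A minimal sketch of the gaps, assuming a local SparkSession named spark (not part of the original answer):

    // monotonically_increasing_id() puts the partition id in the upper bits of the
    // 64-bit value, so each partition starts a new range and the ids are monotonic
    // but not consecutive, e.g. 0, 1, 8589934592, 8589934593, 17179869184, ...
    import org.apache.spark.sql.functions.monotonically_increasing_id

    spark.range(6)
      .repartition(3)
      .withColumn("inc_id", monotonically_increasing_id())
      .show(false)

    The zipWithIndex below closes these gaps by computing one offset per partition: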

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.LongType

    def zipWithIndex(df: DataFrame, offset: Long = 1, indexName: String = "index") = {
        // tag each row with its partition id and a monotonically increasing (but gapped) id
        val dfWithPartitionId = df
            .withColumn("partition_id", spark_partition_id())
            .withColumn("inc_id", monotonically_increasing_id())

        // one value per partition: the amount to add to inc_id so the final index is consecutive
        val partitionOffsets = dfWithPartitionId
            .groupBy("partition_id")
            .agg(count(lit(1)) as "cnt", first("inc_id") as "inc_id")
            .orderBy("partition_id")
            .select(sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") - col("inc_id") + lit(offset) as "cnt")
            .collect()
            .map(_.getLong(0))

        dfWithPartitionId
            .withColumn("partition_offset", udf((partitionId: Int) => partitionOffsets(partitionId), LongType)(col("partition_id")))
            .withColumn(indexName, col("partition_offset") + col("inc_id"))
            .drop("partition_id", "partition_offset", "inc_id")
    }

    This solution doesn't repack the original rows and doesn't repartition the original huge DataFrame, so it is quite fast in the real world: 200 GB of CSV data (43 million rows with 150 columns) was read, indexed and written to Parquet in 2 minutes on 240 cores.
    After testing my solution, I ran Kirk Broadhurst's solution and it was 20 seconds slower.
    You may or may not want to use dfWithPartitionId.cache(), depending on the task.
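
    A usage sketch of the workflow described above (the input/output paths and the SparkSession named spark are hypothetical placeholders):

    // read a large CSV, add a consecutive 1-based "index" column, write to Parquet
    val df      = spark.read.option("header", "true").csv("/path/to/input")
    val indexed = zipWithIndex(df)   // or zipWithIndex(df, offset = 0, indexName = "row_id")
    indexed.write.parquet("/path/to/output")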

  • 2020-11-27 04:54

    Here is my proposal, the advantages of which are:

    • It does not involve any serialization/deserialization[1] of our DataFrame's InternalRows.
    • Its logic is minimal, relying only on RDD.zipWithIndex.

    Its major downsides are:

    • It is impossible to use it directly from non-JVM APIs (PySpark, SparkR).
    • It has to live under the package org.apache.spark.sql.

    Imports and code:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.execution.LogicalRDD
    import org.apache.spark.sql.functions.lit
    
    /**
      * Optimized Spark SQL equivalent of RDD.zipWithIndex.
      *
      * @param df the DataFrame to index
      * @param indexColName the name of the index column to add
      * @return `df` with a column named `indexColName` of consecutive unique ids.
      */
    def zipWithIndex(df: DataFrame, indexColName: String = "index"): DataFrame = {
      import df.sparkSession.implicits._
    
      // prepend a placeholder index column so the InternalRow already has a slot at position 0
      val dfWithIndexCol: DataFrame = df
        .drop(indexColName)
        .select(lit(0L).as(indexColName), $"*")
    
      // zip the untouched InternalRows with their index and write it into the placeholder slot
      val internalRows: RDD[InternalRow] = dfWithIndexCol
        .queryExecution
        .toRdd
        .zipWithIndex()
        .map {
          case (internalRow: InternalRow, index: Long) =>
            internalRow.setLong(0, index)
            internalRow
        }
    
      // rebuild a DataFrame from the indexed InternalRows, keeping the amended schema
      Dataset.ofRows(
        df.sparkSession,
        LogicalRDD(dfWithIndexCol.schema.toAttributes, internalRows)(df.sparkSession)
      )
    }

    [1]: from/to InternalRow's underlying byte array <--> GenericRow's underlying collection of JVM objects, Array[Any].
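
    Because the helper relies on Spark-internal APIs (Dataset.ofRows, LogicalRDD), it must be compiled inside Spark's own package, as noted above. A sketch of one way to package and call it (the object name DataFrameZipWithIndex is a hypothetical choice):

    // file compiled under the org.apache.spark.sql package, as required above
    package org.apache.spark.sql

    object DataFrameZipWithIndex {
      // ... the imports and the zipWithIndex definition shown above go here ...
    }

    // from application code:
    //   import org.apache.spark.sql.DataFrameZipWithIndex.zipWithIndex
    //   val indexed = zipWithIndex(df, "row_id")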
