DataFrame-ified zipWithIndex

悲哀的现实 2020-11-27 04:23

I am trying to solve the age-old problem of adding a sequence number to a data set. I am working with DataFrames, and there appears to be no DataFrame equivalent of RDD.zipWithIndex.

8 answers
  • 2020-11-27 04:48

    Since Spark 1.6 there is a function called monotonically_increasing_id().
    It generates a new column with a unique 64-bit monotonic index for each row.
    But the ids are not consecutive: each partition starts a new range, so we must calculate each partition's offset before using them, as illustrated in the sketch below.
    Trying to provide an "rdd-free" solution, I ended up with some collect(), but it only collects the offsets, one value per partition, so it will not cause an OOM.
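
    A minimal sketch of the gaps, assuming a local SparkSession named spark (not part of the original answer):

    // monotonically_increasing_id() puts the partition id in the upper bits of the
    // 64-bit value, so each partition starts a new range and the ids are monotonic
    // but not consecutive, e.g. 0, 1, 8589934592, 8589934593, 17179869184, ...
    import org.apache.spark.sql.functions.monotonically_increasing_id

    spark.range(6)
      .repartition(3)
      .withColumn("inc_id", monotonically_increasing_id())
      .show(false)

    The zipWithIndex below closes these gaps by computing one offset per partition: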

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.LongType

    def zipWithIndex(df: DataFrame, offset: Long = 1, indexName: String = "index") = {
        // tag each row with its partition id and a monotonically increasing (but gapped) id
        val dfWithPartitionId = df
            .withColumn("partition_id", spark_partition_id())
            .withColumn("inc_id", monotonically_increasing_id())

        // one value per partition: the amount to add to inc_id so the final index is consecutive
        val partitionOffsets = dfWithPartitionId
            .groupBy("partition_id")
            .agg(count(lit(1)) as "cnt", first("inc_id") as "inc_id")
            .orderBy("partition_id")
            .select(sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") - col("inc_id") + lit(offset) as "cnt")
            .collect()
            .map(_.getLong(0))

        dfWithPartitionId
            .withColumn("partition_offset", udf((partitionId: Int) => partitionOffsets(partitionId), LongType)(col("partition_id")))
            .withColumn(indexName, col("partition_offset") + col("inc_id"))
            .drop("partition_id", "partition_offset", "inc_id")
    }

    This solution doesn't repack the original rows and doesn't repartition the original huge DataFrame, so it is quite fast in the real world: 200 GB of CSV data (43 million rows with 150 columns) was read, indexed and written to Parquet in 2 minutes on 240 cores.
    After testing my solution, I ran Kirk Broadhurst's solution and it was 20 seconds slower.
    You may or may not want to use dfWithPartitionId.cache(), depending on the task.
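
    A usage sketch of the workflow described above (the input/output paths and the SparkSession named spark are hypothetical placeholders):

    // read a large CSV, add a consecutive 1-based "index" column, write to Parquet
    val df      = spark.read.option("header", "true").csv("/path/to/input")
    val indexed = zipWithIndex(df)   // or zipWithIndex(df, offset = 0, indexName = "row_id")
    indexed.write.parquet("/path/to/output")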

  • 2020-11-27 04:54

    Here is my proposal, the advantages of which are:

    • It does not involve any serialization/deserialization[1] of our DataFrame's InternalRows.
    • Its logic is minimal, relying only on RDD.zipWithIndex.

    Its major downsides are:

    • It is impossible to use it directly from non-JVM APIs (PySpark, SparkR).
    • It has to live under the package org.apache.spark.sql.

    Imports and code:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.execution.LogicalRDD
    import org.apache.spark.sql.functions.lit
    
    /**
      * Optimized Spark SQL equivalent of RDD.zipWithIndex.
      *
      * @param df the DataFrame to index
      * @param indexColName the name of the index column to add
      * @return `df` with a column named `indexColName` of consecutive unique ids.
      */
    def zipWithIndex(df: DataFrame, indexColName: String = "index"): DataFrame = {
      import df.sparkSession.implicits._
    
      // prepend a placeholder index column so the InternalRow already has a slot at position 0
      val dfWithIndexCol: DataFrame = df
        .drop(indexColName)
        .select(lit(0L).as(indexColName), $"*")
    
      // zip the untouched InternalRows with their index and write it into the placeholder slot
      val internalRows: RDD[InternalRow] = dfWithIndexCol
        .queryExecution
        .toRdd
        .zipWithIndex()
        .map {
          case (internalRow: InternalRow, index: Long) =>
            internalRow.setLong(0, index)
            internalRow
        }
    
      // rebuild a DataFrame from the indexed InternalRows, keeping the amended schema
      Dataset.ofRows(
        df.sparkSession,
        LogicalRDD(dfWithIndexCol.schema.toAttributes, internalRows)(df.sparkSession)
      )
    }

    [1]: from/to InternalRow's underlying byte array <--> GenericRow's underlying collection of JVM objects, Array[Any].
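
    Because the helper relies on Spark-internal APIs (Dataset.ofRows, LogicalRDD), it must be compiled inside Spark's own package, as noted above. A sketch of one way to package and call it (the object name DataFrameZipWithIndex is a hypothetical choice):

    // file compiled under the org.apache.spark.sql package, as required above
    package org.apache.spark.sql

    object DataFrameZipWithIndex {
      // ... the imports and the zipWithIndex definition shown above go here ...
    }

    // from application code:
    //   import org.apache.spark.sql.DataFrameZipWithIndex.zipWithIndex
    //   val indexed = zipWithIndex(df, "row_id")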
