Why does the sortBy transformation trigger a Spark job?

悲&欢浪女 2020-11-30 13:56
As per the Spark documentation, only RDD actions can trigger a Spark job; transformations are evaluated lazily, when an action is called on the RDD.
I see the sor

      
      
        
2 answers

有刺的猬 (original poster) 2020-11-30 14:21

  As per Spark documentation only the action triggers a job in Spark, the transformations are lazily evaluated when an action is called on it.


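In general you're right, but as you've just experienced, there are a few exceptions, and sortBy is among them (along with zipWithIndex). sortBy has to build a RangePartitioner, which samples the data to work out the partition boundaries, and zipWithIndex has to count the elements in each partition up front (when there is more than one partition) to assign the indices. Both run a job as soon as the transformation is called.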

As a matter of fact, this was reported in Spark's JIRA and closed with a Won't Fix resolution. See SPARK-1021, "sortByKey() launches a cluster job when it shouldn't".

You can see the job running with DAGScheduler logging enabled (and later in the web UI):

scala> sc.parallelize(0 to 8).sortBy(identity)
INFO DAGScheduler: Got job 1 (sortBy at <console>:25) with 8 output partitions
INFO DAGScheduler: Final stage: ResultStage 1 (sortBy at <console>:25)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
DEBUG DAGScheduler: submitStage(ResultStage 1)
DEBUG DAGScheduler: missing: List()
INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[4] at sortBy at <console>:25), which has no missing parents
DEBUG DAGScheduler: submitMissingTasks(ResultStage 1)
INFO DAGScheduler: Submitting 8 missing tasks from ResultStage 1 (MapPartitionsRDD[4] at sortBy at <console>:25)
DEBUG DAGScheduler: New pending partitions: Set(0, 1, 5, 2, 6, 3, 7, 4)
INFO DAGScheduler: ResultStage 1 (sortBy at <console>:25) finished in 0.013 s
DEBUG DAGScheduler: After removal of stage 1, remaining stages = 0
INFO DAGScheduler: Job 1 finished: sortBy at <console>:25, took 0.019755 s
res1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at sortBy at <console>:25
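
If you would rather not read scheduler logs, one way to check this is to count started jobs with a SparkListener. What follows is only a minimal sketch, assuming a spark-shell session where sc is already defined. The name jobCount, the choice of 4 partitions and the one-second pause are illustrative assumptions and not part of the original answer.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Hypothetical counter for this sketch: incremented once per started job.
val jobCount = new AtomicInteger(0)

sc.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    jobCount.incrementAndGet()
})

// 4 partitions is an arbitrary choice; zipWithIndex only needs a job
// when the RDD has more than one partition.
val rdd = sc.parallelize(0 to 8, 4)

// No action is called on any of these RDDs.
val mapped  = rdd.map(_ + 1)        // ordinary lazy transformation
val sorted  = rdd.sortBy(identity)  // builds a RangePartitioner, which samples the RDD
val indexed = rdd.zipWithIndex()    // counts elements per partition up front

// Listener events are delivered asynchronously, so give the bus a moment.
Thread.sleep(1000)
println(s"Jobs started without calling any action: ${jobCount.get}")
```

If the counter is above zero even though no action was called, the jobs came from sortBy and zipWithIndex themselves. Commenting out those two lines and re-running in a fresh session should leave it at zero.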

    
             
                                                        
            
            
              
                