Distributed loading of a wide row into Spark from Cassandra


Let's assume we have a Cassandra cluster with RF = N and a table containing wide rows.

Our table could have a primary key something like this: pk / ck1 / ck2 / ...

1 Answer
  • For the sake of future reference, I will explain how I solved this.

    I actually used a slightly different method to the one outlined above, one which does not involve calling Cassandra from inside Spark tasks.

    I started off with ck_list, a list of distinct values of the first clustering key where pk = PK. The exact code for that step is not shown here, but I downloaded this list directly from Cassandra in the Spark driver using CQL; a rough sketch of one way to do this follows.
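
    A minimal sketch of that driver-side CQL call, assuming the CassandraConnector API from the spark-cassandra-connector and reusing the KS, TBL and PK placeholders from the code further down (this is illustrative, not the author's original code):

    import com.datastax.spark.connector.cql.CassandraConnector
    import scala.collection.JavaConverters._

    // Fetch the distinct ck1 values of partition PK on the driver.
    // CQL's SELECT DISTINCT only applies to partition key columns, so we
    // read ck1 for the whole partition and de-duplicate driver-side.
    // Assumes the partition's ck1 values fit comfortably in driver memory.
    val ck_list: Seq[AnyRef] =
      CassandraConnector(sc.getConf).withSessionDo { session =>
        session.execute(s"SELECT ck1 FROM $KS.$TBL WHERE pk = ?", PK)
          .all().asScala
          .map(_.getObject("ck1"))
          .distinct
          .toSeq
      }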

    I then transformed ck_list into a list of RDDs and combined those RDDs (each one representing a Cassandra row slice) into one unified RDD (wide_row).

    The cast on the CassandraRDD is necessary because union returns a plain org.apache.spark.rdd.RDD, so without it the types in the reduce would not line up.

    After running the job I was able to verify that wide_row had x partitions, where x is the size of ck_list. A useful side effect is that wide_row is partitioned by the first clustering key, which is also the key I want to reduce by, so even more shuffling is avoided (see the aggregation sketch after the code below).

    I don't know if this is the best way to achieve what I wanted, but it certainly works.

    // ck_list: the distinct ck1 values where pk = PK (fetched on the driver as sketched above)

    import com.datastax.spark.connector._   // brings in sc.cassandraTable and CassandraRow

    val wide_row: org.apache.spark.rdd.RDD[CassandraRow] = ck_list.map( ck =>
      sc.cassandraTable(KS, TBL)
        .select("c1", "c2").where("pk = ? and ck1 = ?", PK, ck)
        .asInstanceOf[org.apache.spark.rdd.RDD[CassandraRow]]   // union returns a plain RDD, so unify the type here
    ).reduce( (x, y) => x.union(y) )
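
    To illustrate the partition count and the per-ck1 reduction mentioned above, here is a hypothetical follow-up (not part of the original answer). It assumes ck1 is added to the .select(...) so it is available on each row, that c1 holds a numeric value worth aggregating, and Spark 1.3+ so the pair-RDD implicits are in scope:

    // Each slice of wide_row holds exactly one ck1 value, so the map-side
    // combine in reduceByKey collapses every key locally before any data
    // crosses the network.
    println(wide_row.partitions.length)   // expected: ck_list.size

    val per_ck_totals = wide_row
      .map(row => (row.getString("ck1"), row.getDouble("c1")))   // assumes ck1 is in the .select(...)
      .reduceByKey(_ + _)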
    