How to partition Spark RDD when importing Postgres using JDBC?

情歌与酒 2020-12-17 02:06
I am importing a Postgres database into Spark. I know that I can partition on import, but that requires that I have a numeric column (I don't want to use the value

      
      
        
1 answer

醉梦人生 (original poster) 2020-12-17 02:44
              

            
            
                        
Since you already know you can partition by a numeric column, this is probably what you should do. Here is the trick: convert the timestamp to a numeric epoch value. First, let's find the minimum and maximum epoch:

url = ...
properties = ...

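# Subquery that returns the smallest and largest timestamp as epoch seconds
min_max_query = """(
    SELECT
        CAST(min(extract(epoch FROM timestamp)) AS bigint) AS min_epoch,
        CAST(max(extract(epoch FROM timestamp)) AS bigint) AS max_epoch
    FROM tablename
) tmp"""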

min_epoch, max_epoch = spark.read.jdbc(
    url=url, table=min_max_query, properties=properties
).first()


Then use these bounds to query the table:

numPartitions = ...

query = """(
    SELECT *, CAST(extract(epoch FROM timestamp) AS bigint) AS epoch
    FROM tablename) AS tmp"""

# The helper column "epoch" drives the partitioning and is dropped afterwards
spark.read.jdbc(
    url=url, table=query,
    lowerBound=min_epoch, upperBound=max_epoch + 1,
    column="epoch", numPartitions=numPartitions, properties=properties
).drop("epoch")


Since this splits the data into ranges of equal width, it is relatively sensitive to data skew: if most rows fall into a narrow time window, a few partitions will hold most of the data. Use it with caution.
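
To make the equal-width behaviour concrete, here is a minimal sketch in plain Python that approximates the WHERE clauses a range-partitioned JDBC read issues, one per partition. The function name jdbc_partition_clauses is hypothetical, the stride arithmetic is a simplified model (the exact computation varies between Spark versions), and the bounds are assumed to be non-negative integers.

```python
def jdbc_partition_clauses(column, lower_bound, upper_bound, num_partitions):
    """Approximate the per-partition WHERE clauses of a range-partitioned read.

    Simplified model for illustration only; not Spark's actual implementation.
    """
    if num_partitions <= 1 or lower_bound >= upper_bound:
        # Assumption for this sketch: a single partition reads everything.
        return ["1 = 1"]

    # Every partition covers a range of the same width (the stride).
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    clauses = []
    current = lower_bound
    for i in range(num_partitions):
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if lower and upper:
            clauses.append(f"{lower} AND {upper}")
        elif upper:
            # First partition: open below, and it also collects NULLs.
            clauses.append(f"{upper} OR {column} IS NULL")
        else:
            # Last partition: open above.
            clauses.append(lower)
    return clauses


for clause in jdbc_partition_clauses("epoch", 0, 100, 4):
    print(clause)
```

Note that the first and last clauses are open-ended: the bounds only determine the stride and do not filter rows, so values outside the range still land in the first or last partition.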

You could also provide a list of disjoint predicates as the predicates argument. Each predicate becomes the WHERE clause of one partition:

# Half-open ranges, so values such as 'cat' are not lost between 'c' and 'd'
predicates = [
    "id >= 'a' AND id < 'd'",
    "id >= 'd' AND id < 'h'",
    ...   # Continue to get full coverage and the desired number of predicates
]

spark.read.jdbc(
    url=url, table="tablename", properties=properties, 
    predicates=predicates
)


The latter approach is much more flexible and can address certain issues with non-uniform data distribution, but it requires more knowledge about the data.
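
One possible way to build such a list is to derive it from a sorted list of split points, which guarantees that the predicates are disjoint and cover every row. The sketch below is plain Python; the helper name build_range_predicates and the sample boundaries are hypothetical, and it assumes a string column whose ordering in the database matches the order of the boundaries you pass in (collation can differ from Python's ordering).

```python
def build_range_predicates(column, boundaries):
    """Build disjoint, gap-free SQL predicates from sorted split points.

    N boundaries yield N + 1 predicates. Hypothetical helper for illustration.
    """
    if not boundaries:
        raise ValueError("at least one boundary is required")
    if list(boundaries) != sorted(set(boundaries)):
        raise ValueError("boundaries must be sorted and distinct")

    # Quote as SQL string literals, doubling any embedded single quotes.
    quoted = ["'" + b.replace("'", "''") + "'" for b in boundaries]

    # First range is open below and also picks up NULL values.
    predicates = [f"{column} < {quoted[0]} OR {column} IS NULL"]
    # Middle ranges are half-open: [low, high).
    for low, high in zip(quoted, quoted[1:]):
        predicates.append(f"{column} >= {low} AND {column} < {high}")
    # Last range is open above.
    predicates.append(f"{column} >= {quoted[-1]}")
    return predicates


predicates = build_range_predicates("id", ["d", "h", "p"])
for predicate in predicates:
    print(predicate)
```

The resulting list can be passed as predicates=predicates to spark.read.jdbc as shown above. Choosing the split points from approximate quantiles of the column, rather than evenly spaced letters, is what helps against skew.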

Using partitionBy fetches the data first and then performs a full shuffle to get the desired number of partitions, so it is relatively expensive.
    
             
                                                        
            
            
              
                