Pseudocolumn in Spark JDBC


I am using a query to fetch data from MySQL as follows:

var df = spark.read.format("jdbc")
         .option("url", "jdbc:mysql://10.0.0.192:3306/retai


                      
2 Answers

故里飘歌 answered 2020-12-04 03:17

As per Spark's official documentation, the partitionColumn can be any numeric column (not necessarily a primary key column).


  partitionColumn must be a numeric column from the table in question.


Reference: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
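
As a rough illustration of the point above, here is a minimal sketch of the options a partitioned JDBC read takes when the partition column is an ordinary numeric column rather than a primary key. The URL, table name, column name, and bounds are all hypothetical placeholders and do not come from the question.

```scala
object JdbcPartitionOptions {
  // All values are hypothetical placeholders for illustration only.
  val options: Map[String, String] = Map(
    "url"             -> "jdbc:mysql://db.example.com:3306/shop",
    "dbtable"         -> "orders",
    "partitionColumn" -> "customer_id", // numeric column, not a primary key
    "lowerBound"      -> "1",
    "upperBound"      -> "100000",
    "numPartitions"   -> "8"
  )

  // With a SparkSession named `spark` in scope, the map would be passed as:
  //   val df = spark.read.format("jdbc").options(options).load()
}
```

The four partitioning options (partitionColumn, lowerBound, upperBound, numPartitions) have to be given together; the bounds only decide the partition stride and do not filter rows.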
                                                                        
                                                        
            
            
              
                
予麋鹿 answered 2020-12-04 03:28


  can I use a pseudo column (like ROWNUM in Oracle or RRN(employeeno) in DB2) 


TL;DR: Probably not.

While Spark doesn't consider constraints like PRIMARY KEY or UNIQUE, there is a very important requirement for partitionColumn that is not explicitly stated in the documentation: it has to be deterministic.

Each executor fetches its own piece of data using a separate transaction. If the numeric column is not deterministic (stable, preserved between transactions), the state of the data seen by Spark might be inconsistent, and records might be duplicated or skipped.
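
The effect can be imitated without Spark or a database. The sketch below is a toy model, not Spark's implementation: two simulated partition queries each select a key range, once over a stable key and once over a row number that is reassigned per query under a different ordering. The data, the ranges, and the reversed ordering are assumptions chosen only to make the failure visible.

```scala
object VolatileKeyDemo {
  // Hypothetical table contents.
  val table: Vector[String] = Vector("a", "b", "c", "d")

  // Two half-open key ranges [lo, hi), one per simulated partition query.
  val ranges: Seq[(Int, Int)] = Seq((1, 3), (3, 5))

  // Numbers the rows 1..n in the order given and keeps those inside the range.
  private def fetch(order: Vector[String], range: (Int, Int)): Seq[String] = {
    val (lo, hi) = range
    order.zipWithIndex.collect { case (row, i) if i + 1 >= lo && i + 1 < hi => row }
  }

  // Stable key: every query sees the same row-to-number mapping.
  val stableResult: Seq[String] = ranges.flatMap(r => fetch(table, r))

  // Volatile key: the second query sees the rows in a different order,
  // so the same numbers now point at different rows.
  val volatileResult: Seq[String] =
    fetch(table, ranges(0)) ++ fetch(table.reverse, ranges(1))
}
```

In this model the stable key returns every row exactly once, while the volatile key returns "a" and "b" twice and never returns "c" or "d".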

Because ROWNUM implementations are usually volatile (they depend on unstable ordering and can be affected by features like indexing), they are not a safe choice for partitionColumn. For the same reason, you cannot use random numbers.

Also, some vendors might further limit allowed operations on pseudocolumns, making them unsuitable for use as a partitioning column. For example, the Oracle documentation says this about ROWNUM:


  Conditions testing for ROWNUM values greater than a positive integer are always false.


Range predicates on ROWNUM might therefore fail silently, leading to incorrect results.
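
To see why such a condition returns nothing instead of raising an error, here is a simplified model of ROWNUM-style numbering written in plain Scala. It assumes the counter is offered to each candidate row before the predicate is checked and only advances when a row is accepted; the `select` helper and the sample rows are hypothetical.

```scala
object RownumModel {
  // Simplified model: the current counter value is tested against the predicate
  // for each candidate row, and the counter advances only when a row is accepted.
  def select(rows: Seq[String])(predicate: Int => Boolean): Seq[String] = {
    var rownum = 1
    val accepted = Seq.newBuilder[String]
    for (row <- rows) {
      if (predicate(rownum)) {
        accepted += row
        rownum += 1
      }
    }
    accepted.result()
  }

  val rows: Seq[String] = Seq("a", "b", "c", "d") // hypothetical data

  // Behaves as expected: the first two rows are returned.
  val firstTwo: Seq[String] = select(rows)(_ <= 2)

  // A lower-bounded range: the counter never gets past 1, so nothing matches.
  val secondRange: Seq[String] = select(rows)(n => n >= 3 && n < 5)
}
```

Under this model a partition whose range starts above 1 comes back empty without any error, which is the silent failure described above.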




  can we specify a partition column which is not a primary key 


Yes, as long as it satisfies the criteria described above.
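
For intuition on why determinism matters more than key constraints, the sketch below shows one way bounds and a partition count could be turned into non-overlapping range predicates, one per partition. It is a simplified approximation and not Spark's actual code; the `predicates` helper and the column name `customer_id` are hypothetical.

```scala
object RangePredicates {
  // Splits [lower, upper) into n strides. The first and last ranges are left
  // open-ended so rows outside the bounds (and NULLs) are still read once.
  def predicates(column: String, lower: Long, upper: Long, n: Int): Seq[String] = {
    require(n >= 2 && lower < upper, "need at least two partitions and lower < upper")
    val stride = math.max(1L, (upper - lower) / n)
    val cuts = (1 until n).map(i => lower + i * stride)
    val first = s"$column < ${cuts.head} OR $column IS NULL"
    val middle = cuts.sliding(2).collect {
      case Seq(a, b) => s"$column >= $a AND $column < $b"
    }.toSeq
    val last = s"$column >= ${cuts.last}"
    (first +: middle) :+ last
  }
}
```

For example, `RangePredicates.predicates("customer_id", 0L, 100L, 4)` yields four clauses with cut points at 25, 50, and 75. Each clause would run as its own query, so a row is read exactly once only if its column value is the same in every query.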
                                                                        
                                                        
            
            
              
                