Querying on multiple Hive stores using Apache Spark

Asked 2020-12-24 11:46

I have a Spark application that successfully connects to Hive and queries Hive tables using the Spark engine.

To build this, I just added hive-site.xml to the application's classpath. How can I do the same against multiple, different Hive stores?

2 Answers
  • 2020-12-24 12:29

    This doesn't seem to be possible in the current version of Spark. Reading the HiveContext code in the Spark repo, hive.metastore.uris is configurable with multiple URIs, but it appears to be used only for redundancy across the same metastore, not for entirely different metastores.

    More information here https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin

    But you will probably have to aggregate the data somewhere in order to work on it in unison, or create a separate Spark context for each store.

    You could try configuring hive.metastore.uris with multiple different metastores, but it probably won't work. If you do decide to create multiple Spark contexts, one per store, then make sure you set spark.driver.allowMultipleContexts to true, but this is generally discouraged and may lead to unexpected results.
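    For illustration only (this route is discouraged, as noted above), a rough, untested sketch of the multiple-contexts idea might look like the following. It assumes Spark 1.x, where spark.driver.allowMultipleContexts still exists (it was removed in later releases); the metastore hostnames and table names are placeholders:

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object MultiMetastoreSketch {
      def main(args: Array[String]): Unit = {
        // One SparkContext + HiveContext per metastore.
        val conf1 = new SparkConf()
          .setAppName("store1").setMaster("local")
          .set("spark.driver.allowMultipleContexts", "true")
        val sc1 = new SparkContext(conf1)
        val hive1 = new HiveContext(sc1)
        // Point this context at the first metastore (hypothetical host).
        hive1.setConf("hive.metastore.uris", "thrift://metastore-host1:9083")

        val conf2 = new SparkConf()
          .setAppName("store2").setMaster("local")
          .set("spark.driver.allowMultipleContexts", "true")
        val sc2 = new SparkContext(conf2)
        val hive2 = new HiveContext(sc2)
        // Point this context at the second metastore (hypothetical host).
        hive2.setConf("hive.metastore.uris", "thrift://metastore-host2:9083")

        val df1 = hive1.sql("SELECT * FROM db.table1")
        val df2 = hive2.sql("SELECT * FROM db.table2")

        // DataFrames from different contexts cannot be joined directly;
        // collect or write each out and combine the results externally.
      }
    }
    ```

    Even if both contexts start, the data still has to be materialized somewhere common before it can be combined, which is why aggregating it in one place is usually the simpler option.
    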

  • 2020-12-24 12:33

    I think this is possible by using Spark SQL's ability to connect to and read data from remote databases over JDBC.

    After extensive trial and error, I was able to connect to two different Hive environments over JDBC and load the Hive tables into Spark as DataFrames for further processing.

    Environment details

    hadoop-2.6.0

    apache-hive-2.0.0-bin

    spark-1.3.1-bin-hadoop2.6

    Code sample (HiveMultiEnvironment.scala):

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    object HiveMultiEnvironment {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("JDBC").setMaster("local")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)

        // Load a Hive table (or sub-query) from environment 1 over JDBC
        val jdbcDF1 = sqlContext.load("jdbc", Map(
          "url" -> "jdbc:hive2://<host1>:10000/<db>",
          "dbtable" -> "<db.tablename or subquery>",
          "driver" -> "org.apache.hive.jdbc.HiveDriver",
          "user" -> "<username>",
          "password" -> "<password>"))
        jdbcDF1.collect().foreach(println)

        // Load a Hive table (or sub-query) from environment 2 over JDBC
        val jdbcDF2 = sqlContext.load("jdbc", Map(
          "url" -> "jdbc:hive2://<host2>:10000/<db>",
          "dbtable" -> "<db.tablename or subquery>",
          "driver" -> "org.apache.hive.jdbc.HiveDriver",
          "user" -> "<username>",
          "password" -> "<password>"))
        jdbcDF2.collect().foreach(println)

        // TODO: business logic combining the two DataFrames
      }
    }
    

    Other parameters, such as partitionColumn, can also be set during the load via SQLContext. Details are in the 'JDBC To Other Databases' section of the Spark SQL programming guide: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
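    As a sketch of what that looks like (untested; column name and bounds are placeholders), the partitioning options take a numeric column plus lower/upper bounds, and Spark splits the read into numPartitions parallel JDBC queries over ranges of that column:

    ```scala
    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    object PartitionedJdbcLoad {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("jdbc-partitioned").setMaster("local"))
        val sqlContext = new SQLContext(sc)

        // Each of the numPartitions tasks issues its own JDBC query
        // covering a slice of [lowerBound, upperBound] on partitionColumn.
        val df = sqlContext.load("jdbc", Map(
          "url" -> "jdbc:hive2://<host1>:10000/<db>",
          "dbtable" -> "<db.tablename>",
          "driver" -> "org.apache.hive.jdbc.HiveDriver",
          "partitionColumn" -> "id",   // hypothetical numeric column
          "lowerBound" -> "1",
          "upperBound" -> "100000",
          "numPartitions" -> "4"))

        println(df.rdd.partitions.length)
      }
    }
    ```

    Note that the bounds only control how the read is split, not which rows are returned; rows outside the range still end up in the first or last partition.
    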

    Build path from Eclipse (screenshot omitted).

    What I Haven't Tried

    Using a HiveContext for environment 1 and an SQLContext for environment 2.

    Hope this will be useful.
