How does computing table stats in Hive or Impala speed up queries in Spark SQL?

心在旅途 2020-12-28 22:48

For increasing performance (e.g. for joins) it is recommended to compute table statistics first.

In Hive I can do:

analyze table  c         
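
For reference, the full form of such a Hive statement is roughly the following (the table name below is just a placeholder):

-- Hive: gather table-level statistics (my_table is a placeholder name)
ANALYZE TABLE my_table COMPUTE STATISTICS;
-- optionally, column-level statistics as well:
ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS;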
3 answers
  • 2020-12-28 23:10

    This is about the upcoming Spark 2.3.0 (although some of the features may have already been released in 2.2.1 or earlier).

    Does my spark application (reading from hive-tables) also benefit from pre-computed statistics?

    It could, if Impala or Hive recorded the table statistics (e.g. table size or row count) in the Hive metastore, in table metadata that Spark can read (and translate to its own Spark statistics for query planning).

    You can easily check it by using the DESCRIBE EXTENDED SQL command in spark-shell.

    scala> spark.version
    res0: String = 2.4.0-SNAPSHOT
    
    scala> sql("DESC EXTENDED t1 id").show
    +--------------+----------+
    |info_name     |info_value|
    +--------------+----------+
    |col_name      |id        |
    |data_type     |int       |
    |comment       |NULL      |
    |min           |0         |
    |max           |1         |
    |num_nulls     |0         |
    |distinct_count|2         |
    |avg_col_len   |4         |
    |max_col_len   |4         |
    |histogram     |NULL      |
    +--------------+----------+
    

    ANALYZE TABLE ... COMPUTE STATISTICS NOSCAN computes one statistic that Spark uses, i.e. the total size of a table (with no row count metric due to the NOSCAN option). If Impala or Hive recorded it to a "proper" location, Spark SQL would show it in DESC EXTENDED.

    Use DESC EXTENDED tableName for table-level statistics and see if you find the ones that were generated by Impala or Hive. If they are in DESC EXTENDED's output, they will be used for optimizing joins (and, with cost-based optimization turned on, also for aggregations and filters).
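
    As an illustration (a sketch only; t1 is the table from the snippet above, and the exact output layout differs between Spark versions), the table-level check is:

    -- Table-level statistics for the table used above
    DESC EXTENDED t1;
    -- In the output, look for a row along the lines of:
    --   Statistics    1234 bytes, 10 rows
    -- A size without a row count usually means a NOSCAN-style analyze
    -- (or a size derived from the files on storage).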


    Column statistics are stored (in a Spark-specific serialized format) in table properties, and I really doubt that Impala or Hive could compute the stats and store them in that Spark SQL-compatible format.
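
    If you want to see how that looks, one sketch (the property key names below are my assumption about how Spark 2.x serializes its stats, not something Impala or Hive would write) is to dump the table properties:

    -- Inspect the raw table properties where Spark SQL keeps its own statistics
    SHOW TBLPROPERTIES t1;
    -- Assumed Spark-generated key names look like:
    --   spark.sql.statistics.totalSize
    --   spark.sql.statistics.numRows
    --   spark.sql.statistics.colStats.id.min, .max, .distinctCount, ...
    -- Hive and Impala record their stats as metastore table parameters instead
    -- (e.g. numRows, totalSize), which is why Spark translates rather than reuses them.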

  • 2020-12-28 23:24

    I am assuming you are using Hive on Spark, or Spark SQL with a Hive context. If that is the case, you should run ANALYZE in Hive.

    ANALYZE TABLE <...> typically needs to run after the table is created or after significant inserts/changes. You can do this at the end of your load step itself, if it is an MR or Spark job.
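
    A minimal sketch of what the end of such a load step could run (the database, table, and partition names are placeholders):

    -- Last step of the load job: refresh statistics for the partition just loaded
    ANALYZE TABLE my_db.my_table PARTITION (ds='2020-12-28') COMPUTE STATISTICS;
    -- or, for the whole table, without the PARTITION clause:
    ANALYZE TABLE my_db.my_table COMPUTE STATISTICS;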

    At the time of analysis, if you are using Hive on Spark, please also use the configurations in the link below. You can set them at the session level for each query. I have used the parameters in this link https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started in production and they work fine.
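
    At the session level that roughly looks like the sketch below; the values are placeholders, and the authoritative list of properties is on the linked page:

    -- Illustrative session-level settings for Hive on Spark (values are placeholders)
    SET hive.execution.engine=spark;
    SET spark.executor.memory=4g;
    SET spark.executor.cores=2;
    SET spark.driver.memory=2g;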

  • 2020-12-28 23:29

    From what I understand, COMPUTE STATS in Impala is the latest implementation and frees you from tuning Hive settings.

    From official doc:

    If you use the Hive-based methods of gathering statistics, see the Hive wiki for information about the required configuration on the Hive side. Cloudera recommends using the Impala COMPUTE STATS statement to avoid potential configuration and scalability issues with the statistics-gathering process.

    If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS, Impala can only use the resulting column statistics if the table is unpartitioned. Impala cannot use Hive-generated column statistics for a partitioned table.

    Useful link: https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_perf_stats.html
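
    For reference, a minimal Impala-side sketch (the table name is a placeholder) looks like this:

    -- Gather table- and column-level statistics in one statement (run in impala-shell)
    COMPUTE STATS my_table;
    -- For large partitioned tables, incremental stats avoid rescanning old partitions:
    COMPUTE INCREMENTAL STATS my_table;
    -- Verify what Impala recorded:
    SHOW TABLE STATS my_table;
    SHOW COLUMN STATS my_table;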
