This is about the upcoming Spark 2.3.0 (perhaps some of the features have already been released in 2.2.1 or earlier).
Does my Spark application (reading from Hive tables) also benefit from pre-computed statistics?
It could if Impala or Hive recorded the table statistics (e.g. table size or row count) in a Hive metastore in the table metadata that Spark can read from (and translate to its own Spark statistics for query planning).
You can easily check it out by using the DESCRIBE EXTENDED SQL command in spark-shell.
scala> spark.version
res0: String = 2.4.0-SNAPSHOT
scala> sql("DESC EXTENDED t1 id").show
+--------------+----------+
|     info_name|info_value|
+--------------+----------+
|      col_name|        id|
|     data_type|       int|
|       comment|      NULL|
|           min|         0|
|           max|         1|
|     num_nulls|         0|
|distinct_count|         2|
|   avg_col_len|         4|
|   max_col_len|         4|
|     histogram|      NULL|
+--------------+----------+
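For completeness, this is roughly how such column-level statistics get computed on the Spark side (a sketch against the t1 table above; note that the histogram row stays NULL unless equi-height histograms are explicitly enabled, which is a Spark 2.3 option):

// Equi-height histograms are off by default; this flag (Spark 2.3+) turns them on
spark.conf.set("spark.sql.statistics.histogram.enabled", true)

// Compute column-level statistics (min, max, num_nulls, distinct_count, ...) for id
sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id")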
ANALYZE TABLE tableName COMPUTE STATISTICS NOSCAN computes one statistic that Spark uses, i.e. the total size of a table (with no row count metric due to the NOSCAN option). If Impala or Hive recorded it to a "proper" location, Spark SQL would show it in DESC EXTENDED.
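A quick way to see that in spark-shell (a sketch, reusing the t1 table from above; the table-level Statistics row carries the size in bytes, plus a row count once a full scan has been done):

// NOSCAN records only the total table size (no row count)
sql("ANALYZE TABLE t1 COMPUTE STATISTICS NOSCAN")

// Table-level statistics show up as a "Statistics" row, e.g. "1234 bytes"
sql("DESC EXTENDED t1").where("col_name = 'Statistics'").show(false)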
Use DESC EXTENDED tableName for table-level statistics and see if you find the ones that were generated by Impala or Hive. If they are in DESC EXTENDED's output, they will be used for optimizing joins (and, with cost-based optimization turned on, also for aggregations and filters).
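To double-check that the optimizer actually sees them, you can look at the statistics of the optimized plan (a sketch; stats with no argument is the Spark 2.3+ API, in 2.2 it took a conf parameter):

// With cost-based optimization on, row counts and column stats feed into join planning
spark.conf.set("spark.sql.cbo.enabled", true)

// The statistics the optimizer works with for the table
spark.table("t1").queryExecution.optimizedPlan.stats
// e.g. Statistics(sizeInBytes=..., rowCount=..., ...)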
Column statistics are stored (in a Spark-specific serialized format) in table properties and I really doubt that Impala or Hive could compute the stats and store them in the Spark SQL-compatible format.
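If you want to see where they end up, the table properties can be listed from Spark SQL too (a sketch; the spark.sql.statistics.* key prefix is an internal detail that may change between versions):

// Column stats are serialized into metastore table properties
// (keys under the internal spark.sql.statistics.* prefix as of Spark 2.x)
sql("SHOW TBLPROPERTIES t1").where("key LIKE 'spark.sql.statistics%'").show(false)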