Question
We upgraded our Hadoop platform (Spark: 2.3.0, Hive: 3.1), and I'm facing this exception when reading some Hive tables in Spark: "Number of partitions scanned on table 'my_table' exceeds limit (=4000)".
Tables we are working on:
table1: external table with a total of ~12300 partitions, partitioned by (col1: String, date1: String), stored as ORC compressed with ZLIB
table2: external table with a total of 4585 partitions, partitioned by (col21: String, date2: Date, col22: String), stored as ORC, uncompressed
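For illustration, table2's layout is roughly the following (a sketch, not the exact DDL: the data column and the location are placeholders, only the partition spec matters here):

    // Rough shape of table2 (hypothetical data column and location;
    // the relevant detail is the DATE-typed partition column date2):
    spark.sql("""
      CREATE EXTERNAL TABLE table2 (some_col STRING)
      PARTITIONED BY (col21 STRING, date2 DATE, col22 STRING)
      STORED AS ORC
      LOCATION '/path/to/table2'
    """)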
[A] Note that we had set this Spark conf: --conf "spark.hadoop.metastore.catalog.default=hive"
We execute in Spark:
[1] spark.sql("select * from table1 where col1 = 'value1' and date1 = '2020-06-03'").count
=> Error: Number of partitions scanned (=12300) on table 'table1' exceeds limit (=4000)
[2] spark.sql("select * from table2 where col21 = 'value21' and col22 = 'value22'").count
[3] spark.sql("select * from table2 where col21 = 'value21' and date2 = '2020-06-03' and col22 = 'value22'").count
=> Error on [2] and [3]: Number of partitions scanned (=4585) on table 'table2' exceeds limit (=4000)
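To see whether the partition predicates reach the scan at all, one can print the plan before triggering the action (same query as [1]); this is just a diagnostic sketch:

    // Print the logical/physical plans without running the job; with pruning
    // working, the scan node should show partition filters on col1 and date1.
    val df1 = spark.sql("select * from table1 where col1 = 'value1' and date1 = '2020-06-03'")
    df1.explain(true)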
[B] We solved the problem by adding this Spark conf:
--conf "spark.sql.hive.convertMetastoreOrc=false"
with the result that --conf "spark.sql.hive.metastorePartitionPruning=true" is automatically activated.
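Equivalently (assuming these two are session-level confs that can be toggled at runtime, which seems to be the case), they can be set from an already-running session:

    // Same as setup [B], but set from within the session instead of
    // at submit time:
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
    spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")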
Re-executing in Spark:
[1] and [2] => Success
[3] => Error: Number of partitions scanned (=4585) on table 'table2' exceeds limit (=4000)
[C] To solve the error on [3], we set
--conf "spark.sql.hive.convertMetastoreOrc=false"
--conf "spark.sql.hive.metastorePartitionPruning=false"
Re-executing in Spark:
[3] => Success
On the other hand, if we re-run [1], performance is degraded: it takes much more time to execute, and we don't want that.
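To quantify the degradation of [1] under setup [C], we simply time the action; a sketch:

    // spark.time prints the wall-clock time of the enclosed action; under [C],
    // presumably Spark fetches all ~12300 partitions of table1 from the
    // metastore and filters them client-side, which is where the time goes.
    spark.time(
      spark.sql("select * from table1 where col1 = 'value1' and date1 = '2020-06-03'").count
    )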
In conclusion:
In case [B], we think that a partition column cannot be of type Date; when it is a String, it is OK.
But why? What's going on? Aren't partition columns of types other than String supposed to be supported when partition pruning is activated?
Why does it work in case [C]? And how could we solve case [B][3] without degrading the performance of [1]?
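For completeness, the only stopgap we see so far is toggling the pruning conf per query (assuming it is safe to flip at session level), which feels like a workaround rather than a fix:

    // Workaround sketch: disable metastore-side pruning only around [3],
    // then re-enable it so that [1] keeps its fast path.
    spark.conf.set("spark.sql.hive.metastorePartitionPruning", "false")
    spark.sql("select * from table2 where col21 = 'value21' and date2 = '2020-06-03' and col22 = 'value22'").count
    spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")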
Hoping this is clear; please let me know if you need any other information!
Thank you for any help or advice!
Source: https://stackoverflow.com/questions/62180078/spark-hive-number-of-partitions-scanned-exceeds-limit-4000