I have been looking into how Spark stores statistics (min/max) in Parquet, and how it uses that information for query optimization. I have a few questions. First, the setup: Sp
On the first question: I believe this is a matter of definition (what would the min/max of a string be? lexical ordering?), but in any case, as far as I know, Spark's Parquet support currently only records statistics for numeric columns.
As for the second question, if you look deeper you will see that Spark is not loading the files themselves. Instead it reads the metadata, so it knows whether to read a block or not. In other words, it pushes the predicate down to the file (block) level; the sketch below shows how to see this in the query plan.
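As an illustration, here is a minimal sketch (the output path and filter value are made up for the example) that writes a small Parquet dataset and then reads it back with a filter. The Parquet scan node in the physical plan lists the predicate under PushedFilters, and the reader compares it against the per-row-group min/max in the footer to skip blocks that cannot match.

```scala
import org.apache.spark.sql.SparkSession

object PushdownDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-pushdown-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Write a small Parquet dataset; each row group in the files carries min/max stats.
    spark.range(0, 1000000).toDF("id")
      .write.mode("overwrite").parquet("/tmp/pushdown_demo")

    // Read it back with a filter and inspect the physical plan. The scan node's
    // "PushedFilters" entry shows the predicate handed to the Parquet reader,
    // which uses the footer's min/max to decide which blocks to actually read.
    spark.read.parquet("/tmp/pushdown_demo")
      .filter($"id" > 999000L)
      .explain()

    spark.stop()
  }
}
```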
PARQUET-686 made changes to intentionally ignore statistics on binary fields, since the min/max values for those fields had been written using signed byte ordering and can therefore be misleading. You can override this behavior by setting parquet.strings.signed-min-max.enabled to true.
After setting that config, you can read the min/max of binary fields with parquet-tools; a sketch for doing the equivalent from Spark follows below.
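If you are reading the files through Spark rather than parquet-tools, a minimal sketch along these lines sets the same property on the Hadoop configuration that Spark passes to the Parquet reader. Whether your particular Spark/Parquet combination actually honours this reader-side property is an assumption you should verify; the input path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("signed-minmax-demo")
  .master("local[*]")
  .getOrCreate()

// Ask parquet-mr to trust the (signed) min/max written for binary columns.
// The property is picked up from the Hadoop configuration used by the reader.
spark.sparkContext.hadoopConfiguration
  .set("parquet.strings.signed-min-max.enabled", "true")

// Subsequent reads can then use the binary column statistics for filtering.
val df = spark.read.parquet("/path/to/data.parquet")
```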
More details are in another Stack Overflow question of mine.
This has been resolved in Spark 2.4.0, which upgraded Parquet from 1.8.2 to 1.10.0:
[SPARK-23972] Update Parquet from 1.8.2 to 1.10.0
With this version, columns of all types, whether Int, String, or Decimal, will contain min/max statistics.
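If you want to verify this yourself, a small sketch like the one below uses the parquet-hadoop API directly (outside Spark) to print the per-row-group statistics recorded in the footer. The file path is a placeholder, and the collection-converter import depends on your Scala version.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

import scala.collection.JavaConverters._ // scala.jdk.CollectionConverters._ on Scala 2.13+

object PrintParquetStats {
  def main(args: Array[String]): Unit = {
    // Point this at one of the part files written by Spark (placeholder path).
    val file = HadoopInputFile.fromPath(
      new Path("/tmp/pushdown_demo/part-00000.parquet"), new Configuration())
    val reader = ParquetFileReader.open(file)
    try {
      // Every block (row group) in the footer carries per-column statistics.
      reader.getFooter.getBlocks.asScala.foreach { block =>
        block.getColumns.asScala.foreach { col =>
          val stats = col.getStatistics
          println(s"${col.getPath}: min=${stats.minAsString}, " +
            s"max=${stats.maxAsString}, nulls=${stats.getNumNulls}")
        }
      }
    } finally {
      reader.close()
    }
  }
}
```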