Why does Spark SQL consider the support of indexes unimportant?

Asked 2021-01-30 11:03 by 走了就别回头了 · 2 answers · 1539 views

Quoting the Spark DataFrames, Datasets and SQL manual:

A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL.

2 Answers
  •  走了就别回头了
    2021-01-30 11:12

    Indexing input data

    • The fundamental reason why indexing over external data sources is outside Spark's scope is that Spark is not a data management system but a batch data-processing engine. Since it does not own the data it uses, it cannot reliably monitor changes and as a consequence cannot maintain indexes.
    • If a data source supports indexing, Spark can exploit it indirectly through mechanisms like predicate pushdown.
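
    The pushdown idea can be sketched in plain Python (this is a conceptual illustration, not Spark's actual data source API; the class and method names are invented for the example):

    ```python
    # Conceptual sketch of predicate pushdown: the engine hands the filter to
    # the source, so the source uses its own index and returns only matching
    # rows instead of shipping everything for the engine to filter.

    class IndexedSource:
        """Toy data source that maintains a value -> row-ids index."""
        def __init__(self, rows):
            self.rows = rows
            self.index = {}
            for i, row in enumerate(rows):
                self.index.setdefault(row["country"], []).append(i)

        def scan(self, country=None):
            # With a pushed-down equality predicate the source answers from
            # its index; without one it must perform a full scan.
            if country is not None:
                return [self.rows[i] for i in self.index.get(country, [])]
            return list(self.rows)

    rows = [{"country": "SE", "v": 1}, {"country": "US", "v": 2},
            {"country": "SE", "v": 3}]
    src = IndexedSource(rows)

    pushed = src.scan(country="SE")                           # source-side filter
    naive = [r for r in src.scan() if r["country"] == "SE"]   # engine-side filter
    assert pushed == naive  # same result, but far less data crosses the boundary
    ```

    Spark itself only forwards the predicate; maintaining the index is entirely the source's job, which is exactly why it can work even though Spark keeps no indexes of its own.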

    Indexing distributed data structures

    • Standard indexing techniques require a persistent and well-defined data distribution, but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
    • A high-level data layout achieved by proper partitioning, combined with columnar storage and compression, can provide very efficient distributed access without the overhead of creating, storing, and maintaining indexes. This is a common pattern used by different in-memory columnar systems.
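
    Partition pruning is the simplest form of this layout-based access. A minimal sketch (plain Python, modeled loosely on Hive-style `key=value` directory naming; the `read` helper is invented for the example):

    ```python
    # Sketch of partition pruning: when data is laid out by a partition key,
    # a query with a filter on that key only opens the matching partition
    # directories and skips the rest entirely, with no index involved.

    partitions = {
        "date=2021-01-28": [{"date": "2021-01-28", "clicks": 10}],
        "date=2021-01-29": [{"date": "2021-01-29", "clicks": 7}],
        "date=2021-01-30": [{"date": "2021-01-30", "clicks": 12}],
    }

    def read(target_date):
        # Only partitions whose directory name matches the filter are opened.
        wanted = f"date={target_date}"
        return [row
                for name, rows in partitions.items() if name == wanted
                for row in rows]

    assert read("2021-01-30") == [{"date": "2021-01-30", "clicks": 12}]
    ```

    Because pruning is driven by the directory layout alone, nothing has to be updated when files are added or removed, which sidesteps the index-maintenance problem described above.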

    That being said, some forms of indexed structures do exist in the Spark ecosystem. Most notably, Databricks provides a Data Skipping Index on its platform.
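
    Data skipping boils down to keeping cheap per-file statistics and ruling out files that cannot match. A minimal sketch, assuming simple min/max statistics per file (similar in spirit to Parquet column statistics; the data and `query_equal` helper are invented for the example):

    ```python
    # Sketch of data skipping: each file carries (min, max) statistics for a
    # column; a point query only scans files whose range could contain the
    # value and skips the rest.

    files = [
        {"stats": (1, 100),   "rows": [5, 50, 99]},
        {"stats": (101, 200), "rows": [120, 150]},
        {"stats": (201, 300), "rows": [250]},
    ]

    def query_equal(value):
        hits, scanned = [], 0
        for f in files:
            lo, hi = f["stats"]
            if lo <= value <= hi:   # this file might contain the value
                scanned += 1
                hits.extend(r for r in f["rows"] if r == value)
        return hits, scanned

    hits, scanned = query_equal(150)
    assert hits == [150] and scanned == 1   # two of the three files were skipped
    ```

    Unlike a classic index, these statistics are written once alongside each immutable file, so there is nothing to keep in sync as data is appended.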

    Other projects, like Succinct (mostly inactive today), take a different approach and use advanced compression techniques with random-access support.

    Of course this raises a question: if you require efficient random access, why not use a system designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time, Spark as a project evolves, and the quote you used might not fully reflect future Spark directions.
