Why does Spark SQL consider the support of indexes unimportant?

Asked 2021-01-30 11:03 by 走了就别回头了 · 2 answers · 1539 views

Quoting the Spark DataFrames, Datasets and SQL manual:

A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL.

2 Answers
  •  走了就别回头了
    2021-01-30 11:12

    Indexing input data

    • The fundamental reason why indexing over external data sources is outside Spark's scope is that Spark is not a data management system but a batch data-processing engine. Since it does not own the data it uses, it cannot reliably monitor changes and as a consequence cannot maintain indexes.
    • If a data source supports indexing, Spark can exploit it indirectly through mechanisms like predicate pushdown.
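
    The pushdown idea can be sketched in plain Python (this is a conceptual illustration, not Spark's actual data source API; the class and method names are invented for the example):

    ```python
    # Conceptual sketch of predicate pushdown: the engine hands the filter to
    # the source, so the source uses its own index and returns only matching
    # rows instead of shipping everything for the engine to filter.

    class IndexedSource:
        """Toy data source that maintains a value -> row-ids index."""
        def __init__(self, rows):
            self.rows = rows
            self.index = {}
            for i, row in enumerate(rows):
                self.index.setdefault(row["country"], []).append(i)

        def scan(self, country=None):
            # With a pushed-down equality predicate the source answers from
            # its index; without one it must perform a full scan.
            if country is not None:
                return [self.rows[i] for i in self.index.get(country, [])]
            return list(self.rows)

    rows = [{"country": "SE", "v": 1}, {"country": "US", "v": 2},
            {"country": "SE", "v": 3}]
    src = IndexedSource(rows)

    pushed = src.scan(country="SE")                           # source-side filter
    naive = [r for r in src.scan() if r["country"] == "SE"]   # engine-side filter
    assert pushed == naive  # same result, but far less data crosses the boundary
    ```

    Spark itself only forwards the predicate; maintaining the index is entirely the source's job, which is exactly why it can work even though Spark keeps no indexes of its own.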

    Indexing distributed data structures

    • Standard indexing techniques require a persistent and well-defined data distribution, but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
    • A high-level data layout achieved by proper partitioning, combined with columnar storage and compression, can provide very efficient distributed access without the overhead of creating, storing, and maintaining indexes. This is a common pattern used by different in-memory columnar systems.
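
    Partition pruning is the simplest form of this layout-based access. A minimal sketch (plain Python, modeled loosely on Hive-style `key=value` directory naming; the `read` helper is invented for the example):

    ```python
    # Sketch of partition pruning: when data is laid out by a partition key,
    # a query with a filter on that key only opens the matching partition
    # directories and skips the rest entirely, with no index involved.

    partitions = {
        "date=2021-01-28": [{"date": "2021-01-28", "clicks": 10}],
        "date=2021-01-29": [{"date": "2021-01-29", "clicks": 7}],
        "date=2021-01-30": [{"date": "2021-01-30", "clicks": 12}],
    }

    def read(target_date):
        # Only partitions whose directory name matches the filter are opened.
        wanted = f"date={target_date}"
        return [row
                for name, rows in partitions.items() if name == wanted
                for row in rows]

    assert read("2021-01-30") == [{"date": "2021-01-30", "clicks": 12}]
    ```

    Because pruning is driven by the directory layout alone, nothing has to be updated when files are added or removed, which sidesteps the index-maintenance problem described above.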

    That being said, some forms of indexed structures do exist in the Spark ecosystem. Most notably, Databricks provides a Data Skipping Index on its platform.
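
    Data skipping boils down to keeping cheap per-file statistics and ruling out files that cannot match. A minimal sketch, assuming simple min/max statistics per file (similar in spirit to Parquet column statistics; the data and `query_equal` helper are invented for the example):

    ```python
    # Sketch of data skipping: each file carries (min, max) statistics for a
    # column; a point query only scans files whose range could contain the
    # value and skips the rest.

    files = [
        {"stats": (1, 100),   "rows": [5, 50, 99]},
        {"stats": (101, 200), "rows": [120, 150]},
        {"stats": (201, 300), "rows": [250]},
    ]

    def query_equal(value):
        hits, scanned = [], 0
        for f in files:
            lo, hi = f["stats"]
            if lo <= value <= hi:   # this file might contain the value
                scanned += 1
                hits.extend(r for r in f["rows"] if r == value)
        return hits, scanned

    hits, scanned = query_equal(150)
    assert hits == [150] and scanned == 1   # two of the three files were skipped
    ```

    Unlike a classic index, these statistics are written once alongside each immutable file, so there is nothing to keep in sync as data is appended.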

    Other projects, like Succinct (mostly inactive today), take a different approach and use advanced compression techniques with random-access support.

    Of course this raises a question: if you require efficient random access, why not use a system designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time, Spark as a project evolves, and the quote you used might not fully reflect future Spark directions.
