MongoDB Spark Connector - aggregation is slow

Asked by 感动是毒 on 2021-02-10 10:51

I am running the same aggregation pipeline with a Spark application and on the mongos console. On the console, the data is fetched within the blink of an eye, but the Spark job takes far longer and launches a very large number of tasks.

1 Answer
  • Answered 2021-02-10 11:31

    The high number of tasks is caused by the default Mongo Spark partitioner strategy. It ignores the aggregation pipeline when calculating the partitions, for two main reasons:

    1. It reduces the cost of calculating partitions
    2. It ensures the same behaviour for sharded and non-sharded deployments

    However, as you've found, this strategy can generate empty partitions, which in your case is costly.

    The options for fixing this are:

    1. Change partitioning strategy

      Choose an alternative partitioner to reduce the number of partitions. For example, the PaginateByCount partitioner will split the collection into a set number of partitions (a configuration sketch follows this list).

      Create your own partitioner: implement the MongoPartitioner trait and you will be able to take the aggregation pipeline into account when partitioning the results. See the HalfwayPartitioner and the custom partitioner test in the connector source for an example.

    2. Pre-aggregate the results into a collection using $out and read from there (a sketch of this also follows the list).

    3. Use coalesce(N) to merge the partitions together and reduce the partition count.
    4. Increase the spark.mongodb.input.partitionerOptions.partitionSizeMB configuration to produce fewer partitions.
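
    For options 1, 3, and 4, here is a rough Scala sketch against the 2.x connector API; the URI, pipeline stage, partition counts, and app name are placeholders, not values from your setup:

    ```scala
    import com.mongodb.spark.MongoSpark
    import com.mongodb.spark.config.ReadConfig
    import org.apache.spark.{SparkConf, SparkContext}
    import org.bson.Document

    val sc = new SparkContext(new SparkConf().setAppName("mongo-aggregation"))

    // Option 1: fix the number of partitions with MongoPaginateByCountPartitioner.
    val readConfig = ReadConfig(Map(
      "uri"         -> "mongodb://host:27017/mydb.mycoll",   // placeholder URI
      "partitioner" -> "MongoPaginateByCountPartitioner",
      "partitionerOptions.numberOfPartitions" -> "64"
      // Option 4 (alternative): keep the default partitioner but make each
      // partition larger so fewer are created:
      //   "partitionerOptions.partitionSizeMB" -> "256"
    ))

    // Apply the aggregation pipeline to the load (placeholder $match stage).
    val rdd = MongoSpark
      .load(sc, readConfig)
      .withPipeline(Seq(Document.parse("""{ "$match": { "status": "A" } }""")))

    // Option 3: merge away small or empty partitions after loading.
    val compacted = rdd.coalesce(8)
    println(s"partitions after coalesce: ${compacted.getNumPartitions}")
    ```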
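
    For option 2, the idea is to run the pipeline once with a $out stage so MongoDB materialises the result server-side, then point the connector at the much smaller output collection; the collection names below are placeholders:

    ```scala
    import com.mongodb.spark.MongoSpark
    import com.mongodb.spark.config.ReadConfig
    import org.apache.spark.SparkContext

    // Run the pipeline once, server-side (e.g. in the mongo shell):
    //   db.mycoll.aggregate([
    //     { $match: { status: "A" } },  // ...your real pipeline stages...
    //     { $out: "preAggregated" }     // materialise the result
    //   ])

    // Then read the pre-aggregated collection. No pipeline is needed at
    // read time, so empty partitions are no longer an issue.
    def loadPreAggregated(sc: SparkContext) =
      MongoSpark.load(sc, ReadConfig(Map(
        "uri" -> "mongodb://host:27017/mydb.preAggregated")))
    ```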

    A custom partitioner should produce the best solution, but there are ways to make better use of the available default partitioners.
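
    To show the shape of the trait, here is a minimal sketch against the 2.x connector API. SinglePartitionPartitioner is a made-up name, and returning one unbounded partition is deliberately naive; a real implementation would use the pipeline argument (passed in for exactly this purpose) to emit sensibly bounded partitions:

    ```scala
    import com.mongodb.spark.MongoConnector
    import com.mongodb.spark.config.ReadConfig
    import com.mongodb.spark.rdd.partitioner.{MongoPartition, MongoPartitioner}
    import org.bson.BsonDocument

    // Puts everything in a single partition: no empty partitions, but also
    // no parallelism. Illustrative only.
    class SinglePartitionPartitioner extends MongoPartitioner {
      override def partitions(connector: MongoConnector,
                              readConfig: ReadConfig,
                              pipeline: Array[BsonDocument]): Array[MongoPartition] =
        Array(MongoPartition(0, new BsonDocument(), Nil))
    }
    ```

    Once compiled onto the classpath, it can be selected by passing the class's fully qualified name as the partitioner option.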

    If you think there should be a default partitioner that uses the aggregation pipeline to calculate the partitions then please add a ticket to the MongoDB Spark Jira project.
