Why does the Mongo Spark connector return different and incorrect counts for a query?

忘掉有多难 2021-01-12 11:07

I'm evaluating the Mongo Spark connector for a project and I'm getting inconsistent results. I use MongoDB server version 3.4.5, Spark (via PySpark) version 2.2.0, Mongo S

2 answers

生来不讨喜 (original poster)

2021-01-12 11:10
I solved my issue. The inconsistent counts came from MongoDefaultPartitioner, which wraps MongoSamplePartitioner, which in turn relies on random sampling. To be honest, this seems like an odd default to me; I would personally prefer a slower but consistent partitioner. Details on the partitioner options can be found in the official configuration options documentation.

code:
```
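// Read the collection with a deterministic, size-based partitioner
// instead of the sampling-based default.
val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://127.0.0.1/enron_mail.messages")
  // The value is the partitioner class name, not a configuration key path.
  .option("partitioner", "MongoPaginateBySizePartitioner")
  .load()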
```
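To check that the switch actually stabilizes the counts, the read can be repeated a few times and the results compared. The fragment below is only a sketch under stated assumptions: it assumes a 2.2.x Mongo Spark connector on the classpath and a MongoDB instance at the same URI as above, and the `partitionKey` and `partitionSizeMB` values shown are the documented defaults as far as I know, spelled out only to show where partitioner-specific settings go.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: run in spark-shell or a similar environment with the
// mongo-spark-connector artifact available; the app name is arbitrary.
val spark = SparkSession.builder()
  .appName("mongo-count-check")
  .master("local[*]")
  .getOrCreate()

val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://127.0.0.1/enron_mail.messages")
  .option("partitioner", "MongoPaginateBySizePartitioner")
  // Partitioner-specific settings use the "partitionerOptions." prefix.
  .option("partitionerOptions.partitionKey", "_id")
  .option("partitionerOptions.partitionSizeMB", "64")
  .load()

// The DataFrame is not cached, so each count() reads from MongoDB again.
val counts = (1 to 3).map(_ => df.count())
println(counts.mkString(", "))

// Fail loudly if the repeated counts disagree.
require(counts.distinct.size == 1, s"Inconsistent counts: $counts")
```

If the three counts still differ, the partitioner is probably not the only cause, and it may be worth comparing against `db.messages.count()` in the mongo shell.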