问题
There are many ways to select random document from a mongodb collection (as discussed in this answer). Comments point out that with mongodb version >= 3.2 then using $sample in the aggregation framework is preferred. However, on a collection with many small documents this seems to extremely slow.
The following code uses mongoengine to simulate the issue and compare it to the "skip random" method:
import timeit
from random import randint
import mongoengine as mdb
mdb.connect("test-agg")
class ACollection(mdb.Document):
name = mdb.StringField(unique=True)
meta = {'indexes': ['name']}
ACollection.drop_collection()
ACollection.objects.insert([ACollection(name="Document {}".format(n)) for n in range(50000)])
def agg():
doc = list(ACollection.objects.aggregate({"$sample": {'size': 1}}))[0]
print(doc['name'])
def skip_random():
n = ACollection.objects.count()
doc = ACollection.objects.skip(randint(1, n)).limit(1)[0]
print(doc['name'])
if __name__ == '__main__':
print("agg took {:2.2f}s".format(timeit.timeit(agg, number=1)))
print("skip_random took {:2.2f}s".format(timeit.timeit(skip_random, number=1)))
The result is:
Document 44551
agg took 21.89s
Document 25800
skip_random took 0.01s
Wherever I've had performance issues with mongodb in the past my answer has always been to use the aggregation framework so I'm surprised $sample
is so slow.
Am I missing something here? What is it about this example that is causing the aggregation to take so long?
回答1:
I can confirm that nothing has changed in 3.6 Slow $sample problem persists. Shame on you, MongoDB team.
~40m collection of small documents, no indexes, Windows Server 2012 x64.
storage: wiredTiger.engineConfig.journalCompressor: zlib wiredTiger.collectionConfig.blockCompressor: zlib
2018-04-02T02:27:27.743-0700 I COMMAND [conn4] command maps.places
command: aggregate { aggregate: "places", pipeline: [ { $sample: { size: 10 } } ],
cursor: {}, lsid: { id: UUID("0e846097-eecd-40bb-b47c-d77f1484dd7e") }, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: MULTI_ITERATOR keysExamined:0 docsExamined:0 cursorExhausted:1 numYields:3967 nreturned:10 reslen:550 locks:{ Global: { acquireCount: { r: 7942 } }, Database: { acquireCount: { r: 3971 } }, Collection: { acquireCount: { r: 3971 } } }
protocol:op_query 72609ms
I have installed Mongo to try this "modern and performant DBMS" in a serious project. How deeply I am frustrated.
Explain plan is here:
db.command('aggregate', 'places', pipeline=[{"$sample":{"size":10}}], explain=True)
{'ok': 1.0,
'stages': [{'$cursor': {'query': {},
'queryPlanner': {'indexFilterSet': False,
'namespace': 'maps.places',
'plannerVersion': 1,
'rejectedPlans': [],
'winningPlan': {'stage': 'MULTI_ITERATOR'}}}},
{'$sampleFromRandomCursor': {'size': 10}}]}
回答2:
This is a result of a known bug in the WiredTiger engine in versions of mongodb < 3.2.3. Upgrading to the latest version should solve this.
来源:https://stackoverflow.com/questions/37679999/mongodb-aggregation-with-sample-very-slow