MongoDB Aggregation Performance

前端 未结 1 858
野性不改
野性不改 2021-01-31 04:25

We have a problem of aggregation queries running long time (couple of minutes).

Collection:

We have a collection of 250 million documents with about 20 fields pe

相关标签:
1条回答
  • 2021-01-31 04:52
    1. Why the pipelining of the aggregation takes so much memory?

    Just performing a $match won't have to read the actual data, it can be done on the indexes. Through the projection's access of field1, the actual document will have to be read, and it will probably be cached as well.

    Also, grouping can be expensive. Normally, it should report an error if your grouping stage requires more than 100M of memory - what version are you using? It requires to scan the entire result set before yielding, and MongoDB will have to at least store a pointer or an index of each element in the groups. I guess the key reason for the memory increase is the former.

    1. How can we increase our performance for it to run on a reasonable time for HTTP request?

    Your dtKey appears to encode time, and the grouping is also done based on time. I'd try to exploit that fact - for instance, by precomputing aggregates for each day and our_id combination - makes a lot of sense if there's no more criteria and the data doesn't change much anymore.

    Otherwise, I'd try to move the {"our_id":"111111111"} criterion to the first position, because equality should always precede range queries. I guess the query optimizer of the aggregation framework is smart enough, but it's worth a try. Also, you might want to try turning your two indexes into a single compound index { our_id, dtkey }. Index intersections are supported now, but I'm not sure how efficient that is, really. Use the built-in profile and .explain() to analyze your query.

    Lastly, MongoDB is designed for write-heavy use and scanning data sets of hundreds of GB from disk in a matter of milliseconds isn't feasible computationally at all. If your dataset is larger than your RAM, you'll be facing massive IO delays on the scale of tens of milliseconds and upwards, tens or hundreds of thousands of times, because of all the required disk operations. Remember that with random access you'll never get even close to the theoretical sequential disk transfer rates. If you can't precompute, I guess you'll need a lot more RAM. Maybe SSDs help, but that is all just guesswork.

    0 讨论(0)
提交回复
热议问题