Updating large number of records in a collection

Asked by 甜味超标 on 2020-12-25 09:05

I have a collection called TimeSheet with a few thousand records at the moment. This will eventually grow to 300 million records in a year. In this collection I embed

1 Answer

    小蘑菇
    2020-12-25 09:24

    Let me give you a couple of hints based on my general knowledge and experience:

    Use shorter field names

    MongoDB stores the same key names in every document. This repetition increases disk usage, which can become a performance issue on a very large database like yours.

    Pros:

    • Smaller documents, so less disk space
    • More documents fit in RAM (better caching)
    • Smaller indexes in some scenarios

    Cons:

    • Less readable names
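
    As a rough illustration (plain Node.js, no MongoDB needed; the field names are made up and JSON sizes stand in for BSON sizes), you can estimate how much of each document is just repeated key names:

```javascript
// Sketch: estimate per-document overhead of long vs. short field names.
// BSON stores every key name in every document, so the saving scales
// with the document count.
const long = { employeeIdentifier: 12345, hoursWorkedPerDay: 8, projectDescription: "API" };
const short = { eid: 12345, hrs: 8, prj: "API" };

const longSize = JSON.stringify(long).length;   // rough bytes per document
const shortSize = JSON.stringify(short).length;
const docs = 300_000_000;                        // the collection's target size

const savedGB = (longSize - shortSize) * docs / 1e9;
console.log(`~${savedGB.toFixed(1)} GB saved just by shortening key names`);
```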

    Optimize index size

    The smaller an index is, the more of it fits in RAM and the fewer index misses occur. Consider the SHA-1 hash of a git commit, for example: a commit is often identified by just its first 5-6 characters. If a short prefix is enough to distinguish your values, store those 5-6 characters instead of the whole hash.
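
    A quick sketch of the idea (plain JavaScript; the hashes are arbitrary examples): before truncating, verify that the prefixes are still unique across your data set.

```javascript
// Sketch: store a short prefix of a hash instead of the full value,
// after checking the prefixes don't collide in the existing data.
const hashes = [
  "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12",
  "de9f2c7fd25e1b3afad3e85a0bd17d9b100db4b3",
  "a9993e364706816aba3e25717850c26c9cd0d89d",
];

function truncateUnique(values, len) {
  const prefixes = values.map(v => v.slice(0, len));
  const unique = new Set(prefixes).size === prefixes.length;
  return unique ? prefixes : null; // null => need a longer prefix
}

const short = truncateUnique(hashes, 6);
console.log(short); // 6-char prefixes, or null on collision
```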

    Understand padding factor

    Updates that grow a document can cause a costly document move: the old document is deleted, the document is rewritten to a new, larger location, and every index pointing to it is updated.

    We need to make sure documents don't move when they are updated. Each collection has a padding factor that tells MongoDB, on document insert, how much extra space to allocate beyond the actual document size.

    You can see the collection padding factor using:

    db.collection.stats().paddingFactor
    

    Add padding manually

    In your case you are fairly sure to start with small documents that will grow, and updating them after a while will cause multiple document moves. So it is better to add padding to the documents up front. Unfortunately, there is no easy way to add padding: we can do it by adding some random bytes to a throwaway key during insert and then deleting that key in the next update query.

    Finally, if you are sure that certain keys will be added to the documents in the future, preallocate those keys with default values so that later updates don't grow the document and cause document moves.
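
    The two tricks above can be sketched like this (plain JavaScript building the insert document; the field names and padding size are made-up assumptions, and the shell commands are shown only in comments):

```javascript
// Sketch: build an insert document with (a) a throwaway padding key and
// (b) preallocated future keys, so later updates stay in place.
function makeTimesheetDoc(empId, week) {
  return {
    empId,
    week,
    hours: [],             // preallocated: filled in by later updates
    approvedBy: null,      // preallocated: set once the sheet is approved
    _pad: "x".repeat(512), // throwaway padding, removed on first update
  };
}

const doc = makeTimesheetDoc(42, "2020-W52");
// The first real update would then do, in the mongo shell:
//   db.TimeSheet.update({ empId: 42, week: "2020-W52" },
//                       { $set: { hours: [8, 8, 8] }, $unset: { _pad: "" } })
console.log(Object.keys(doc));
```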

    You can get details about the query causing document move:

    db.system.profile.find({ moved: { $exists : true } })
    

    Large number of collections vs. large number of documents in a few collections

    Schema depends on the application's requirements. If there is a huge collection in which we only query the latest N days of data, we can optionally keep that recent data in a separate collection and safely archive the old data. This helps ensure that caching in RAM works well.

    Every collection you create incurs a cost. Each collection has a minimum size of a few KB plus one index (8 KB), and every collection has a namespace associated with it; by default there are some 24K namespaces. For example, having one collection per user is a bad choice since it is not scalable: after some point Mongo won't allow us to create new collections or indexes.

    Generally, having many collections carries no significant performance penalty. For example, we can choose to have one collection per month if we know that we always query by month.
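
    For instance, a per-month collection name could be derived like this (plain JavaScript; the naming scheme is just an illustration, not a fixed convention):

```javascript
// Sketch: route a document to a per-month collection based on its date.
function monthlyCollection(base, date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  return `${base}_${y}_${m}`;
}

// Queries for a given month then hit exactly one collection.
console.log(monthlyCollection("TimeSheet", new Date(Date.UTC(2020, 11, 25))));
// "TimeSheet_2020_12"
```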

    Denormalization of data

    It's always recommended to keep all the data related to a query, or a sequence of queries, in the same disk location. You sometimes need to duplicate information across different documents to achieve this. For example, in a blog you'll want to store a post's comments inside the post document.

    Pros:

    • The index will be much smaller, since there are fewer index entries
    • Queries that fetch all the necessary details will be very fast
    • Document size will be comparable to the page size, so when we bring this data into RAM we are mostly not dragging unrelated data along with the page
    • A document move will free a whole page rather than a tiny chunk within a page that may never be reused by later inserts
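
    A denormalized blog post might look like this (plain JavaScript; the document shape is an assumed example, not a fixed schema):

```javascript
// Sketch: comments embedded in the post document instead of a separate
// collection, so one read returns everything the post page needs.
const post = {
  _id: "post-123",
  title: "Padding in MongoDB",
  body: "...",
  comments: [
    { author: "alice", text: "Nice tip!", at: "2020-12-25" },
    { author: "bob",   text: "Thanks.",   at: "2020-12-26" },
  ],
};

// One query fetches the post and its comments; in the mongo shell:
//   db.posts.findOne({ _id: "post-123" })
console.log(post.comments.length); // 2
```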

    Capped Collections

    Capped collections behave like circular buffers. They are a special type of fixed-size collection that can sustain very high-speed writes and sequential reads. Being fixed size, once the allocated space is full, new documents are written by deleting the oldest ones. However, document updates are only allowed if the updated document fits within the original document's size (play with padding for more flexibility).
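
    The circular-buffer behavior can be simulated in a few lines (plain JavaScript; note MongoDB caps by bytes, while this toy version caps by document count):

```javascript
// Sketch: fixed-size buffer that evicts the oldest entry on overflow,
// mimicking how a capped collection overwrites its oldest documents.
class CappedBuffer {
  constructor(max) { this.max = max; this.docs = []; }
  insert(doc) {
    if (this.docs.length === this.max) this.docs.shift(); // drop oldest
    this.docs.push(doc);
  }
}

const cap = new CappedBuffer(3);
[1, 2, 3, 4, 5].forEach(n => cap.insert({ n }));
console.log(cap.docs.map(d => d.n)); // [3, 4, 5] — oldest entries evicted
```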
