Best way to read and update MongoDB documents using PyMongo

Posted by 筅森魡賤 on 2019-12-23 08:35:51

Question


I am trying to read a MongoDB collection document by document in order to fetch every record, encrypt some of the fields in the record, and put it back into the database.

for record in coll.find():
    # modify record here
    coll.update({'_id': record['_id']}, record)

This causes a serious problem: documents that have already been updated are read again by the cursor, so the same document is processed again in the loop and updated a second time.

I hope the following may be one solution to the problem.

list_coll = list(coll.find())
for rec in list_coll:
    # modify record here
    coll.update({'_id': rec['_id']}, rec)

But is this the best way of doing it? What happens if the collection is large? Can a large list_coll exhaust the available RAM? Kindly suggest the best way of doing this.

Thanks


Answer 1:


You want the "Bulk Operations API" from MongoDB. Most of it was introduced with MongoDB 2.6, which is a compelling reason to upgrade if you have not already.

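bulk = coll.initialize_ordered_bulk_op()
counter = 0

for record in coll.find(snapshot=True):
    # calculate the new value first; encrypt() is a placeholder for your own function
    new_value = encrypt(record['field'])
    # queue the update instead of sending it to the server immediately
    bulk.find({'_id': record['_id']}).update({'$set': {'field': new_value}})
    counter += 1

    # send the queued updates once every 1000 operations, then start a new batch
    if counter % 1000 == 0:
        bulk.execute()
        bulk = coll.initialize_ordered_bulk_op()

# send whatever is left over from the last partial batch
if counter % 1000 != 0:
    bulk.execute()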

This is much better because you are not sending every request to the server individually, only one batch per 1000 operations. The "Bulk API" actually sorts some of this out for you, but you really want to manage the batching yourself so that your application does not hold too many queued operations in memory.

Way of the future. Use it.
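For readers on newer PyMongo releases, where `initialize_ordered_bulk_op` is no longer available, the same batching idea can be written with `bulk_write` and `UpdateOne`. The following is only a minimal sketch under stated assumptions: `encrypt`, the field name `field`, and the helper names are hypothetical, and sorting on `_id` is my assumed stand-in for the `snapshot` option, not something this answer prescribes.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def encrypt_field_in_batches(coll, encrypt, field="field", batch_size=1000):
    """Rewrite `field` on every document, sending one bulk request per batch.

    `coll` is a pymongo Collection; `encrypt` is your own function (hypothetical).
    Returns the number of update operations sent.
    """
    # Imported lazily so that chunks() stays usable without PyMongo installed.
    from pymongo import UpdateOne

    # Assumption: walking the collection in _id order stands in for snapshot=True.
    cursor = coll.find({}, {field: 1}).sort("_id", 1)
    sent = 0
    for batch in chunks(cursor, batch_size):
        requests = [
            UpdateOne({"_id": doc["_id"]}, {"$set": {field: encrypt(doc[field])}})
            for doc in batch
            if field in doc  # skip documents that lack the field
        ]
        if requests:  # bulk_write rejects an empty request list
            coll.bulk_write(requests, ordered=True)
            sent += len(requests)
    return sent
```

Only one batch of documents is held in memory at a time, which addresses the RAM concern raised in the question.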




Answer 2:


If your collection isn't sharded, you can keep your find cursor from returning the same document again after it has been updated by using the snapshot parameter:

for record in coll.find(snapshot=True):
    # modify record here
    coll.update({'_id': record['_id']}, record)

If your collection is sharded, keep a hash (for example a Python set) of the _id values that you've already updated, and check it before you modify each record to ensure you don't update the same one twice.
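The bookkeeping described for the sharded case might look roughly like this. It is a sketch, not a definitive implementation: `update_each_once` and `transform` are hypothetical names, and it assumes the `_id` values are hashable (as `ObjectId` values are) and that the full set of ids fits in memory.

```python
def update_each_once(coll, transform):
    """Apply `transform` to every document at most once per run.

    `transform` (hypothetical) takes a document and returns a dict of
    field -> new value that is passed to $set.
    Returns the number of distinct documents processed.
    """
    seen_ids = set()  # _id values already processed in this run
    for record in coll.find():
        if record["_id"] in seen_ids:
            continue  # the cursor returned a document we already handled
        seen_ids.add(record["_id"])
        changes = transform(record)
        if changes:  # skip the write when there is nothing to set
            coll.update_one({"_id": record["_id"]}, {"$set": changes})
    return len(seen_ids)
```

A call such as `update_each_once(coll, lambda r: {"field": encrypt(r["field"])})` would then process each document once, with `encrypt` being your own function.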




Answer 3:


Mark each record as updated, e.g. by adding a flag or by making sure that the updated field has a certain form that can be matched by a query.

Use the query to match only documents that were not updated yet, and double check each document as you iterate.

Why?

  • Because the collection may be too large to manage updated IDs in a local hash

  • Because your process might crash and leave the collection in a half-updated state. You may want to be able to resume it.

If this is a one-time job on a non-sharded collection, consider using a snapshot query.
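The flag-based approach from this answer can be sketched as follows. The flag name `encrypted`, the field name `field`, and the `encrypt` function are all assumptions made for illustration. The update filter repeats the flag check, which is one way to "double check" each document at write time.

```python
def encrypt_unflagged(coll, encrypt, field="field", flag="encrypted"):
    """Encrypt `field` on documents that are not flagged yet, then flag them.

    `flag` and `encrypt` are hypothetical names chosen for this sketch.
    Returns the number of documents actually modified.
    """
    # $ne also matches documents in which the flag field is missing.
    not_done = {flag: {"$ne": True}}
    modified = 0
    for record in coll.find(not_done):
        if field not in record:
            continue  # nothing to encrypt in this document
        result = coll.update_one(
            # Re-check the flag in the filter so a document is never encrypted twice.
            {"_id": record["_id"], flag: {"$ne": True}},
            {"$set": {field: encrypt(record[field]), flag: True}},
        )
        modified += result.modified_count
    return modified
```

Because the value and the flag are set in the same update, a crashed run can simply be started again: documents that are already flagged no longer match the query.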



Source: https://stackoverflow.com/questions/25485042/best-way-to-read-and-update-mongodb-documents-using-pymongo
