Best way to read and update mongodb documents using pymongo

问题

iam trying to read a mongodb collection document by document in order to fetch every record encrypt some of fields in the record and put it back to database.

for record in coll.find():
    #modifying record here
    coll.update(record)

this is causing a serious problem i.e already updated documents are read again by cursor and same document is processed again in loop (same document is trying to update again)

hope this may be one of the solution to the problem.

list_coll = [record for record in coll.find()]
for rec in list_coll:
   #modifying record
   coll.update(rec)

but is this the best way of doing? i.e what happens if the collection is large ? can large list_coll causes ram overflow? kindly suggest me a best way of doing it.

thanks

回答1:

You want the "Bulk Operations API" from MongoDB. Mostly introduced with MongoDB 2.6, so a compelling reason to be upgrading if you currently have not.

bulk = db.coll.initialize_ordered_bulk_op()
counter = 0

for record in coll.find(snapshot=True):
    # now process in bulk
    # calc value first
    bulk.find({ '_id': record['_id'] }).update({ '$set': { 'field': newValue } })
    counter += 1

    if counter % 1000 == 0:
        bulk.execute()
        bulk = db.coll.initialize_ordered_bulk_op()

if counter % 1000 != 0:
    bulk.execute()

Much better as you are not sending "every" request to the server, just once in every 1000 requests. The "Bulk API" actually sorts this out for you somewhat, but really you want to "manage" this a little better and not consume too much memory in your app.

Way of the future. Use it.

回答2:

If your collection isn't sharded you can isolate your find cursor from seeing the same doc again after it's updated by using the snapshot parameter:

for record in coll.find(snapshot = True):
    #modifying record here
    coll.update(record)

If your collection is sharded, keep a hash variable of the _id values that you've already updated and then check that list before you modify each record to ensure you don't update the same one twice.

回答3:

Mark each record as updated, e.g. by adding a flag or by making sure that the updated field has a certain form that can be matched by a query.

Use the query to match only documents that were not updated yet, and double check each document as you iterate.

Why?

Because the collection may be too large to manage updated IDs in a local hash
Because your process might crash and leave the collection in a half-updated state. You may want to be able to resume it.

If this is a one-time job on a non-sharded collection, consider using a snapshot query.

来源：https://stackoverflow.com/questions/25485042/best-way-to-read-and-update-mongodb-documents-using-pymongo

标签

python

mongodb

pymongo

mongodb-query