How can I use a cursor.forEach() in MongoDB using Node.js?

你的背包 2020-11-27 14:04

I have a huge collection of documents in my DB and I'm wondering how I can run through all the documents and update them, each document with a different value.

10 Answers
  • 2020-11-27 14:39

    Using the mongodb driver and modern Node.js with async/await, a good solution is to use next():

    const collection = db.collection('things')
    const cursor = collection.find({
      bla: 42 // find all things where bla is 42
    });
    let document;
    while ((document = await cursor.next())) {
      await collection.findOneAndUpdate({
        _id: document._id
      }, {
        $set: {
          blu: 43
        }
      });
    }
    

    This results in only one document at a time being held in memory, as opposed to e.g. the accepted answer, where many documents get pulled into memory before processing starts. In the case of "huge collections" (as per the question) this may be important.

    If documents are large, this can be improved further by using a projection, so that only those fields of documents that are required are fetched from the database.
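
    For example, a minimal sketch of such a projection (assuming only _id is needed for the update, and a driver version that accepts a projection option):

    const cursor = collection.find(
      { bla: 42 },
      { projection: { _id: 1 } } // fetch only the _id field of each document
    );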

  • 2020-11-27 14:43

    Let's assume we have the following MongoDB data in place.

    Database name: users
    Collection name: jobs
    ===========================
    Documents
    { "_id" : ObjectId("1"), "job" : "Security", "name" : "Jack", "age" : 35 }
    { "_id" : ObjectId("2"), "job" : "Development", "name" : "Tito" }
    { "_id" : ObjectId("3"), "job" : "Design", "name" : "Ben", "age" : 45}
    { "_id" : ObjectId("4"), "job" : "Programming", "name" : "John", "age" : 25 }
    { "_id" : ObjectId("5"), "job" : "IT", "name" : "ricko", "age" : 45 }
    ==========================
    

    This code connects to the database and logs every job in the collection:

    var MongoClient = require('mongodb').MongoClient;
    var dbURL = 'mongodb://localhost/users';

    MongoClient.connect(dbURL, (err, db) => {
        if (err) {
            throw err;
        } else {
            console.log('Connection successful');
            var dataBase = db.db();
            // loop over every document in the jobs collection
            dataBase.collection('jobs').find().forEach(function (myDoc) {
                console.log('There is a job called ' + myDoc.job + ' in the database');
            });
        }
    });
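
    Given the sample data above, the output should look something like:

    Connection successful
    There is a job called Security in the database
    There is a job called Development in the database
    There is a job called Design in the database
    There is a job called Programming in the database
    There is a job called IT in the database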
    
  • 2020-11-27 14:46

    The node-mongodb-native driver now supports an endCallback parameter to cursor.forEach, which lets you handle the event after the whole iteration has finished; refer to the official documentation for details: http://mongodb.github.io/node-mongodb-native/2.2/api/Cursor.html#forEach.
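
    For example, a minimal sketch of the two-callback form (per the 2.2 API linked above):

    collection.find(query).forEach(function (doc) {
      // iterator: called once per document
    }, function (err) {
      // endCallback: called once, after the whole iteration has finished (or errored)
      if (err) throw err;
      console.log('iteration complete');
    });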

    Also note that .each is deprecated in the Node.js native driver now.

  • 2020-11-27 14:49

    I looked for a solution with good performance, and I ended up creating a mix of what I found, which I think works well:

    /**
     * This method reads the documents from the cursor in batches and invokes the
     * callback for each document of a batch in parallel.
     * It is strongly recommended to create the cursor with a batchSize option that
     * matches the value of the batchSize argument. That way the performance benefits
     * are maximized, since the mongo instance sends into our process memory the same
     * number of documents that we handle concurrently each time, so no memory space
     * is wasted and memory usage stays bounded.
     *
     * Example of usage:
     * const cursor = await collection.aggregate([
     *     {...}, ...],
     *     {
     *         cursor: {batchSize: BATCH_SIZE} // Limiting memory use
     *     });
     * DbUtil.concurrentCursorBatchProcessing(cursor, BATCH_SIZE, async (doc) => ...)
     * @param cursor - A cursor to batch process on.
     * We can get this from our collection.js API by either using aggregateCursor/findCursor
     * @param batchSize - The batch size; should match the batchSize of the cursor option.
     * @param callback - An async callback, invoked in parallel for each document of a batch.
     * @return {Promise<void>}
     */
    static async concurrentCursorBatchProcessing(cursor, batchSize, callback) {
        let doc;
        const docsBatch = [];
    
        while ((doc = await cursor.next())) {
            docsBatch.push(doc);
    
            if (docsBatch.length >= batchSize) {
                await PromiseUtils.concurrentPromiseAll(docsBatch, async (currDoc) => {
                    return callback(currDoc);
                });
    
                // Emptying the batch array
                docsBatch.splice(0, docsBatch.length);
            }
        }
    
        // Checking if there is a last remaining batch, since it may be smaller than batchSize
        if (docsBatch.length > 0) {
            await PromiseUtils.concurrentPromiseAll(docsBatch, async (currDoc) => {
                return callback(currDoc);
            });
        }
    }
    

    An example of usage for reading many big documents and updating them:

    const cursor = await collection.aggregate([
        {
            ...
        }
    ], {
        cursor: {batchSize: BATCH_SIZE}, // Limiting memory use
        allowDiskUse: true
    });
    
        const bulkUpdates = [];
    
        await DbUtil.concurrentCursorBatchProcessing(cursor, BATCH_SIZE, async (doc: any) => {
            const update: any = {
                updateOne: {
                    filter: {
                        ...
                    },
                    update: {
                       ...
                    }
                }
            };            
    
            bulkUpdates.push(update);
    
        // Flushing the accumulated updates once the array grows too large, to free memory
            await this.bulkWriteIfNeeded(bulkUpdates, collection);
        });
    
        // Making sure we updated everything
        await this.bulkWriteIfNeeded(bulkUpdates, collection, true);
    

    ...

    private async bulkWriteIfNeeded(
        bulkUpdates: any[], collection: any,
        forceUpdate = false, flushBatchSize = BATCH_SIZE) {

        if (bulkUpdates.length >= flushBatchSize || forceUpdate) {
            // concurrentPromiseChunked loops over an array concurrently using lodash.chunk and Promise.map
            await PromiseUtils.concurrentPromiseChunked(bulkUpdates, (updatesChunk: any) => {
                return collection.bulkWrite(updatesChunk);
            });

            // Emptying the array
            bulkUpdates.splice(0, bulkUpdates.length);
        }
    }
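
    For reference, PromiseUtils.concurrentPromiseAll and PromiseUtils.concurrentPromiseChunked above are custom helpers, not driver APIs. A hypothetical sketch of them (assuming bluebird and lodash are available; the chunkSize and concurrency defaults are arbitrary) could look like this:

    const Promise = require('bluebird');
    const _ = require('lodash');

    class PromiseUtils {
        // Runs the async callback for every item of the array concurrently
        // and resolves once all of them have finished.
        static concurrentPromiseAll(items, callback) {
            return Promise.all(items.map((item) => callback(item)));
        }

        // Splits the array into chunks with lodash.chunk and maps over the
        // chunks with bounded concurrency using bluebird's Promise.map.
        static concurrentPromiseChunked(items, callback, chunkSize = 1000, concurrency = 5) {
            return Promise.map(_.chunk(items, chunkSize), callback, { concurrency });
        }
    }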
    
  • 2020-11-27 14:50

    None of the previous answers mentions batching the updates, which makes them extremely slow.
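
    A minimal sketch of batching with the driver's bulkWrite, reusing the blu: 43 update from the first answer (the batch size of 1000 is an arbitrary choice; this runs inside an async function):

    const cursor = collection.find({ bla: 42 });
    const BATCH_SIZE = 1000; // arbitrary flush threshold
    let ops = [];
    let doc;

    while ((doc = await cursor.next())) {
      ops.push({
        updateOne: {
          filter: { _id: doc._id },
          update: { $set: { blu: 43 } }
        }
      });

      // Flush a full batch as a single round trip to the server
      if (ops.length === BATCH_SIZE) {
        await collection.bulkWrite(ops);
        ops = [];
      }
    }

    // Flush whatever is left over
    if (ops.length > 0) {
      await collection.bulkWrite(ops);
    }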

  • 2020-11-27 14:51

    The answer depends on the driver you're using. All MongoDB drivers I know have cursor.forEach() implemented one way or another.

    Here are some examples:

    node-mongodb-native

    collection.find(query).forEach(function(doc) {
      // handle
    }, function(err) {
      // done or error
    });
    

    mongojs

    db.collection.find(query).forEach(function(err, doc) {
      // handle
    });
    

    monk

    collection.find(query, { stream: true })
      .each(function(doc){
        // handle doc
      })
      .error(function(err){
        // handle error
      })
      .success(function(){
        // final callback
      });
    

    mongoose

    collection.find(query).stream()
      .on('data', function(doc){
        // handle doc
      })
      .on('error', function(err){
        // handle error
      })
      .on('end', function(){
        // final callback
      });
    

    Updating documents inside the .forEach callback

    The only problem with updating documents inside the .forEach callback is that you have no idea when all of the documents have been updated.

    To solve this problem you should use some asynchronous control flow solution. Here are some options:

    • async
    • promises (when.js, bluebird); a promise-based sketch follows the async example below

    Here is an example using async and its queue feature:

    var q = async.queue(function (doc, callback) {
      // code for your update
      collection.update({
        _id: doc._id
      }, {
        $set: {hi: 'there'}
      }, {
        w: 1
      }, callback);
    }, Infinity);
    
    var cursor = collection.find(query);
    cursor.each(function(err, doc) {
      if (err) throw err;
      if (doc) q.push(doc); // dispatching doc to async.queue
    });
    
    q.drain = function() {
      if (cursor.isClosed()) {
        console.log('all items have been processed');
        db.close();
      }
    }
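
    And here is a minimal promise-based sketch of the same idea (assuming a driver version where cursor.next() and updateOne return promises when no callback is passed), processing one document at a time:

    var cursor = collection.find(query);

    function processAll() {
      return cursor.next().then(function (doc) {
        if (!doc) return; // null means the cursor is exhausted
        return collection.updateOne(
          { _id: doc._id },
          { $set: { hi: 'there' } }
        ).then(processAll); // chain on to the next document
      });
    }

    processAll().then(function () {
      console.log('all items have been processed');
      return db.close();
    });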
    