Removing white spaces (leading and trailing) from string value

情歌与酒 2020-12-10 00:00

I have imported a CSV file into MongoDB using mongoimport and I want to remove leading and trailing white space from my string values.

Is it possible directly in mongo t

4 Answers
  • 2020-12-10 00:17

    A small correction to Neil's answer regarding the Bulk Operations API. The method is

    initializeOrderedBulkOp
    

    not

    initializeBulkOrderedOp
    

    You also forgot

    counter++;
    

    inside the forEach. In summary:

    var counter = 0;
    var bulk = db.collection.initializeOrderedBulkOp();
    db.collection.find({ "category": /^\s+|\s+$/ },{ "category": 1}).forEach(
        function(doc) {
            bulk.find({ "_id": doc._id }).update({
                "$set": { "category": doc.category.trim() }
            });
            counter++;
    
            // Flush every 1000 queued updates, then start a fresh bulk
            if ( counter % 1000 == 0 ) {
                bulk.execute();
                bulk = db.collection.initializeOrderedBulkOp();
                counter = 0;
            }
        }
    );
    
    // Flush any updates still queued
    if ( counter > 0 )
        bulk.execute();
    

    Note: I don't have enough reputation to comment, hence adding an answer

  • 2020-12-10 00:23

    In MongoDB versions before 4.2, an update cannot refer to the existing value of a field when applying the update, so you have to loop:

    db.collection.find({},{ "category": 1 }).forEach(function(doc) {
       doc.category = doc.category.trim();
       db.collection.update(
           { "_id": doc._id },
           { "$set": { "category": doc.category } }
       );
    })
    

    Note the use of the $set operator there, and the projection of only the "category" field in order to reduce network traffic.
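    The loop above calls trim() on every document, which throws a TypeError wherever "category" is missing or is not a string. A minimal sketch of a guard, runnable outside the shell; the helper name trimmedCategory and the sample documents are made up for illustration:

```javascript
// Hypothetical helper (not part of MongoDB): returns the trimmed value only
// when "category" is a string that actually changes when trimmed.
function trimmedCategory(doc) {
  if (typeof doc.category !== "string") {
    return null; // missing, null, or non-string: nothing to update
  }
  var trimmed = doc.category.trim();
  return trimmed === doc.category ? null : trimmed;
}

// Made-up sample documents standing in for a cursor.
var docs = [
  { _id: 1, category: " IT  " },
  { _id: 2, category: "Legal" },
  { _id: 3 },
  { _id: 4, category: 42 }
];

// Collect the updates that the shell loop would issue.
var updates = [];
docs.forEach(function (doc) {
  var value = trimmedCategory(doc);
  if (value !== null) {
    updates.push({ _id: doc._id, category: value });
  }
});

console.log(JSON.stringify(updates)); // prints [{"_id":1,"category":"IT"}]
```

    Inside the shell loop, the same check would go before the update call, skipping the document whenever the helper returns null.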

    You can limit what gets processed by using $regex conditions that match white space at either end of the string:

    db.collection.find({ 
        "$or": [
            { "category": /^\s+/ },
            { "category": /\s+$/ }
        ]
    })
    

    Or as a single $regex using alternation, which needs no logical operator at all:

    db.collection.find({ "category": /^\s+|\s+$/ })
    

    This restricts processing to only those documents with leading or trailing white space.
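    To see what the single pattern selects, it can be tried against plain strings outside the database. A small sketch with made-up sample values; MongoDB evaluates regular expressions with PCRE rather than the JavaScript engine, so treat this only as an approximation of the server-side match:

```javascript
// Same pattern as in the query: leading OR trailing white space.
var needsTrim = /^\s+|\s+$/;

// Made-up sample values.
var samples = [" IT  ", "Legal ", "\tTax", "Clean", "Two words"];

// Keep only the values the pattern would select for trimming.
var matched = samples.filter(function (s) {
  return needsTrim.test(s);
});

console.log(matched.length); // prints 3
console.log(JSON.stringify(matched));
```

    "Clean" and "Two words" are left out: white space inside the string does not match, only white space at either end.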

    If you are worried about the number of documents to loop over, bulk updating should help if you have MongoDB 2.6 or greater available:

    var batch = [];
    db.collection.find({ "category": /^\s+|\s+$/ },{ "category": 1 }).forEach(
        function(doc) {
            batch.push({
                "q": { "_id": doc._id },
                "u": { "$set": { "category": doc.category.trim() } }
            });
    
            // Send the queued updates once 1000 have accumulated
            if ( batch.length % 1000 == 0 ) {
                db.runCommand({ "update": "collection", "updates": batch });
                batch = [];
            }
        }
    );
    
    // Send any remaining updates
    if ( batch.length > 0 )
        db.runCommand({ "update": "collection", "updates": batch });
    

    Or even with the bulk operations API for MongoDB 2.6 and above:

    var counter = 0;
    var bulk = db.collection.initializeOrderedBulkOp();
    db.collection.find({ "category": /^\s+|\s+$/ },{ "category": 1}).forEach(
        function(doc) {
            bulk.find({ "_id": doc._id }).update({
                "$set": { "category": doc.category.trim() }
            });
            counter = counter + 1;
    
            if ( counter % 1000 == 0 ) {
                bulk.execute();
                bulk = db.collection.initializeOrderedBulkOp();
            }
        }
    );
    
    if ( counter % 1000 != 0 )
        bulk.execute();
    

    With modern APIs this is best done with bulkWrite(), which uses the Bulk Operations API (technically everything does now) but in a way that degrades safely on older versions of MongoDB. In all honesty, that would mean versions prior to MongoDB 2.6, and you would be well out of coverage for official support options using such a version. The code is somewhat cleaner:

    var batch = [];
    db.collection.find({ "category": /^\s+|\s+$/ },{ "category": 1}).forEach(
      function(doc) {
        batch.push({
          "updateOne": {
            "filter": { "_id": doc._id },
            "update": { "$set": { "category": doc.category.trim() } }
          }
        });
    
        if ( batch.length % 1000 == 0 ) {
          db.collection.bulkWrite(batch);
          batch = [];
        }
      }
    );
    
    if ( batch.length > 0 ) {
      db.collection.bulkWrite(batch);
      batch = [];
    }
    

    All of these send operations to the server only once per 1000 documents, or as many modifications as fit under the 16MB BSON document size limit.
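    The batching pattern shared by the examples above can be isolated from the database. A minimal sketch, assuming a hypothetical flush callback in place of db.collection.bulkWrite() and made-up documents in place of a cursor:

```javascript
// Queues one updateOne operation per document and hands them to `flush`
// in groups of at most `batchSize`. `flush` is a stand-in for a call such
// as db.collection.bulkWrite(batch); it is an assumption of this sketch.
function trimInBatches(docs, batchSize, flush) {
  var batch = [];
  docs.forEach(function (doc) {
    batch.push({
      updateOne: {
        filter: { _id: doc._id },
        update: { $set: { category: doc.category.trim() } }
      }
    });
    if (batch.length === batchSize) {
      flush(batch);
      batch = []; // start a new array; the flushed one is left untouched
    }
  });
  if (batch.length > 0) {
    flush(batch); // send the final partial batch
  }
}

// Made-up input: 2500 documents with padded values.
var docs = [];
for (var i = 0; i < 2500; i++) {
  docs.push({ _id: i, category: "  value " + i + " " });
}

// Record the size of each batch instead of talking to a server.
var batchSizes = [];
trimInBatches(docs, 1000, function (batch) {
  batchSizes.push(batch.length);
});

console.log(batchSizes.join(", ")); // prints 1000, 1000, 500
```

    Checking the batch length after each push, and flushing the remainder only when it is non-empty, avoids both a skipped final batch and an empty write.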

    Those are just a few ways to approach the problem. Alternatively, clean up your CSV file before importing it.

  • 2020-12-10 00:32
    • Starting in MongoDB 4.2, db.collection.update() can accept an aggregation pipeline, finally allowing a field to be updated based on its own value.

    • Starting in MongoDB 4.0, the $trim operator can be applied to a string to remove its leading and trailing white space:

    // { category: "Financial & Legal Services " }
    // { category: " IT  " }
    db.collection.update(
      {},
      [{ $set: { category: { $trim: { input: "$category" } } } }],
      { multi: true }
    )
    // { category: "Financial & Legal Services" }
    // { category: "IT" }
    

    Note that:

    • The first part {} is the match query, filtering which documents to update (in this case all documents).

    • The second part [{ $set: { category: { $trim: { input: "$category" } } } }] is the update aggregation pipeline (note the square brackets signifying the use of an aggregation pipeline):

      • $set is an aggregation stage introduced in MongoDB 4.2 (an alias for $addFields), which in this case replaces the value of "category".
      • $trim strips the leading and trailing white space from the value of "category".
      • Note that $trim can take an optional parameter chars which allows specifying which characters to trim.
    • Don't forget { multi: true }, otherwise only the first matching document will be updated.
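    $trim expects its input to be a string, so on mixed data it may be safer to narrow the match query to string values. A sketch that only builds the filter and pipelines as plain objects (the updateMany() call is left as a comment because it needs a live 4.2+ server); the chars value is just an illustration:

```javascript
// Match only documents whose "category" is a string.
var filter = { category: { $type: "string" } };

// Default behaviour: trim white space from both ends.
var trimWhitespace = [
  { $set: { category: { $trim: { input: "$category" } } } }
];

// With "chars": trim only the listed characters (here space and hyphen).
var trimSpacesAndHyphens = [
  { $set: { category: { $trim: { input: "$category", chars: " -" } } } }
];

// In the shell, either pipeline would be passed as the update argument:
//   db.collection.updateMany(filter, trimWhitespace)
console.log(JSON.stringify(filter));
console.log(JSON.stringify(trimWhitespace));
console.log(JSON.stringify(trimSpacesAndHyphens));
```

    updateMany() updates every matching document, so { multi: true } is not needed with it.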

  • 2020-12-10 00:33

    You can run JavaScript alongside a MongoDB update command by issuing the update inside a cursor method such as forEach():

    db.collection.find({},{ "category": 1 }).forEach(function(doc) {
      db.collection.update(
        { "_id": doc._id },
        { "$set": { "category": doc.category.trim() } }
      );
    })
    

    If you have a ton of records and need to batch process, you might want to look at the other answers here.
