Extracting a list of substrings from MongoDB using a Regular Expression

感动是毒 2021-01-15 05:53

I need to extract a part of a string that matches a regex and return it.

I have a set of documents such as:

{\"_id\" :12121, \"fileName\" : \"apple.d         


        
3 Answers
  • 2021-01-15 06:04

    This is close to impossible in the aggregation pipeline. You want to project your matches and keep only the part after the period, but there is no operator (yet) that locates the position of the period. You need that position because $substr (https://docs.mongodb.com/manual/reference/operator/aggregation/substr/) requires a start position. In addition, $regex is only for matching; you cannot use it in a projection to replace text.

    For now I think it is easier to do this in code. There you can use a regex replace or any other solution your language provides.
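
    As a rough sketch of the "do it in code" route, assuming the documents have already been fetched into an array: the `docs` array and the `extractExtensions` helper below are made up for illustration and are not part of any MongoDB driver.

```javascript
// Hypothetical sample documents, shaped like the ones in the question.
const docs = [
  { _id: 12121, fileName: "apple.doc" },
  { _id: 12125, fileName: "rap.txt" },
  { _id: 12126, fileName: "tap.pdf" },
  { _id: 12127, fileName: "cricket.txt" },
];

// Return the trailing ".ext" of each file name, skipping names without one.
function extractExtensions(documents) {
  const pattern = /\.[0-9a-z]+$/i;
  return documents
    .map((doc) => pattern.exec(doc.fileName)) // null when there is no extension
    .filter((m) => m !== null)
    .map((m) => m[0]);
}

console.log(extractExtensions(docs)); // [ '.doc', '.txt', '.pdf', '.txt' ]
```

    The same pattern could be used with a replace instead of a match; matching is used here because it leaves names without a period out of the result.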

  • 2021-01-15 06:30

    This will be possible in the upcoming version of MongoDB (as of this writing) using the aggregation framework and the $indexOfCP operator. Until then, your best bet is MapReduce.

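    // Map: emit each document's _id with everything from the first "." onward.
    var mapper = function() { 
        emit(this._id, this.fileName.substring(this.fileName.indexOf(".")))
    };
    
    // Each _id is emitted only once, so reduce is never called and can be a no-op.
    db.coll.mapReduce(mapper, 
                      function(key, value) {}, 
                      { "out": { "inline": 1 }}
    )["results"]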
    

    Which yields:

    [
        {
            "_id" : 12121,
            "value" : ".doc"
        },
        {
            "_id" : 12125,
            "value" : ".txt"
        },
        {
            "_id" : 12126,
            "value" : ".pdf"
        },
        {
            "_id" : 12127,
            "value" : ".txt"
        }
    ]
    

    For completeness, here is the solution using the aggregation framework*:

    db.coll.aggregate(
        [
            { "$match": { "fileName": /\.[0-9a-z]+$/i } },
            { "$group": { 
                "_id": null,
                "extensions":  { 
                    "$push": {
                        "$substr": [ 
                            "$fileName", 
                            { "$indexOfCP": [ "$fileName", "." ] }, 
                            -1 
                        ]
                    }
                }
            }}
        ])
    

    which produces:

    { 
        "_id" : null, 
        "extensions" : [ ".doc", ".txt", ".pdf", ".txt" ] 
    }
    

    *Requires the development version of MongoDB that was current as of this writing; $indexOfCP was later released in MongoDB 3.4.
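
    One thing to keep in mind with this pipeline: $indexOfCP returns the position of the first period, while the $match regex is anchored to the end of the name. A small plain-JavaScript sketch of the difference, using a made-up file name ("archive.tar.gz") and two hypothetical helpers that only imitate the two behaviours:

```javascript
// Imitates the $substr / $indexOfCP stage: slice from the FIRST period.
function fromFirstPeriod(fileName) {
  const idx = fileName.indexOf(".");
  return idx === -1 ? "" : fileName.slice(idx);
}

// Imitates the $match regex: only the trailing extension.
function trailingExtension(fileName) {
  const m = /\.[0-9a-z]+$/i.exec(fileName);
  return m === null ? "" : m[0];
}

console.log(fromFirstPeriod("archive.tar.gz"));   // ".tar.gz"
console.log(trailingExtension("archive.tar.gz")); // ".gz"
console.log(fromFirstPeriod("apple.doc"));        // ".doc" (same either way)
```

    For names with a single period, as in the question, both approaches give the same result.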

  • 2021-01-15 06:30

    Starting in MongoDB 4.2, the $regexFind aggregation operator makes things easier:

    // { _id : 12121, fileName: "apple.doc" }
    // { _id : 12125, fileName: "rap.txt" }
    // { _id : 12126, fileName: "tap.pdf" }
    // { _id : 12127, fileName: "cricket.txt" }
    // { _id : 12129, fileName: "oops" }
    db.collection.aggregate([
      { $set: { ext: { $regexFind: { input: "$fileName", regex: /\.\w+$/ } } } },
      { $group: { _id: null, extensions: { $addToSet: "$ext.match" } } }
    ])
    // { _id: null, extensions: [ ".doc", ".pdf", ".txt" ] }
    

    This makes use of:

    • The $set stage, which adds a new field to each of the documents.
    • This new field (ext) is the result of the $regexFind operator, which captures the result of a matching regex. If a match is found, it returns a document that contains information on the first match. If no match is found, it returns null. For instance:
      • For { fileName: "tap.pdf" }, it produces { ext: { match: ".pdf", idx: 3, captures: [] } }.
      • For { fileName: "oops" }, it produces { ext: null }.
    • Finally, using a $group stage, coupled with $addToSet on the match subfield, we can generate the list of distinct extensions.
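
    To get a feel for the result shape described above, it can be imitated in plain JavaScript. This is only a sketch: `regexFind` is a hypothetical helper, not MongoDB's implementation, and it reports `idx` in UTF-16 code units rather than code points, which makes no difference for ASCII file names.

```javascript
// Hypothetical helper imitating the shape of a $regexFind result:
// { match, idx, captures } for the first match, or null when nothing matches.
function regexFind(input, regex) {
  const m = regex.exec(input);
  if (m === null) return null;
  return {
    match: m[0],          // the matched substring
    idx: m.index,         // where the match starts in the input
    captures: m.slice(1), // capture groups, empty when the regex has none
  };
}

console.log(regexFind("tap.pdf", /\.\w+$/)); // { match: '.pdf', idx: 3, captures: [] }
console.log(regexFind("oops", /\.\w+$/));    // null
```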