I need to extract a part of a string that matches a regex and return it.
I have a set of documents such as:
{"_id" : 12121, "fileName" : "apple.d
This is close to impossible in the aggregation pipeline. You want to project your matches and include only the part after the period, but there is not yet an operator that locates the position of the period. You need that position because $substr (https://docs.mongodb.com/manual/reference/operator/aggregation/substr/) requires a start index. In addition, $regex is only for matching; you cannot use it in a projection to replace text.
For now I think it is easier to do this in application code, where you can use a regex replace or any other facility your language provides.
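A minimal client-side sketch of that idea, assuming the documents have already been fetched into application code as plain objects. `extractExtension` is a hypothetical helper (not a MongoDB API), and the pattern assumes an extension is the final period followed by letters or digits:

```javascript
// Hypothetical helper (not part of MongoDB): returns the extension,
// including the leading period, or null when the name has none.
function extractExtension(fileName) {
  const match = /\.[0-9a-z]+$/i.exec(fileName);
  return match ? match[0] : null;
}

// Sample documents shaped like those in the question.
const docs = [
  { _id: 12121, fileName: "apple.doc" },
  { _id: 12125, fileName: "rap.txt" },
];

const extensions = docs.map((doc) => extractExtension(doc.fileName));
console.log(extensions); // [ '.doc', '.txt' ]
```

Returning null for names without an extension keeps that case distinguishable from an empty string; adjust to taste.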
It will be possible to do this in the upcoming version of MongoDB (as of this writing) using the aggregation framework and the $indexOfCP operator. Until then, your best bet is MapReduce:
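// Emit everything from the first period onward, keyed by _id.
var mapper = function() {
    emit(this._id, this.fileName.substring(this.fileName.indexOf(".")));
};
db.coll.mapReduce(
    mapper,
    // Each _id is emitted only once, so the reduce function is never invoked.
    function(key, value) {},
    { "out": { "inline": 1 } }
)["results"]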
Which yields:
[
    {
        "_id" : 12121,
        "value" : ".doc"
    },
    {
        "_id" : 12125,
        "value" : ".txt"
    },
    {
        "_id" : 12126,
        "value" : ".pdf"
    },
    {
        "_id" : 12127,
        "value" : ".txt"
    }
]
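One caveat worth noting: indexOf(".") finds the first period, so a name containing several periods yields everything after the first one. The sketch below, in plain JavaScript outside MongoDB, contrasts that with a lastIndexOf variant. The helper names and the file name "archive.tar.gz" are made up for illustration:

```javascript
// Same logic as the mapper: slice from the FIRST period.
function fromFirstPeriod(fileName) {
  return fileName.substring(fileName.indexOf("."));
}

// Variant: slice from the LAST period; returns "" when there is no period.
function fromLastPeriod(fileName) {
  const idx = fileName.lastIndexOf(".");
  return idx === -1 ? "" : fileName.substring(idx);
}

console.log(fromFirstPeriod("archive.tar.gz")); // ".tar.gz"
console.log(fromLastPeriod("archive.tar.gz"));  // ".gz"
// With no period, indexOf returns -1 and substring(-1) yields the whole name.
console.log(fromFirstPeriod("oops"));           // "oops"
```

If names without a period can occur, guarding against the -1 result (as the second helper does) avoids emitting the whole file name as an "extension".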
For completeness, here is the solution using the aggregation framework*:
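db.coll.aggregate([
    // Keep only documents whose fileName ends in an extension.
    { "$match": { "fileName": /\.[0-9a-z]+$/i } },
    { "$group": {
        "_id": null,
        "extensions": {
            "$push": {
                // A negative length makes $substr return the rest of the string.
                "$substr": [
                    "$fileName",
                    { "$indexOfCP": [ "$fileName", "." ] },
                    -1
                ]
            }
        }
    }}
])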
which produces:
{
    "_id" : null,
    "extensions" : [ ".doc", ".txt", ".pdf", ".txt" ]
}
*Requires the current development version of MongoDB (as of this writing).
Starting in MongoDB 4.2, the $regexFind aggregation operator makes this easier:
// { _id : 12121, fileName: "apple.doc" }
// { _id : 12125, fileName: "rap.txt" }
// { _id : 12126, fileName: "tap.pdf" }
// { _id : 12127, fileName: "cricket.txt" }
// { _id : 12129, fileName: "oops" }
db.collection.aggregate([
    { $set: { ext: { $regexFind: { input: "$fileName", regex: /\.\w+$/ } } } },
    { $group: { _id: null, extensions: { $addToSet: "$ext.match" } } }
])
// { _id: null, extensions: [ ".doc", ".pdf", ".txt" ] }
This makes use of:
- The $set stage, which adds a new field (ext) holding the result of the $regexFind operator. $regexFind applies the regex to the input string: if a match is found, it returns a document describing the first match; if no match is found, it returns null. For instance:
  - for { fileName: "tap.pdf" }, it produces { ext: { match: ".pdf", idx: 3, captures: [] } }
  - for { fileName: "oops" }, it produces { ext: null }
- The $group stage, which, coupled with $addToSet on the match subfield, generates the list of distinct extensions.
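To make the shape of these intermediate results concrete, here is a rough plain-JavaScript analogue of the two stages. It is only an illustration that runs outside MongoDB: `regexFind` is a hypothetical helper written for this sketch, and its `idx` is a UTF-16 code unit offset, whereas MongoDB reports a code point index (the two agree for the ASCII names used here).

```javascript
// Hypothetical stand-in for $regexFind: describes the first match, or null.
function regexFind(input, regex) {
  const m = regex.exec(input);
  if (m === null) return null;
  return { match: m[0], idx: m.index, captures: m.slice(1) };
}

const fileNames = ["apple.doc", "rap.txt", "tap.pdf", "cricket.txt", "oops"];

// Analogue of the $set stage: attach the result as `ext`.
const withExt = fileNames.map((fileName) => ({
  fileName,
  ext: regexFind(fileName, /\.\w+$/),
}));

// Analogue of $group with $addToSet on "$ext.match";
// documents without a match contribute nothing.
const extensions = new Set();
for (const doc of withExt) {
  if (doc.ext !== null) extensions.add(doc.ext.match);
}

console.log(withExt[2].ext);  // { match: '.pdf', idx: 3, captures: [] }
console.log(withExt[4].ext);  // null
console.log([...extensions]); // [ '.doc', '.txt', '.pdf' ]
```

A JavaScript Set keeps insertion order, while the order of elements produced by $addToSet is unspecified, so the ordering here should not be read as a guarantee about MongoDB's output.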