Is there a workaround to allow using a regex in the MongoDB aggregation pipeline?

Question

I'm trying to create a pipeline which will count how many documents match some conditions. I can't see any way to use a regular expression in the conditions though. Here's a simplified version of my pipeline with annotations:

db.Collection.aggregate([
    // Pipeline before the issue
    {'$group': {
        '_id': {
            'field': '$my_field', // Included for completeness
        },
        'first_count': {'$sum': {                    // We're going to count the number
            '$cond': [                               // of documents that have 'foo' in 
                {'$eq': ['$field_foo', 'foo']}, 1, 0 // $field_foo.
            ] 
        }},                                       

        'second_count': {'$sum': {                       // Here, I want to count the
            '$cond': [                                   // Number of documents where
                {'$regex': ['$field_bar', regex]}, 1, 0  // the value of 'bar' matches
            ]                                            // the regex 
        }},                                          
    }},
    // Additional operations
])

I know the syntax is wrong, but I hope this conveys what I'm trying to do. Is there any way to perform this match in the $cond operation? Or, alternatively, I'm also open to the possibility of doing the match somewhere earlier in the pipeline and storing the result in the documents so that I only have to match on a boolean at this point.

Answer 1:

This question seems to come up often without a satisfactory answer. I know of two possible solutions.

Solution 1: use mapReduce. mapReduce is the general form of aggregation; it lets the user do anything that can be programmed.

The following is a mongo shell solution using mapReduce. Consider the following 'st' collection.

db.st.find()

{ "_id" : ObjectId("51d6d23b945770d6de5883f1"), "foo" : "foo1", "bar" : "bar1" }
{ "_id" : ObjectId("51d6d249945770d6de5883f2"), "foo" : "foo2", "bar" : "bar2" }
{ "_id" : ObjectId("51d6d25d945770d6de5883f3"), "foo" : "foo2", "bar" : "bar22" }
{ "_id" : ObjectId("51d6d28b945770d6de5883f4"), "foo" : "foo2", "bar" : "bar3" }
{ "_id" : ObjectId("51d6daf6945770d6de5883f5"), "foo" : "foo3", "bar" : "bar3" }
{ "_id" : ObjectId("51d6db03945770d6de5883f6"), "foo" : "foo4", "bar" : "bar24" }

We want to group by foo and, for each foo, count the number of documents as well as the number of documents whose bar contains the substring 'bar2'. That is:

foo1: nbdoc=1, n_match = 0
foo2: nbdoc=3, n_match = 2
foo3: nbdoc=1, n_match = 0
foo4: nbdoc=1, n_match = 1

To do that, define the following map function:

var mapFunction = function() {
  var key = this.foo;
  var nb_match_bar2 = 0;
  if( this.bar.match(/bar2/g) ){
    nb_match_bar2 = 1;
  }
  var value = {
    count: 1,
    nb_match: nb_match_bar2
  };

  emit( key, value );
};
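
To see what this map function emits without a running server, here is a small stand-alone sketch in plain JavaScript. The `emit` stub and the `docs` array are assumptions made for illustration: in MongoDB, `emit` is provided by the server and `this` is bound to each document in turn. The map function itself is copied from above.

```javascript
// Illustrative stub: collect whatever the map function emits.
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Map function copied from the answer.
var mapFunction = function() {
  var key = this.foo;
  var nb_match_bar2 = 0;
  if (this.bar.match(/bar2/g)) {
    nb_match_bar2 = 1;
  }
  emit(key, { count: 1, nb_match: nb_match_bar2 });
};

// The six sample documents from the 'st' collection (without _id).
var docs = [
  { foo: 'foo1', bar: 'bar1' },
  { foo: 'foo2', bar: 'bar2' },
  { foo: 'foo2', bar: 'bar22' },
  { foo: 'foo2', bar: 'bar3' },
  { foo: 'foo3', bar: 'bar3' },
  { foo: 'foo4', bar: 'bar24' }
];

// Call the map function once per document, with `this` set to the document.
docs.forEach(function(doc) {
  mapFunction.call(doc);
});

emitted.forEach(function(e) {
  console.log(e.key, JSON.stringify(e.value));
});
```

Each document produces exactly one emitted value, and only the documents whose bar contains 'bar2' carry nb_match = 1.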

and the following reduce function:

var reduceFunction = function(key, values) {

  var reducedObject = {
    count: 0,
    nb_match:0
  };
  values.forEach( function(value) {
    reducedObject.count += value.count;
    reducedObject.nb_match += value.nb_match;
  }
  );
  return reducedObject;
};
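
One property worth checking: MongoDB may call the reduce function more than once for the same key and feed an earlier result back in as one of the values, so the returned object must have the same shape as the emitted values. The sketch below is plain JavaScript, not something run against a server. The `emitted` array is written out by hand as what the map function would emit for the three 'foo2' documents.

```javascript
// Reduce function copied from the answer.
var reduceFunction = function(key, values) {
  var reducedObject = { count: 0, nb_match: 0 };
  values.forEach(function(value) {
    reducedObject.count += value.count;
    reducedObject.nb_match += value.nb_match;
  });
  return reducedObject;
};

// Hand-written values for the 'foo2' documents (bar = 'bar2', 'bar22', 'bar3').
var emitted = [
  { count: 1, nb_match: 1 },
  { count: 1, nb_match: 1 },
  { count: 1, nb_match: 0 }
];

// Reduce everything in a single call.
var allAtOnce = reduceFunction('foo2', emitted);

// Reduce the first two values, then feed that partial result back in.
var partial = reduceFunction('foo2', emitted.slice(0, 2));
var inTwoSteps = reduceFunction('foo2', [partial, emitted[2]]);

console.log(JSON.stringify(allAtOnce));
console.log(JSON.stringify(inTwoSteps));
```

Both calls give the same totals, which is what lets the server split the work for one key across several reduce invocations.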

Run mapReduce and store the result in the collection map_reduce_result:

db.st.mapReduce(mapFunction, reduceFunction, {out:'map_reduce_result'})
{
  "result" : "map_reduce_result",
  "timeMillis" : 7,
  "counts" : {
    "input" : 6,
    "emit" : 6,
    "reduce" : 1,
    "output" : 4
  },
  "ok" : 1,
}

Finally, we can query the collection map_reduce_result. Voila, the solution:

> db.map_reduce_result.find()
{ "_id" : "foo1", "value" : { "count" : 1, "nb_match" : 0 } }
{ "_id" : "foo2", "value" : { "count" : 3, "nb_match" : 2 } }
{ "_id" : "foo3", "value" : { "count" : 1, "nb_match" : 0 } }
{ "_id" : "foo4", "value" : { "count" : 1, "nb_match" : 1 } }

Solution 2: use two separate aggregations and merge the results. I won't give details for this solution, as any mongo user can easily do it.

Step 1: run the aggregation, ignoring the part that requires a regex to sum.

Step 2: run a second aggregation, grouping on the same key as in step 1. Stage 1 of the pipeline: match the regular expression. Stage 2: group on the same key as in the first step and count the number of documents in each group with {$sum: 1}.

Step 3: merge the results of steps 1 and 2. For each key that appears in both results, add the new field; if the key is not present in the second result, set the new field to 0.

Voila! another solution.
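
The three steps of solution 2 can be sketched as follows, reusing the 'st' collection from solution 1. This is only an illustration: the names `totalPipeline`, `matchPipeline` and `mergeCounts` are hypothetical, and the `totals` and `matches` arrays are written out by hand as the results the two pipelines would be expected to return for the sample data. How the results of `db.st.aggregate(...)` are obtained as arrays depends on the MongoDB version and driver.

```javascript
// Step 1: count all documents per foo.
// In the mongo shell this would be passed to db.st.aggregate(totalPipeline).
var totalPipeline = [
  { $group: { _id: '$foo', nbdoc: { $sum: 1 } } }
];

// Step 2: keep only documents whose bar matches the regex, then count per foo.
var matchPipeline = [
  { $match: { bar: /bar2/ } },
  { $group: { _id: '$foo', n_match: { $sum: 1 } } }
];

// Step 3: merge the two result arrays on _id.
// Keys that are missing from the second result get n_match = 0.
function mergeCounts(totals, matches) {
  var matchById = {};
  matches.forEach(function(doc) {
    matchById[doc._id] = doc.n_match;
  });
  return totals.map(function(doc) {
    var hasMatch = Object.prototype.hasOwnProperty.call(matchById, doc._id);
    return {
      _id: doc._id,
      nbdoc: doc.nbdoc,
      n_match: hasMatch ? matchById[doc._id] : 0
    };
  });
}

// Hand-written stand-ins for the two aggregation results (not fetched from a server).
var totals = [
  { _id: 'foo1', nbdoc: 1 },
  { _id: 'foo2', nbdoc: 3 },
  { _id: 'foo3', nbdoc: 1 },
  { _id: 'foo4', nbdoc: 1 }
];
var matches = [
  { _id: 'foo2', n_match: 2 },
  { _id: 'foo4', n_match: 1 }
];

var merged = mergeCounts(totals, matches);
merged.forEach(function(doc) {
  console.log(JSON.stringify(doc));
});
```

The merge is done on the client here, so it costs two round trips to the server instead of one.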

Source: https://stackoverflow.com/questions/17458190/is-there-a-workaround-to-allow-using-a-regex-in-the-mongodb-aggregation-pipeline

Tags

regex

mongodb

MapReduce

aggregation-framework

pymongo