Finding most commonly used word in a string field throughout a collection

Submitted by ☆樱花仙子☆ on 2019-12-12 12:12:14

Question


Let's say I have a Mongo collection similar to the following:

[
  { "foo": "bar baz boo" },
  { "foo": "bar baz" },
  { "foo": "boo baz" }
]

Is it possible to determine which words appear most often within the foo field (ideally with a count)?

For instance, I'd love a result set of something like:

[
  { "baz" : 3 },
  { "boo" : 2 },
  { "bar" : 2 }
]

Answer 1:


A JIRA issue was recently closed for a $split operator that can be used in the $project stage of the aggregation framework.
With that in place, you could create a pipeline like this:

db.yourColl.aggregate([
    {
        $project: {
            words: { $split: ["$foo", " "] }
        }
    },
    {
        $unwind: {
            path: "$words"
        }
    },
    {
        $group: {
            _id: "$words",
            count: { $sum: 1 }
        }
    }
])

The result would look like this:

/* 1 */
{
    "_id" : "baz",
    "count" : 3.0
}

/* 2 */
{
    "_id" : "boo",
    "count" : 2.0
}

/* 3 */
{
    "_id" : "bar",
    "count" : 2.0
}
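
To make the stages concrete, here is a minimal plain-JavaScript sketch that mirrors what $split, $unwind and $group compute, run against the sample documents held in memory rather than in a collection. The helper name `countWords` is hypothetical, and the final sort by count is an addition that the pipeline above does not perform.

```javascript
// Sample documents from the question, held in memory instead of a collection.
var docs = [
    { foo: "bar baz boo" },
    { foo: "bar baz" },
    { foo: "boo baz" }
];

// Hypothetical helper: mirrors $split + $unwind + $group, then sorts.
function countWords(documents, field) {
    var counts = {};
    documents.forEach(function (doc) {
        // $split: break the string on single spaces.
        doc[field].split(" ").forEach(function (word) {
            // $unwind + $group: one increment per word occurrence.
            counts[word] = (counts[word] || 0) + 1;
        });
    });
    // Not part of the pipeline above: order by count, highest first.
    return Object.keys(counts)
        .map(function (word) { return { _id: word, count: counts[word] }; })
        .sort(function (a, b) { return b.count - a.count; });
}

var wordCounts = countWords(docs, "foo");
console.log(wordCounts[0]); // the most frequent word and its count
```

In the aggregation pipeline itself, the same ordering could presumably be obtained by appending a `{ $sort: { count: -1 } }` stage after $group.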



Answer 2:


The best way to do this in MongoDB 3.4 is to use the $split operator to split your string into an array of substrings, as mentioned here. Because we need to $unwind the array further down the pipeline, we do this in a sub-pipeline using the $facet operator for maximum efficiency.

db.collection.aggregate([
    { "$facet": { 
        "results": [ 
            { "$project": { 
                "values": { "$split": [ "$foo", " " ] }
            }}, 
            { "$unwind": "$values" }, 
            { "$group": { 
                "_id": "$values", 
                "count": { "$sum": 1 } 
            }} 
        ]
    }}
])

which produces:

{
    "results" : [
        {
            "_id" : "boo",
            "count" : 2
        },
        {
            "_id" : "baz",
            "count" : 3
        },
        {
            "_id" : "bar",
            "count" : 2
        }
    ]
}

In MongoDB 3.2 and earlier, the only way to do this is with mapReduce.

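// Map: emit one { word: count } object per document, all under the same key.
var mapFunction = function() {
    var counts = {};
    var words = this.foo.split(" ");
    for (var i = 0; i < words.length; i++) {
        counts[words[i]] = (counts[words[i]] || 0) + 1;
    }
    emit(null, counts);
};

// Reduce: merge the per-document counts. The result has the same shape as
// the emitted values, so MongoDB can safely re-reduce partial results.
var reduceFunction = function(key, values) {
    var results = {};
    for (var counts of values) {
        for (var word in counts) {
            results[word] = (results[word] || 0) + counts[word];
        }
    }
    return results;
};

db.collection.mapReduce(
    mapFunction,
    reduceFunction,
    { "out": { "inline": 1 } }
)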

which returns:

{
    "results" : [
        {
            "_id" : null,
            "value" : {
                "bar" : 2,
                "baz" : 3,
                "boo" : 2
            }
        }
    ],
    "timeMillis" : 30,
    "counts" : {
        "input" : 3,
        "emit" : 3,
        "reduce" : 1,
        "output" : 1
    },
    "ok" : 1
}

You should consider using the .forEach() method in the reduce function if your MongoDB version doesn't support the for...of statement.
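
A minimal sketch of what a `.forEach()`-based reduce function could look like, written so it can be tried outside MongoDB. It assumes each emitted value is an object mapping words to counts (an assumption of this sketch), so the return value has the same shape as its inputs and can be reduced again. The name `reduceWithForEach` and the `emitted` sample array are hypothetical.

```javascript
// Hypothetical stand-in for the values a map function would emit,
// assuming one { word: count } object per document.
var emitted = [
    { bar: 1, baz: 1, boo: 1 },
    { bar: 1, baz: 1 },
    { boo: 1, baz: 1 }
];

// Same merging logic as a for...of loop, using only forEach.
var reduceWithForEach = function (key, values) {
    var results = {};
    values.forEach(function (counts) {
        Object.keys(counts).forEach(function (word) {
            results[word] = (results[word] || 0) + counts[word];
        });
    });
    return results;
};

var totals = reduceWithForEach(null, emitted);
console.log(totals);
```

Because the output has the same shape as each input, passing a partial result back in alongside further emitted values gives the same totals, which matters because MongoDB may call reduce more than once for the same key.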



Source: https://stackoverflow.com/questions/38750429/finding-most-commonly-used-word-in-a-string-field-throughout-a-collection
