How can I decrease unwind stages in aggregation pipeline for nested documents?

滥情空心 2021-01-28 22:31

I am new to MongoDB and am trying to work with nested documents. I have a query as below:

    db.EndpointData.aggregate([
      { "$group" : { "_id" : "$EndpointId",

2 Answers
  •  说谎 (original poster)
    2021-01-28 23:11

    As long as your data has unique sensor and tag readings per document, which the samples you have presented so far appear to have, you simply don't need $unwind at all.

    In fact, all you really need is a single $group:

    db.endpoints.aggregate([
      // In reality you would $match to limit the selection of documents
      { "$match": { 
        "DateTime": { "$gte": new Date("2018-05-01"), "$lt": new Date("2018-06-01") }
      }},
      { "$group": {
        "_id": "$EndpointId",
        "FirstActivity" : { "$min" : "$DateTime" },
        "LastActivity" : { "$max" : "$DateTime" },
        "RequestCount": { "$sum": 1 },
        "TagCount": {
          "$sum": {
            "$size": { "$setUnion": ["$Tags.Uid",[]] }
          }
        },
        "SensorCount": {
          "$sum": {
            "$sum": {
              "$map": {
                "input": { "$setUnion": ["$Tags.Uid",[]] },
                "as": "tag",
                "in": {
                  "$size": {
                    "$reduce": {
                      "input": {
                        "$filter": {
                          "input": {
                            "$map": {
                              "input": "$Tags",
                              "in": {
                                "Uid": "$$this.Uid",
                                "Type": "$$this.Sensors.Type"
                              }
                            }
                          },
                          "cond": { "$eq": [ "$$this.Uid", "$$tag" ] }
                        }
                      },
                      "initialValue": [],
                      "in": { "$setUnion": [ "$$value", "$$this.Type" ] }
                    }
                  }
                }
              }
            }
          }
        }
      }}
    ])
    
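    The nested expression is easier to follow once it is read as ordinary code. The following is a rough plain-JavaScript mirror of what the TagCount and SensorCount expressions compute for one document before $sum adds them up per endpoint. The function name and the example document are hypothetical and only illustrate the logic; this is not how the server evaluates the pipeline.

```javascript
// Plain-JavaScript mirror of the per-document expressions in the $group stage.
// `doc` is assumed to have the shape of the sample documents:
// Tags[].Uid and Tags[].Sensors[].Type.
function countsForDocument(doc) {
  // Equivalent of { $setUnion: ["$Tags.Uid", []] }: the distinct tag Uids.
  const uniqueUids = [...new Set(doc.Tags.map(tag => tag.Uid))];

  // For each distinct Uid, union the sensor types of every tag entry that
  // carries that Uid ($filter, then $reduce with $setUnion), then take $size.
  const sensorCount = uniqueUids
    .map(uid => {
      const types = new Set();
      doc.Tags
        .filter(tag => tag.Uid === uid)
        .forEach(tag => tag.Sensors.forEach(sensor => types.add(sensor.Type)));
      return types.size;
    })
    .reduce((total, n) => total + n, 0);

  return { tagCount: uniqueUids.length, sensorCount };
}

// Hypothetical document: Uid "A" is repeated, Uid "B" appears once.
const exampleDoc = {
  Tags: [
    { Uid: "A", Sensors: [{ Type: 1 }, { Type: 2 }] },
    { Uid: "A", Sensors: [{ Type: 2 }, { Type: 3 }] },
    { Uid: "B", Sensors: [{ Type: 1 }] }
  ]
};

const exampleCounts = countsForDocument(exampleDoc);
console.log(exampleCounts); // { tagCount: 2, sensorCount: 4 }
```

    Uid "A" contributes the three distinct types 1, 2 and 3, and "B" contributes one, which is why the repeated tag entry does not inflate either count.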

    Or, if you actually do need to accumulate the "unique" values of "Sensors" and "Tags" across different documents, you still need initial $unwind statements to get the right grouping, but nowhere near as many as you presently have:

    db.endpoints.aggregate([
      // In reality you would $match to limit the selection of documents
      { "$match": { 
        "DateTime": { "$gte": new Date("2018-05-01"), "$lt": new Date("2018-06-01") }
      }},
      { "$unwind": "$Tags" },
      { "$unwind": "$Tags.Sensors" },
      { "$group": {
        "_id": {
          "EndpointId": "$EndpointId",
          "Uid": "$Tags.Uid",
          "Type": "$Tags.Sensors.Type"
        },
        "FirstActivity": { "$min": "$DateTime" },
        "LastActivity": { "$max": "$DateTime" },
        "RequestCount": { "$addToSet": "$_id" }
      }},
      { "$group": {
        "_id": {
          "EndpointId": "$_id.EndpointId",
          "Uid": "$_id.Uid",
        },
        "FirstActivity": { "$min": "$FirstActivity" },
        "LastActivity": { "$max": "$LastActivity" },
        "count": { "$sum": 1 },
        "RequestCount": { "$addToSet": "$RequestCount" }
      }},
      { "$group": {
        "_id": "$_id.EndpointId",
        "FirstActivity": { "$min": "$FirstActivity" },
        "LastActivity": { "$max": "$LastActivity" },
        "TagCount": { "$sum": 1 },
        "SensorCount": { "$sum": "$count" },
        "RequestCount": { "$addToSet": "$RequestCount" }
      }},
      { "$addFields": {
        "RequestCount": {
          "$size": {
            "$reduce": {
              "input": {
                "$reduce": {
                  "input": "$RequestCount",
                  "initialValue": [],
                  "in": { "$setUnion": [ "$$value", "$$this" ] }
                }
              },
              "initialValue": [],
              "in": { "$setUnion": [ "$$value", "$$this" ] }
            }
          }
        }
      }}
    ],{ "allowDiskUse": true })
    
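    The closing $addFields stage exists because every $group with $addToSet wraps RequestCount in one more array level. As a hedged illustration, the plain-JavaScript sketch below mimics the two $reduce passes; the helper name and the string ids standing in for ObjectId values are assumptions made for the example.

```javascript
// Mirror of:
//   { $reduce: { input, initialValue: [], in: { $setUnion: ["$$value", "$$this"] } } }
// Items are compared by their JSON text so that arrays compare by content.
function unionAll(arrays) {
  const seen = new Map();
  for (const inner of arrays) {
    for (const item of inner) {
      seen.set(JSON.stringify(item), item);
    }
  }
  return [...seen.values()];
}

// Hypothetical state after the third $group: an array (per endpoint) of
// arrays (per tag) of arrays (per sensor type) of source document ids.
const nestedRequestIds = [
  [["doc1"], ["doc1", "doc2"]], // a tag seen in both documents
  [["doc2"]]                    // a tag seen only in doc2
];

const oneLevelLeft = unionAll(nestedRequestIds); // inner $reduce
const uniqueIds = unionAll(oneLevelLeft);        // outer $reduce
console.log(uniqueIds.length); // 2
```

    Each pass removes one level of nesting and de-duplicates at the same time, so $size on the final array gives the number of distinct source documents.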

    From MongoDB 4.0 you can instead use $toString on the ObjectId in _id and keep RequestCount by merging those unique keys with $mergeObjects. This is cleaner and a bit more scalable than pushing nested array content and flattening it:

    db.endpoints.aggregate([
      // In reality you would $match to limit the selection of documents
      { "$match": { 
        "DateTime": { "$gte": new Date("2018-05-01"), "$lt": new Date("2018-06-01") }
      }},
      { "$unwind": "$Tags" },
      { "$unwind": "$Tags.Sensors" },
      { "$group": {
        "_id": {
          "EndpointId": "$EndpointId",
          "Uid": "$Tags.Uid",
          "Type": "$Tags.Sensors.Type"
        },
        "FirstActivity": { "$min": "$DateTime" },
        "LastActivity": { "$max": "$DateTime" },
        "RequestCount": {
          "$mergeObjects": {
            "$arrayToObject": [[{ "k": { "$toString": "$_id" }, "v": 1 }]]
          }
        }
      }},
      { "$group": {
        "_id": {
          "EndpointId": "$_id.EndpointId",
          "Uid": "$_id.Uid",
        },
        "FirstActivity": { "$min": "$FirstActivity" },
        "LastActivity": { "$max": "$LastActivity" },
        "count": { "$sum": 1 },
        "RequestCount": { "$mergeObjects": "$RequestCount" }
      }},
      { "$group": {
        "_id": "$_id.EndpointId",
        "FirstActivity": { "$min": "$FirstActivity" },
        "LastActivity": { "$max": "$LastActivity" },
        "TagCount": { "$sum": 1 },
        "SensorCount": { "$sum": "$count" },
        "RequestCount": { "$mergeObjects": "$RequestCount" }
      }},
      { "$addFields": {
        "RequestCount": {
          "$size": {
            "$objectToArray": "$RequestCount"
          }
        }
      }}
    ],{ "allowDiskUse": true })
    
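    The reason merging works as a distinct count is that an object cannot hold the same key twice. A minimal plain-JavaScript sketch of the idea follows; the `sourceId` field is a hypothetical stand-in for the result of `{ "$toString": "$_id" }`, and `Object.assign` plays the role of $mergeObjects.

```javascript
// Each unwound row contributes a one-key object such as { "<id string>": 1 }.
// Merging overwrites repeated keys, so the number of keys that remain is the
// number of distinct source documents.
function mergeRequestKeys(rows) {
  return rows.reduce(
    (merged, row) => Object.assign(merged, { [row.sourceId]: 1 }),
    {}
  );
}

// Hypothetical unwound rows: two came from the same source document.
const unwoundRows = [
  { sourceId: "5aef51dfaf42ea1b70d0c4db" },
  { sourceId: "5aef51dfaf42ea1b70d0c4db" },
  { sourceId: "5aef51e0af42ea1b70d0c4dc" }
];

const mergedKeys = mergeRequestKeys(unwoundRows);
// Mirror of { $size: { $objectToArray: "$RequestCount" } }.
const requestCount = Object.keys(mergedKeys).length;
console.log(requestCount); // 2
```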

    Either form returns the same data, though the order of keys in the result may vary:

    {
            "_id" : "89799bcc-e86f-4c8a-b340-8b5ed53caf83",
            "FirstActivity" : ISODate("2018-05-06T19:05:02.666Z"),
            "LastActivity" : ISODate("2018-05-06T19:05:02.666Z"),
            "RequestCount" : 2,
            "TagCount" : 4,
            "SensorCount" : 16
    }
    

    That result comes from these sample documents, which you gave in the original question on the topic:

    /* 1 */
    {
        "_id" : ObjectId("5aef51dfaf42ea1b70d0c4db"),    
        "EndpointId" : "89799bcc-e86f-4c8a-b340-8b5ed53caf83",    
        "DateTime" : ISODate("2018-05-06T19:05:02.666Z"),
        "Url" : "test",
        "Tags" : [ 
            {
                "Uid" : "C1:3D:CA:D4:45:11",
                "Type" : 1,
                "DateTime" : ISODate("2018-05-06T19:05:02.666Z"),
                "Sensors" : [ 
                    {
                        "Type" : 1,
                        "Value" : NumberDecimal("-95")
                    }, 
                    {
                        "Type" : 2,
                        "Value" : NumberDecimal("-59")
                    }, 
                    {
                        "Type" : 3,
                        "Value" : NumberDecimal("11.029802536740132")
                    }, 
                    {
                        "Type" : 4,
                        "Value" : NumberDecimal("27.25")
                    }, 
                    {
                        "Type" : 6,
                        "Value" : NumberDecimal("2924")
                    }
                ]
            },         
            {
                "Uid" : "C1:3D:CA:D4:45:11",
                "Type" : 1,
                "DateTime" : ISODate("2018-05-06T19:05:02.666Z"),
                "Sensors" : [ 
                    {
                        "Type" : 1,
                        "Value" : NumberDecimal("-95")
                    }, 
                    {
                        "Type" : 2,
                        "Value" : NumberDecimal("-59")
                    }, 
                    {
                        "Type" : 3,
                        "Value" : NumberDecimal("11.413037961112279")
                    }, 
                    {
                        "Type" : 4,
                        "Value" : NumberDecimal("27.25")
                    }, 
                    {
                        "Type" : 6,
                        "Value" : NumberDecimal("2924")
                    }
                ]
            },          
            {
                "Uid" : "E5:FA:2A:35:AF:DD",
                "Type" : 1,
                "DateTime" : ISODate("2018-05-06T19:05:02.666Z"),
                "Sensors" : [ 
                    {
                        "Type" : 1,
                        "Value" : NumberDecimal("-97")
                    }, 
                    {
                        "Type" : 2,
                        "Value" : NumberDecimal("-58")
                    }, 
                    {
                        "Type" : 3,
                        "Value" : NumberDecimal("10.171658037099185")
                    }
                ]
            }
        ]
    }
    
    /* 2 */
    {
        "_id" : ObjectId("5aef51e0af42ea1b70d0c4dc"),    
        "EndpointId" : "89799bcc-e86f-4c8a-b340-8b5ed53caf83",
        "DateTime" : ISODate("2018-05-06T19:05:02.666Z"),
        "Url" : "test",
        "Tags" : [ 
            {
                "Uid" : "E2:02:00:18:DA:40",
                "Type" : 1,
                "DateTime" : ISODate("2018-05-06T19:05:04.574Z"),
                "Sensors" : [ 
                    {
                        "Type" : 1,
                        "Value" : NumberDecimal("-98")
                    }, 
                    {
                        "Type" : 2,
                        "Value" : NumberDecimal("-65")
                    }, 
                    {
                        "Type" : 3,
                        "Value" : NumberDecimal("7.845424441900629")
                    }, 
                    {
                        "Type" : 4,
                        "Value" : NumberDecimal("0.0")
                    }, 
                    {
                        "Type" : 6,
                        "Value" : NumberDecimal("3012")
                    }
                ]
            }, 
            {
                "Uid" : "12:3B:6A:1A:B7:F9",
                "Type" : 1,
                "DateTime" : ISODate("2018-05-06T19:05:04.574Z"),
                "Sensors" : [ 
                    {
                        "Type" : 1,
                        "Value" : NumberDecimal("-95")
                    }, 
                    {
                        "Type" : 2,
                        "Value" : NumberDecimal("-59")
                    }, 
                    {
                        "Type" : 3,
                        "Value" : NumberDecimal("12.939770381907275")
                    }
                ]
            }
        ]
    }
    

    The bottom line is that you have two choices. You can use the first form given here, which accumulates "within each document" and then "per endpoint" in a single stage and is the most efficient. Or, if values such as the "Uid" on the tags or the "Type" on the sensors occur more than once across the documents grouped under an endpoint, you need to identify them across documents, which requires the unwinding forms.

    The sample data you have supplied so far only shows these values as "unique within each document", so the first form is the best choice if that holds for all remaining data.
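
    Whether that assumption holds can be checked against the real data. The pipeline below is only a sketch of one possible one-off check: it looks for any tag "Uid" that occurs in more than one document of the same endpoint. The variable name and the `documents` field are hypothetical, the `endpoints` collection name is taken from the examples above, and the check itself needs a single $unwind, so it is meant to be run once rather than as part of the regular query.

```javascript
// One-off check: does any (EndpointId, Uid) pair occur in two or more documents?
const crossDocumentUidCheck = [
  // In practice, start with the same $match as the main query.
  { $unwind: "$Tags" },
  { $group: {
    _id: { EndpointId: "$EndpointId", Uid: "$Tags.Uid" },
    documents: { $addToSet: "$_id" }
  }},
  // "documents.1" exists only when the set holds at least two document ids.
  { $match: { "documents.1": { $exists: true } } },
  // One hit is enough to show the assumption does not hold.
  { $limit: 1 }
];

// In the mongo shell:
//   db.endpoints.aggregate(crossDocumentUidCheck, { allowDiskUse: true })
```

    An empty result suggests the single $group form is safe for the selected range; any returned document means the unwinding forms are needed.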

    If it does not hold, then "unwinding" the two nested arrays in order to "aggregate the detail across documents" is the only way to approach this. You can limit the date range or add other criteria, since most "queries" have some bounds and do not work on the "whole" collection, but the main fact remains that unwinding an array creates essentially a copy of the document for every array member.
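
    To make the cost concrete, here is a small plain-JavaScript imitation of the two $unwind stages. It is an illustration under assumptions (the function name and the example document are made up), not the server's implementation.

```javascript
// Mirror of { $unwind: "$Tags" } followed by { $unwind: "$Tags.Sensors" }:
// one output row per (tag, sensor) pair, each repeating the parent fields.
function unwindTagsAndSensors(doc) {
  const rows = [];
  for (const tag of doc.Tags) {
    for (const sensor of tag.Sensors) {
      rows.push({ ...doc, Tags: { ...tag, Sensors: sensor } });
    }
  }
  return rows;
}

const smallDoc = {
  EndpointId: "endpoint-1", // hypothetical value
  Tags: [
    { Uid: "A", Sensors: [{ Type: 1 }, { Type: 2 }, { Type: 3 }] },
    { Uid: "B", Sensors: [{ Type: 1 }, { Type: 2 }] }
  ]
};

const unwoundRowCount = unwindTagsAndSensors(smallDoc).length;
console.log(unwoundRowCount); // 5
```

    One input document becomes five rows here, one per sensor entry, and every row carries its own copy of the top-level fields.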

    The point on optimization is that you only need to $unwind "twice", as there are only two arrays. Doing successive $group to $unwind to $group is always a sure sign you are doing something really wrong. Once you "take something apart" you should only ever need to "put it back together" once. Putting it back together in a series of graded steps, as demonstrated here, is the approach that optimizes.

    Outside the scope of your question, the following still apply:

    • Add other realistic constraints to the query to reduce the number of documents processed, and perhaps process them in "batches" and combine the results.
    • Add the allowDiskUse option to the pipeline so that temporary storage can be used (this is already shown on the two commands above that use $unwind).
    • Consider that "nested arrays" are probably not the best storage method for the analysis you want to do. When you know you will need to $unwind, it is always more efficient to write the data in that "unwound" form directly into a collection.
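
    As a rough sketch of the last point, the function below flattens one nested reading into one row per sensor value at write time. The output field names (`RequestId`, `SensorType`) and the `sensorReadings` collection mentioned in the comment are hypothetical choices for the example, not part of the original schema.

```javascript
// Turn one nested reading into flat rows, one per sensor value.
function toFlatRows(doc) {
  return doc.Tags.flatMap(tag =>
    tag.Sensors.map(sensor => ({
      EndpointId: doc.EndpointId,
      RequestId: doc._id,      // keeps the link to the original request
      DateTime: doc.DateTime,
      Uid: tag.Uid,
      SensorType: sensor.Type,
      Value: sensor.Value
    }))
  );
}

// Hypothetical reading with plain values standing in for ObjectId/ISODate.
const flatRows = toFlatRows({
  _id: "request-1",
  EndpointId: "endpoint-1",
  DateTime: "2018-05-06T19:05:02.666Z",
  Tags: [
    { Uid: "A", Sensors: [{ Type: 1, Value: -95 }, { Type: 2, Value: -59 }] },
    { Uid: "B", Sensors: [{ Type: 1, Value: -97 }] }
  ]
});
console.log(flatRows.length); // 3

// In the mongo shell this could be followed by something like:
//   db.sensorReadings.insertMany(flatRows)
```

    With rows stored this way, the per-endpoint counts can be grouped directly on `EndpointId`, `Uid` and `SensorType` without any $unwind stage.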
