Average Aggregation with String Timestamp

Question

I have records in a database as follows:

{
    "_id" : ObjectId("592d4f43d69b643ac0cb9149"),
    "timestamp" : "2017-03-01 17:09:00",
    "Technique-Meteo_Direction moyenne du vent_Mean value wind direction[]" : 0.0,
    "Technique-Meteo_Précipitations_Precipitation status[]" : 0.0,
    "Technique-Meteo_Direction du vent_Wind direction[]" : 0.0
}

{
    "_id" : ObjectId("592d3a6cd69b643ac0cae395"),
    "timestamp" : "2017-01-30 09:31:00",
    "Technique-Electrique_Prises de Courant_Power1[W]" : 14.0,
    "Technique-Electrique_Eclairage_Power2[W]" : 360.0,
    "Technique-Electrique_Electroménager_Power3[W]" : 0.0,
    "Technique-Electrique_VMC Aldes_Power4[W]" : 14.0,
    "Technique-Electrique_VMC Unelvent_Power5[W]" : 8.0
}

My timestamp is a plain string, which I would prefer not to touch because of the number of changes that would require in other algorithms. However, I would like to compute some averages. The other fields are sensor names with their measurements. I have one record per minute, and I would like to average these values over one hour, one day or one month.

Earlier, I wrote a query to count the number of existing values per month for one field:

countExistingPerMonth = client[page1.currentDB][page2.currentColl].find({"$and":[{"timestamp":{"$regex": regexExpression}}, {chosenSensor:{"$exists": True}}]}, temp_doc).count()

I used a $regex expression to find documents matching the chosen month.

Is there any way to do my average operations using this kind of method?

I tried the following. I also tried to use a regex expression inside the aggregation, but that did not work.

        self.sensorsStats = []
        for chosenSensor in self.chosenSensors:   
            countPerMonth = []
            years = []
            incre_year = int(page5.combo_startYear.get())
            if (incre_year<=int(page5.combo_endYear.get())):
                while(incre_year!=(int(page5.combo_endYear.get())+1)):
                    years.append(str(incre_year))
                    incre_year += 1

            for year in years:
                for month in ["01","02","03","04","05","06","07","08","09","10","11","12"]:
                    regexExpression = '^'+year+'-'+month+'-..'

                    test = client[page1.currentDB][page2.currentColl].aggregate([{"$match":{"timestamp":{"$regex": regexExpression}}}, {"$group":{"_id":chosenSensor, "average":{"$avg":{chosenSensor}}}}])

Answer 1:

Realistically you "should" fix the timestamp strings here. But they are at least in "lexical order", thanks to the "yyyy-mm-dd hh:mm:ss" layout, which puts the components in the same order as ISO date strings.

Since they also have a fixed length, we can group on them server side using the aggregation framework.

Sampling the month of May for date selection:

cursor = client[page1.currentDB][page2.currentColl].aggregate([
  { "$match": {
     "Technique-Meteo_Direction moyenne du vent_Mean value wind direction[]":
       { "$exists": True },
     "timestamp": {
       "$gte": "2017-05-01 00:00:00", "$lt": "2017-06-01 00:00:00"
     }
  }},
  { "$group": {
    "_id": {
      "$substr": [ "$timestamp", 0, 10 ]
    },
    "average":
      { "$avg": "$Technique-Meteo_Direction moyenne du vent_Mean value wind direction[]" }
  }}
])

This would get the average "per day" for each day in the selected month. It relies on the lexical ordering of the timestamp strings, and the same basic principle applies to all intervals. You simply fill the range strings with zero values below the interval you want for the selection.

The same goes for the "grouping key" here, where the value for _id should similarly be the substring up to the required interval. Fortunately the string format is "zero padded", so values less than "10" are preceded by a zero, as in "05". Again this maintains the lexical order for "ranges".

That is what you should be aiming for. I presume your code would substitute the chosen sensor field here, as well as generate the timestamp strings for the range selection.

But you certainly gain something by being able to $group on the $substr of the actual value to indicate your required interval. You no longer need to issue a separate query for each interval, and can just let the database do it for you.
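The range bounds and the grouping-key length described above can be derived together. The following is only a sketch: the helper name build_average_pipeline and the PREFIX_LENGTH table are made up for illustration, and the added $sort stage is an assumption about how the results would be consumed. It builds the pipeline as plain Python data, which can then be passed to aggregate().

```python
from datetime import datetime

# Number of leading characters of "YYYY-MM-DD HH:MM:SS" that identify
# each interval (hypothetical lookup table for this sketch).
PREFIX_LENGTH = {"month": 7, "day": 10, "hour": 13}


def build_average_pipeline(sensor, start, end, interval):
    """Build a pipeline averaging `sensor` per interval over [start, end)."""
    fmt = "%Y-%m-%d %H:%M:%S"  # same layout as the stored strings
    return [
        {"$match": {
            sensor: {"$exists": True},
            # String comparison works because the format is zero padded.
            "timestamp": {"$gte": start.strftime(fmt),
                          "$lt": end.strftime(fmt)},
        }},
        {"$group": {
            # Group on the timestamp prefix that identifies the interval.
            "_id": {"$substr": ["$timestamp", 0, PREFIX_LENGTH[interval]]},
            "average": {"$avg": "$" + sensor},
        }},
        # Assumption: results are wanted in chronological order.
        {"$sort": {"_id": 1}},
    ]


pipeline = build_average_pipeline(
    "Technique-Electrique_Eclairage_Power2[W]",
    datetime(2017, 1, 1), datetime(2017, 2, 1), "hour")
```

With the names from the question, the result would then be used as client[page1.currentDB][page2.currentColl].aggregate(pipeline).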

Your "keys" however are another issue, and since they are not consistent you seem stuck with iterating through the possible "key names" and performing a separate aggregation for all of them. You could possibly make the statement longer and get the "counts" and "sums" for each using $ifNull to determine when to increment. Then you would $divide "after" the $group pipeline stage to get the final "average".
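A rough sketch of that "sums and counts" idea follows. It is one possible reading, not a tested solution for your data: the function name build_multi_sensor_pipeline and the short output aliases (s0, s1, ...) are invented here, and the guard against dividing by a zero count is an addition.

```python
def build_multi_sensor_pipeline(sensors, start, end, prefix_length):
    """Build one pipeline averaging several sensor fields per interval.

    `start` and `end` are timestamp strings in the stored format.
    Returns the pipeline and a mapping from output alias to sensor name.
    """
    group = {"_id": {"$substr": ["$timestamp", 0, prefix_length]}}
    project = {}
    aliases = {}
    for index, sensor in enumerate(sensors):
        alias = "s%d" % index  # hypothetical short output name
        aliases[alias] = sensor
        field = "$" + sensor
        # Missing values contribute 0 to the sum...
        group[alias + "_sum"] = {"$sum": {"$ifNull": [field, 0]}}
        # ...and are not counted. A reading of 0.0 still counts.
        group[alias + "_count"] = {"$sum": {"$cond": [
            {"$eq": [{"$ifNull": [field, None]}, None]}, 0, 1]}}
        # Divide after grouping; yield null when nothing was counted.
        project[alias] = {"$cond": [
            {"$eq": ["$" + alias + "_count", 0]},
            None,
            {"$divide": ["$" + alias + "_sum", "$" + alias + "_count"]},
        ]}
    pipeline = [
        {"$match": {"timestamp": {"$gte": start, "$lt": end}}},
        {"$group": group},
        {"$project": project},
        {"$sort": {"_id": 1}},
    ]
    return pipeline, aliases


multi_pipeline, aliases = build_multi_sensor_pipeline(
    ["Technique-Electrique_Prises de Courant_Power1[W]",
     "Technique-Electrique_Eclairage_Power2[W]"],
    "2017-01-01 00:00:00", "2017-02-01 00:00:00", 10)
```

Note that $avg itself skips missing and non-numeric values, so one plain $avg accumulator per sensor in the $group stage may already be enough. The explicit version above just mirrors the approach described in the previous paragraph.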

That last bit is a bit complicated without knowing the full scope, and it's not all completely in your question. So I'll leave that up to you to work out, or ask a separate question about.

N.B. The $substr used here is deprecated as of MongoDB 3.4. The replacement operators are $substrBytes and $substrCP, and $substr is now an alias for $substrBytes. The two differ in how they measure "length": $substrBytes counts UTF-8 bytes, while $substrCP counts UTF-8 code points, as documented. Use whichever suits your data, but since the "timestamp" consists only of single-byte ASCII characters, both give the same result here.

Source: https://stackoverflow.com/questions/44694452/average-aggregation-with-string-timestamp

Tags
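To see why the choice does not matter for the timestamp but could matter for other strings, here is a small client-side check in plain Python (it does not touch MongoDB; the helper utf8_lengths is made up for this illustration):

```python
def utf8_lengths(text):
    """Return (UTF-8 byte count, code point count) for a string."""
    return len(text.encode("utf-8")), len(text)


timestamp = "2017-03-01 17:09:00"
# "Electroménager", taken from one of the sensor names in the question.
accented = "Electrom\u00e9nager"

print(utf8_lengths(timestamp))  # both counts are equal for pure ASCII
print(utf8_lengths(accented))   # the accented letter takes two bytes
```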

python

mongodb

aggregation-framework

pymongo