MongoDB - What is the fastest way to get the latest value as-of a given date?

Submitted by 天涯浪子 on 2019-12-12 03:16:50

Question


I have a collection of measurements from different sources, which come at different frequencies.

How do I get the latest good value as-of a specific date, for any given subset of sources? (This is similar to pandas.Index.asof.)

To be clear, for some of these timeseries there might be no available value for the desired date, so I must find the most recent among the available dates that are lower than the query date.
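To make the as-of semantics concrete, here is a minimal in-memory sketch in Python. The `asof` helper and the sample series are hypothetical, and the dates are assumed to be sorted in ascending order. It uses a strict "lower than" bound to match the wording above; swapping in `bisect_right` would give an inclusive bound.

```python
from bisect import bisect_left
from datetime import date


def asof(dates, values, query_date):
    """Return the value at the latest date strictly before query_date, or None."""
    # bisect_left gives the count of entries that are < query_date
    i = bisect_left(dates, query_date)
    return values[i - 1] if i > 0 else None


# Hypothetical weekly series (Mondays) with made-up values
mondays = [date(2015, 9, 7), date(2015, 9, 14), date(2015, 9, 21)]
monday_values = [1.0, 2.0, 3.0]

# A Wednesday query falls back to the preceding Monday's value
latest = asof(mondays, monday_values, date(2015, 9, 16))
```

A query date earlier than every available date returns `None`, which is the "no available value" case.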

The timeseries could look like this:

{_id: new ObjectId(), source: "1stDayofMonth", date: new ISODate(<day1>), value: somevalue}
{_id: new ObjectId(), source: "Monday", date: new ISODate(<day1>), value: somevalue}
{_id: new ObjectId(), source: "daily", date: new ISODate(<day1>), value: somevalue}
// ...
{_id: new ObjectId(), source: "daily", date: new ISODate(<dayN>), value: somevalue}
{_id: new ObjectId(), source: "Wednesday", date: new ISODate(<dayN>), value: somevalue}
// and so on...

Given proper indexing (db.myCollection.createIndex({date:1, source:1})), how can I get the latest good value as-of a given queryDate, for any subset of sources?

This is how far I got, but this solution fails to return just 1 value per source (if you read the code, you'll see this would work when querying on just one source, but when querying on different ones it returns more than 1 value of the high frequency sources):

querySources = ['1stDayofMonth', 'Monday']    # as an example, let's say I want only these 2 sources
nSources = len(querySources)
# newest documents first, capped at nSources documents in total (not one per source)
cursor = db.myCollection.find({'source': {'$in': querySources}, 'date': {'$lt': queryDate}}).sort('date', -1).limit(nSources)

Any ideas?

Edit: I should have mentioned that the docs point to an aggregate-based solution, but aggregate might be very slow, and the collection is large enough that query times become long (say, querying 1000 sources, each with 10000 days of data).


Answer 1:


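You're getting more than one result for some sources because limit(nSources) caps the total number of documents returned, not the number per source. Since the results are sorted by date only, the high-frequency sources fill those slots first.

You have to use aggregate if you want to group by source, or you have to run one find() per source and then join the results.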

Solution using aggregate:

db.myCollection.aggregate([
{$match : {source: {$in: ["1stDayofMonth", "Monday"]}}},
{$match : {date: {$lt: queryDate}}},
{$sort : { date : -1 } },
{$group : {
    _id : "$source",
    date : {"$first" : "$date"},
    value : {"$first" : "$value"}   
    }}
])
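Since the question uses Python, the same pipeline can be sketched for pymongo. The helper name `build_asof_pipeline` is hypothetical, and the two `$match` stages are merged into one here, which selects the same documents. Because of the `$sort` on descending date, `$first` picks the newest document in each group.

```python
def build_asof_pipeline(sources, query_date):
    """Build an aggregation pipeline returning the newest doc per source before query_date."""
    return [
        # Keep only the requested sources and dates strictly before the query date
        {"$match": {"source": {"$in": list(sources)}, "date": {"$lt": query_date}}},
        # Newest first, so that $first below picks the latest document per source
        {"$sort": {"date": -1}},
        {"$group": {
            "_id": "$source",
            "date": {"$first": "$date"},
            "value": {"$first": "$value"},
        }},
    ]
```

Assuming `db` is a pymongo database handle, it would be used as `db.myCollection.aggregate(build_asof_pipeline(querySources, queryDate))`.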

Solution using find():

// One query per source: newest document strictly before queryDate
curs1 = db.myCollection.find(
    {'source': "1stDayofMonth", 'date': {'$lt': queryDate}}
).sort({date: -1}).limit(1);

curs2 = db.myCollection.find(
    {'source': "Monday", 'date': {'$lt': queryDate}}
).sort({date: -1}).limit(1);

// Now add the result from each cursor to an array in your app
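The per-source approach can be wrapped in a loop. This is a minimal sketch, assuming `collection` is a pymongo `Collection` (or anything with a compatible `find_one`); the helper name `latest_per_source` is hypothetical.

```python
def latest_per_source(collection, sources, query_date):
    """Run one find_one per source and collect the newest doc before query_date."""
    results = {}
    for source in sources:
        # Equivalent to find(...).sort({date: -1}).limit(1) for a single source
        doc = collection.find_one(
            {"source": source, "date": {"$lt": query_date}},
            sort=[("date", -1)],
        )
        if doc is not None:  # a source may have no data before query_date
            results[source] = doc
    return results
```

Each lookup is an equality match on `source` followed by a sort on `date`, so a compound index with `source` first, such as `{source: 1, date: -1}`, is likely to serve it better than `{date: 1, source: 1}`. The cost is one round trip per source.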



Answer 2:


For the record, I found a way to do covered finds by adding one more field to each document:

If I add a "nextDate" field to each document, which contains the date of the next sequential document for that series, then I can do a covered query for max speed:

find( {'ind':{$in:[<sources>]},'date':{'$lte':queryDate}, 'nextDate':{'$gt':queryDate}},
      {'_id':0, 'nextDate':0} ).hint('my_index')

my_index is built on ind, date, nextDate, value.

This is space- and memory-intensive, but very fast.
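The answer does not show how nextDate gets populated. One way to precompute it in application code before inserting could look like the sketch below. The helper `add_next_date` and the `FAR_FUTURE` sentinel for the newest document of each series are assumptions, not part of the original answer; without some sentinel, the newest document would never match the `$gt` condition.

```python
from collections import defaultdict
from datetime import datetime

# Assumption: the newest doc of each series gets a far-future nextDate
FAR_FUTURE = datetime(9999, 1, 1)


def add_next_date(docs, key="ind"):
    """Return copies of docs with a nextDate field: the date of the next doc in the same series."""
    by_series = defaultdict(list)
    for doc in docs:
        by_series[doc[key]].append(doc)

    out = []
    for series in by_series.values():
        series.sort(key=lambda d: d["date"])
        # Pair each doc with its successor in the same series
        for cur, nxt in zip(series, series[1:]):
            out.append({**cur, "nextDate": nxt["date"]})
        # The newest doc has no successor yet
        out.append({**series[-1], "nextDate": FAR_FUTURE})
    return out
```

With this layout, exactly one document per series satisfies `date <= queryDate < nextDate`. The trade-off is that whenever a new measurement arrives, the previous document's nextDate has to be updated as well.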



Source: https://stackoverflow.com/questions/32508023/mongodb-what-is-the-fastest-way-to-get-the-latest-value-as-of-a-given-date
