Question
I have a collection of measurements from different sources, which come at different frequencies.
How do I get the latest good value as-of a specific date, for any given subset of sources? (this is similar to pandas.Index.asof)?
To be clear, for some of these timeseries there might be no available value for the desired date, so I must find the most recent among the available dates that are lower than the query date.
The timeseries could look like this:
{_id:new ObjectId(), source:"1stDayofMonth", date:new ISODate(<day1>) value:somevalue}
{_id:new ObjectId(), source:"Monday", date:new ISODate(<day1>) value:somevalue}
{_id:new ObjectId(), source:"daily", date:new ISODate(<day1>) value:somevalue}
/...
{_id:new ObjectId(), source:"daily", date:new ISODate(<dayN>) value:somevalue}
{_id:new ObjectId(), source:"Wednesday", date:new ISODate(<dayN>) value:somevalue}
// and so on...
Given proper indexation (db.myCollection.createIndex({date: 1, source: 1})), how can I get the latest good value as-of a given queryDate, for any subset of sources?
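For reference, the same index can be created from Python with pymongo; a minimal sketch, assuming a db handle obtained from MongoClient (the connection URI and database name below are placeholders, not from the question):

from pymongo import MongoClient, ASCENDING

# Hypothetical connection; adjust the URI and database name to your setup.
client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

# Compound index on (date, source), matching the shell command above.
db.myCollection.create_index([("date", ASCENDING), ("source", ASCENDING)])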
This is how far I got, but this solution fails to return just one value per source (if you read the code, you'll see it would work when querying a single source, but when querying several it returns more than one value for the high-frequency sources):
querySources = ['1stDayofMonth', 'Monday']  # as an example, let's say I want only these 2 sources
nSources = len(querySources)
cursor = db.myCollection.find({'source': {'$in': querySources}, 'date': {'$lt': queryDate}}).sort('date', -1).limit(nSources)
Any ideas?
Edit: I should have mentioned that the docs point to this solution, but aggregate might be very slow, and the collection is large enough that query times become long (say, 1000 sources queried, each with 10000 days of data).
Answer 1:
You're getting more than one result because nSources is larger than 1: limit() caps the whole result set, not the number of results per source. You have to use aggregate if you want to group by source, or run one find() per source and then join the results.
Solution using aggregate:
db.myCollection.aggregate([
    // keep only the requested sources
    {$match : {source: {$in: ["1stDayofMonth", "Monday"]}}},
    // keep only documents dated before the query date
    {$match : {date: {$lt: queryDate}}},
    // newest first, so $first picks the latest document per source
    {$sort : { date : -1 } },
    {$group : {
        _id : "$source",
        date : {"$first" : "$date"},
        value : {"$first" : "$value"}
    }}
])
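Since the question is working from Python, here is a rough pymongo sketch of the same pipeline; the db handle and queryDate are assumed to already exist, as in the question's own snippet:

query_sources = ['1stDayofMonth', 'Monday']

pipeline = [
    # restrict to the requested sources and to dates strictly before the query date
    {'$match': {'source': {'$in': query_sources}, 'date': {'$lt': queryDate}}},
    # newest first, so $first picks the most recent document per source
    {'$sort': {'date': -1}},
    # one output document per source: its latest date and value
    {'$group': {
        '_id': '$source',
        'date': {'$first': '$date'},
        'value': {'$first': '$value'},
    }},
]

latest_per_source = list(db.myCollection.aggregate(pipeline))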
Solution using find():
curs1 = db.myCollection.find({'source': "1stDayofMonth",
                              'date': {'$lt': queryDate}}).sort('date', -1).limit(1)
curs2 = db.myCollection.find({'source': "Monday",
                              'date': {'$lt': queryDate}}).sort('date', -1).limit(1)
# Now add the result from each cursor to an array in your app
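Generalizing that idea to an arbitrary subset of sources, a minimal pymongo sketch (the function and variable names are mine, not from the answer):

def latest_values_asof(db, sources, query_date):
    # One find() per source; each cursor yields at most the single newest
    # document dated strictly before query_date. Sources with no such
    # document are simply omitted from the result.
    results = {}
    for source in sources:
        cursor = (db.myCollection
                  .find({'source': source, 'date': {'$lt': query_date}})
                  .sort('date', -1)
                  .limit(1))
        for doc in cursor:
            results[source] = doc
    return results

# Example usage:
# latest = latest_values_asof(db, ['1stDayofMonth', 'Monday'], queryDate)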
Answer 2:
For the record, I found a way to do covered finds by adding one more field to each document: if I add a "nextDate" field containing the date of the next sequential document for that series, I can run a covered query for maximum speed:
find( {'source': {$in: [<sources>]}, 'date': {'$lte': queryDate}, 'nextDate': {'$gt': queryDate}},
      {'_id': 0, 'nextDate': 0} ).hint('my_index')
my_index is built on source, date, nextDate, value. Space- and memory-intensive, but very fast.
Source: https://stackoverflow.com/questions/32508023/mongodb-what-is-the-fastest-way-to-get-the-latest-value-as-of-a-given-date