问题
I have two data sets in google app engine datastore.
class First_Set(db.Model):
start_time = db.DateTimeProperty()
end_time = db.DateTimeProperty()
data1 = db.FloatProperty()
...
class Second_Set(db.Model):
start_time = db.DateTimeProperty()
end_time = db.DateTimeProperty()
data2 = db.FloatProperty()
...
(They have other different data that's why they're in different datasets.)
I'd like to find the datastore IDs all the overlapping start_time and end_time across two datasets, ideally without pulling results from one and iterating the first results over the other.
A great visualization of the initial dataset is from here (it also has the problem solved in SQL):
1 |-----|
2 |-----|
3 |--|
4 |-----|
5 |-----|
6 |---|
7 |---|
8 |---|
9 |-----|
End result I need is something in the tune of (from the same example):
+----+---------------------+----+---------------------+
| id | start | id | end |
+----+---------------------+----+---------------------+
| 2 | 2008-09-01 15:02:00 | 1 | 2008-09-01 15:04:00 |
| 5 | 2008-09-01 16:19:00 | 4 | 2008-09-01 16:23:00 |
| 8 | 2008-09-01 16:20:00 | 4 | 2008-09-01 16:22:00 |
| 8 | 2008-09-01 16:20:00 | 5 | 2008-09-01 16:22:00 |
| 7 | 2008-09-01 18:18:00 | 9 | 2008-09-01 18:22:00 |
+----+---------------------+----+---------------------+
SQL solution is described in the example as below but I couldn't do this in datastore because of lack of JOIN:
SELECT v1.id, v1.start, v2.id, LEAST(v1.end,v2.end) AS end
FROM visits v1
JOIN visits v2 ON v1.id <> v2.id and v1.start >= v2.start and v1.start < v2.end
ORDER BY v1.start;
I understand that one-to-many version of this is rather straightforward using a ListProperty() (from this question).
Can anyone think of a solution to find the overlapping times (ideally in Python)?
回答1:
Look into Marzullo's algorithm its time efficiency is O(n log n).
There are also many question on Stackoverflow that cover overlapping intervals which can be used to solve your problem on AppEngine.
回答2:
Posting my solution with no JOINs, thanks to Shay's direction. Should be able to find overlaps over any number of datasets with minor edits (at least that's the theory).
My Python isn't that great but below should give the idea:
from operator import itemgetter
class Find_Overlaps(webapp2.RequestHandler):
def get(self):
all_dates = []
first_dates = db.GqlQuery("SELECT * FROM First_Set")
for date in first_dates:
row = {'dataset':'First_Set', 'dbkey':date.key(), 'offset':date.start_time, 'type': -1}
all_dates.append(row)
row = {'dataset':'First_Set', 'dbkey':date.key(), 'offset':date.end_time, 'type': 1}
all_dates.append(row)
second_dates = db.GqlQuery("SELECT * FROM Second_Set")
for date in second_dates:
row = {'dataset':'Second_Set', 'dbkey':date.key(), 'offset':date.start_time, 'type': -1}
all_dates.append(row)
row = {'dataset':'Second_Set', 'dbkey':date.key(), 'offset':date.end_time, 'type': 1}
all_dates.append(row)
newlist = sorted(all_dates, key=itemgetter('offset','type'))
number_datasets = 2 #goal is to find overlaps in all sets not only the best overlaps, that's why this is needed
loopcnt = 0
update_bestend = 0
overlaps = []
for row in newlist: #Below is mostly from Marzullo's alghorithm
loopcnt = loopcnt - row['type']#this is to keep track of overall tally
if update_bestend == 1:
if loopcnt == (number_datasets - 1):
bestend = row['offset']
end_set = row['dataset']
end_key = row['dbkey']
overlaps.append({'start':beststart,'start_set':start_set,'start_key':start_key,'end':bestend,'end_set':end_set,'end_key':end_key})
update_bestend = 0
if loopcnt == number_datasets:
beststart = row['offset']
start_set = row['dataset']
start_key = row['dbkey']
update_bestend = 1
for overlap in overlaps: #just to see what the outcome is
self.response.out.write('start: %s, start_set: %s, end: %s, end_set: %s<br>' % (overlap['start'], overlap['start_set'], overlap['end'], overlap['end_set']))
来源:https://stackoverflow.com/questions/11161238/compare-many-date-ranges-in-google-app-engine-datastore-many-to-many-python