问题
NOTE: I am using a graph database (OrientDB to be specific). This gives me the freedom to write a server-side function in javascript or groovy rather than limit myself to SQL for this issue.*
NOTE 2: Since this is a graph database, the arrows below are simply describing the flow of data. I do not literally need the arrows to be returned in the query. The arrows represent relationships.*
I have data that is represented in a time-flow manner; i.e. EventC occurs after EventB which occurs after EventA, etc. This data is coming from multiple sources, so it is not completely linear. It needs to be congregated together, which is where I'm having the issue.
Currently the data looks something like this:
# | event | next
--------------------------
12:0 | EventA | 12:1
12:1 | EventB | 12:2
12:2 | EventC |
12:3 | EventA | 12:4
12:4 | EventD |
Where "next" is the out() edge to the event that comes next in the time-flow. On a graph this comes out to look like:
EventA-->EventB-->EventC
EventA-->EventD
Since this data needs to be congregated together, I need to merge duplicate events but preserve their edges. In other words, I need a select query that will result in:
-->EventB-->EventC
EventA--|
-->EventD
In this example, since EventB and EventD both occurred after EventA (just at different times), the select query will show two branches off EventA as opposed to two separate time-flows.
EDIT #2
If an additional set of data were to be added to the data above, with EventB->EventE, the resulting data/graph would look like:
# | event | next
--------------------------
12:0 | EventA | 12:1
12:1 | EventB | 12:2
12:2 | EventC |
12:3 | EventA | 12:4
12:4 | EventD |
12:5 | EventB | 12:6
12:6 | EventE |
EventA-->EventB-->EventC
EventA-->EventD
EventB-->EventE
I need a query to produce a tree like:
-->EventC
-->EventB--|
| -->EventE
EventA--|
-->EventD
EDIT #3 and #4
Here is the data with edges shown as opposed to the "next" column above. I also added a couple additional columns here to hopefully clear up any confusion about the data:
# | event | ip_address | timestamp | in | out |
----------------------------------------------------------------------------
12:0 | EventA | 123.156.189.18 | 2015-04-17 12:48:01 | | 13:0 |
12:1 | EventB | 123.156.189.18 | 2015-04-17 12:48:32 | 13:0 | 13:1 |
12:2 | EventC | 123.156.189.18 | 2015-04-17 12:48:49 | 13:1 | |
12:3 | EventA | 103.145.187.22 | 2015-04-17 14:03:08 | | 13:2 |
12:4 | EventD | 103.145.187.22 | 2015-04-17 14:05:23 | 13:2 | |
12:5 | EventB | 96.109.199.184 | 2015-04-17 21:53:00 | | 13:3 |
12:6 | EventE | 96.109.199.184 | 2015-04-17 21:53:07 | 13:3 | |
The data is saved like this to preserve each individual event and the flow of a session (labeled by the ip address).
TL;DR
Got lots of events, some duplicates, and need them all organized into one neat time-flow graph.
回答1:
Holy cow.
After wrestling with this for over a week I think I FINALLY have a working function. This isn't optimized for performance (oh the loops!), but gets the job done for the time being while I can work on performance. The resulting OrientDB server-side function (written in javascript):
The function:
// Clear previous runs
db.command("truncate class tmp_Then");
db.command("truncate class tmp_Events");
// Get all distinct events
var distinctEvents = db.query("select from Events group by event");
// Send 404 if null, otherwise proceed
if (distinctEvents == null) {
response.send(404, "Events not found", "text/plain", "Error: events not found" );
} else {
var edges = [];
// Loop through all distinct events
distinctEvents.forEach(function(distinctEvent) {
var newEvent = [];
var rid = distinctEvent.field("@rid");
var eventType = distinctEvent.field("event");
// The main query that finds all *direct* descendents of the distinct event
var result = db.query("select from (traverse * from (select from Events where event = ?) where $depth <= 2) where @class = 'Events' and $depth > 1 and @rid in (select from Events group by event)", [eventType]);
// Save the distinct event in a temp table to create temp edges
db.command("create vertex tmp_Events set rid = ?, event = ?", [rid, event]);
edges.push(result);
});
// The edges array defines which edges should exist for a given event
edges.forEach(function(edge, index) {
edge.forEach(function(e) {
// Create the temp edge that corresponds to its distinct event
db.command("create edge tmp_Then from (select from tmp_Events where rid = " + distinctEvents[index].field("@rid") + ") to (select from tmp_Events where rid = " + e.field("@rid") + ")");
});
});
var result = db.query("select from tmp_Events");
return result;
}
Takeaways:
- Temp tables appeared to be necessary. I tried to do this without temp tables (classes), but I'm not sure it could be done. I needed to mock edges that didn't exist in the raw data.
- Traverse was very helpful in writing the main query. Traversing through an event to find its direct, unique descendents was fairly simple.
- Having the ability to write stored procs in Javascript is freaking awesome. This would have been a nightmare in SQL.
- omfg loops. I plan to optimize this and continue to make it better so hopefully other people can find some use for it.
来源:https://stackoverflow.com/questions/29583040/how-to-write-a-select-query-or-server-side-function-that-will-generate-a-neat-ti