问题
I am looking for an efficient data structure, that'll allow me to cue events ... that is, i will be having an app, where at any time in execution, it is possible, that an event will be raised for a future point in execution ... something like:
- t=20: in 420 seconds, A occurs
- t=25: in 13 seconds, B occurs
- t=27: in 735 seconds, C occurs
- ...
so i'd like to have a data structure, where i can put in any event in any time in the future and where i can get and (by doing so) remove all due events ... also, a plus would be, if i were able to remove an event from the datastructure (because it was canceled) ... not too important though, since i can simply flag it as cancelled ...
my first thought was, maybe to do some sort of tree, but i guess the removing-due-events part requires a lot of rebalancing ...
i am considering simply having an int hash, mapping timestamps to either null or stacks of events that are to occur at that point in time ... i think in scenarios, with a lot of events (possibly multiple every second - which is what i intend to work with), this actually isn't such a bad idea after all ...
so i am eager to hear your input ... :)
edit:
- to be more specific: i think n here is at about 100K-1M, and i guess i might be having about 1-100 events/second ...
- the t is of no special importance ... it is only to illustrate that a future event can be "enqueued" at any time ...
thanks
back2dos
回答1:
If your events have a well defined upper limit (e.g. no events later than 2 days in the future), you can simply have an array indexed by # of seconds from "beginning of time". The value of the array is a list of events at that offset.
Listing or Removal is very efficient - simply find the offset for the time where you wish to list or cut off and get or re-initialize the arrays pointed to by indices after that offset.
If your events can stretch out indefinitely into the future, then your own idea of using a hashmap from offsets to list of events is the best one, with a twist - have a sorted list (however you wish to implement it) of known offsets, that way you will have very efficient lookups (e.g. you won't have to loop over every key ion the map).
You don't need to delete anything from the list of known offsets so no issues with re-balancing - you merely delete from the arrays that hashmap points to.
Also, it seems unclear from your question whether there's any need to know "t" - the time when the event was raised. If you need to know it, store it as part of the event. but the reference to when the event should happen should all be absolute in relation to some starting point (if it's a hashmap with unbounded range, you can use epoch seconds, and if events have bounds like in the first array solution I listed, you should instead use "# of seconds since beginning of range" - e.g. from start of day yesterday.
回答2:
I believe you're looking for a Priority Queue with the timestamp of when the event occurs being the priority (well, lower timestamps would be higher priority)
Just a little elucidation with your use cases:
... where i can put in any event in any time in the future...
You'd insert into the priority queue with insertWithPriority, using the timestamp of when the event occurs. This would be O(lgN)
... and where i can get and (by doing so) remove all due events ...
You'd repeatedly call getTop (gets event with the lowest timestamp) collecting all elements up to the time of interest.
... also, a plus would be, if i were able to remove an event from the datastructure (because it was canceled) ... not too important though, since i can simply flag it as cancelled ..
This would be possible, but would be O(lgN) due to rebalancing.
回答3:
How big is N? How often do you have to insert & remove items, compared to everything else going on? If this is more than 10% of total execution time, and if N is typically more than 100 (say) maybe, maybe it's time to be concerned about big-O. I've seen programs with priority queues implemented with fancy container algorithms, allocating iterators, hash maps, heaps, etc. and spending all their time in creating and releasing abstract objects, where the median queue length was like three.
ADDED: OK, since N ~ 10^6, and the frequency is ~ 100hz, you probably want some sort of binary tree or heap with O(log(N)) insertion/removal time. If you're willing to devote 1% of CPU time to this, that is 10^6 microseconds * 1% / 100 = 10^2 microseconds / operation. That should not be at all difficult, because if typical search depth is 20, at ~50ns per comparison, that is ~1 microsecond to do the search. Just make sure to keep it simple, not getting all wrapped up in abstract datatypes. You shouldn't have to worry much about time spent in allocating/freeing tree nodes because you only allocate/free one node per operation. Rebalancing need not be done frequently, like maybe only after every 1000 operations. If you can collect your insertions in batches, and then insert them in random order, that may prevent the tree from getting too unbalanced. If many of your events are simultaneous, you could add a small amount of noise to the time code, to prevent parts of the tree from becoming more like a linear list.
回答4:
Ok, I'd like to thank you all for your answers - very interesting and helpful. :)
PriorityQueue is definitely the right term I was searching for - thanks for that. Now it's all about implementation.
Here is what I think:
Let N be the size of the queue and M be the average amount of events per timestamp ("concurrent" events so to speak) at the time of processing (the density of events will not be evenly distributed, the "far future" beeing much more sparse, but as time moves on, this area of time becomes much more dense (actually, I think the maximum density will be somewhere in the 4 to 12 hours future)). I am looking for a scalable solution, that performs well for considerably big M. The goal is to really process those M due events within one second, so I wanna spend the least time possible on finding them.
- Going for the simple tree approach, as suggested several times, I'll be having O(log N) insertion, which is quite good, I guess. The cost of processing one timestamp would be O(M*log N), if I am right, which is not so good anymore.
- An alternative would be, to have a tree with lists of events instead of single events. it should be feasible to implement some getlistForGivenStampAndCreateIfNoneExists-operation that'd be a little faster than going down the tree twice if no list exists. But anyway, as M grows, this shouldn't even matter too much. Thus insertion would be O(log N), as before, and processing would be at O(M+log N), which is also good, I think.
- The hash-of-lists-of-events approach, I formulated. This also should have O(1) insertion and O(M) processing cost, although this is not too trivial with hashes. Sounds cool, actually. Or am I missing something? Of course it is not so easy to make a hash perform well, but apart from that, are there any problems? Or is the hash the problem? Wikipedia states:
"In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at constant average (indeed, amortized) cost per operation."
A quick benchmark showed that the standard implementation for my platform seems to match this. - The array-of-lists-of-events approach provided by DVK. This has O(1) insertion. Now that is good. But if I understand correctly, it has O(M+T) processing cost, with T being the size of the array (the number of time slots if you will), because removal from arrays comes at linear cost. Also, this only works if there is a maximum time offset.
Actually, I would like to discuss the array approach. O(M+T) is not good. Not at all. But I put some brains into it, and this is what I came up with:
First Idea: Lazyness
The O(T) could be crunched down by an arbitrary factor, introducting a bit of lazyness, but in the end it'd stay O(T). But how bad is that? Let's have T=2419200, which is 28 days. And then, once a day I'd clean it up (preferably while low load is expected). That'd waste less than 5% of the array. On my target platform, the copy operation takes 31msecs on a fairly old 2GHz core, so it doesn't seem such a bad idea after all.
Second Idea: Chunks
After thinking a little, I thought of this solution: a hash-of-intervals, an interval (I.e. given time frame) in turn being an array-of-lists-of-events. the intervals are all of equal sizes, preferably something simple, like days or maybe hours.
For insertion, I lookup the right interval through the hash (create if none exists), and in the interval, the right list-of-events (again create if none exists) and then just insert it, which is O(1).
For processing, I simply take the current interval, and process due events, by processing the currently due list-of-events, and then disposing it. The array stays of constant length, so we are at O(M) (which is quite the best you can get for processing M elements). Once the current interval is entirely processed (thus if the interval now represents the "past"), I simply dispose it at O(1). I can keep an extra reference to the current interval, eliminating the need to look it up, but I guess this doesn't provide any noticable improvement.
It seems to me, the second optimization is really the best solution, since it is fast and unbound. Choosing a good size for the intervals allows optimizing memory overhead vs. hash lookup overhead. I don't know, whether i should worry about the the hash lookup time at all. For high M, it shouldn't really matter, should it? Thus I'd choose an interval size of 1, which leads me back to approach number 3.
I'd be really greatful for any input on that.
来源:https://stackoverflow.com/questions/1503946/lookup-structure-for-handling-future-events-time-based