Fast Algorithm to Quickly Find the Range a Number Belongs to in a Set of Ranges?

后端 未结 5 1596
情书的邮戳
情书的邮戳 2020-11-29 00:53

The Scenario

I have several number ranges. Those ranges are not overlapping - as they are not overlapping, the logical consequence is that no number can be part of

相关标签:
5条回答
  • 2020-11-29 01:31

    A balanced, sorted tree with ranges on each node seems to be the answer. I can't prove it's optimal, but if I were you I wouldn't look any further.

    0 讨论(0)
  • 2020-11-29 01:31

    As an alternative to O(log n) balanced binary search trees (BST), you could consider building a bitwise (compressed) trie. I.e. a prefix tree on the bits of the numbers you're storing.

    This gives you O(w)-search, insert and delete performance; where w = number of bits (e.g. 32 or 64 minus whatever power of 2 your ranges were based on).

    Not saying that it'll perform better or worse, but it seems like a true alternative in the sense it is different from BST but still has good theoretic performance and allows for predecessor queries just like BST.

    0 讨论(0)
  • 2020-11-29 01:32

    Create a sorted list and sort by the lower margin / start. That's easiest to implement and fast enough unless you have millions of ranges (and maybe even then).

    When looking for a range, find the range where start <= position. You can use a binary search here since the list is sorted. The number is in the range if position <= end.

    Since the end of any range is guaranteed to be smaller than start of the next range, you don't need to care about the end until you have found a range where the position might be contained.

    All other data structures become interesting when you get intersections or you have a whole lot of ranges and when you build the structure one and query often.

    0 讨论(0)
  • 2020-11-29 01:33

    After running various benchmarks, I came to the conclusion that only a tree like structure can work here. A sorted list shows of course good lookup performance - O(log n) - but it shows horribly update performance (inserts and removals are slower by more than the factor 10 compared to trees!).

    A balanced binary tree also has O(log n) lookup performance, however it is much faster to update, also around O(log n), while a sorted list is more like O(n) for updates (O(log n) to find the position for insert or the element to delete, but then up to n elements must be moved within the list and this is O(n)).

    I implemented an AVL tree, a red-black tree, a Treap, an AA-Tree and various variations of B-Trees (B means Bayer Tree here, not Binary). Result: Bayer trees almost never win. Their lookup is good, but their update performance is bad (as within each node of a B-Tree you have a sorted list again!). Bayer trees are only superior in cases where reading/writing a node is a very slow operation (e.g. when the nodes are directly read or written from/to hard disk) - as a B-Tree must read/write much less nodes than any other tree, so in such a case it will win. If we are having the tree in memory though, it stands no chance against other trees, sorry for all the B-Tree fans out there.

    A Treap was easiest to implement (less than half the lines of code you need for other balanced trees, only twice the code you need for an unbalanced tree) and shows good average performance for lookups and updates... but we can do better than that.

    An AA-Tree shows amazing good lookup performance - I have no idea why. They sometimes beat all other trees (not by far, but still enough to not be coincident)... and the removal performance is okay, however unless I'm too stupid to implement them correctly, the insert performance is really bad (it performs much more tree rotations on every insert than any other tree - even B-Trees have faster insert performance).

    This leaves us with two classics, AVL and RB-Tree. They are both pretty similar but after hours of benchmarking, one thing is clear: AVL Trees definitely have better lookup performance than RB-Trees. The difference is not gigantic, but in 2/3 out of all benchmarks they will win the lookup test. Not too surprising, after all AVL Trees are more strictly balanced than RB-Trees, so they are closer to the optimal binary tree in most cases. We are not talking about a huge difference here, it is always a close race.

    On the other hand RB Trees beat AVL Trees for inserts in almost all test runs and that is not such a close race. As before, that is expected. Being less strictly balanced RB Trees perform much less tree rotations on inserts compared to AVL Trees.

    How about removal of nodes? Here it seems to depend a lot on the number of nodes. For small node numbers (everything less than half a million) RB Trees again own AVL Trees; the difference is even bigger than for inserts. Rather unexpected is that once the node number grows beyond a million nodes AVL Trees seems to catch up and the difference to RB Trees shrinks until they are more or less equally fast. This could be an effect of the system, though. It could have to do with memory usage of the process or CPU caching or the like. Something that has a more negative effect on RB Trees than it has on AVL Trees and thus AVL Trees can catch up. The same effect is not observed for lookups (AVL usually faster, regardless how many nodes) and inserts (RB usually faster, regardless how many nodes).

    Conclusion:
    I think the fastest I can get is when using RB-Trees, since the number of lookups will only be somewhat higher than the number of inserts and deletions and no matter how fast AVL is on lookups, the overall performance will suffer from their worse insert/deletion performance.

    That is, unless anyone here may come up with a much better data structure that will own RB Trees big time ;-)

    0 讨论(0)
  • 2020-11-29 01:36

    If the total range of numbers is low, and you have enough memory, you could create a huge table with all the numbers.

    For example, if you have one million of numbers, you can create a table that references the range object.

    0 讨论(0)
提交回复
热议问题