Why is insertion into my tree faster on sorted input than random input?

慢半拍i 2021-02-02 12:33

Now I've always heard binary search trees are faster to build from randomly selected data than ordered data, simply because ordered data requires explicit rebalancing to keep t

8 answers
  •  生来不讨喜
    2021-02-02 12:54

    Self-balancing trees exist to fix the problems associated with non-randomly-distributed data. By definition, they trade away a bit of best-case performance to vastly improve the worst-case performance of unbalanced BSTs, specifically on sorted input.

    You're actually overthinking this problem, because slower insertion of random data vs. ordered data is a characteristic of any balanced tree. Try it on an AVL and you'll see the same results.

    Cameron had the right idea, removing the priority check to force the worst case. If you do that and instrument your tree so you can see how many rebalances are happening for each insert, it actually becomes very obvious what's going on. When inserting sorted data, the tree always rotates left and the root's right child is always empty. Insertion always results in exactly one rebalance because the insertion node has no children and no recursion occurs. On the other hand, when you run it on the random data, almost immediately you start to see multiple rebalances happening on every insert, as many as 5 or 6 of them in the smallest case (50 inserts), because it's happening on subtrees as well.
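
To make the per-insert rebalance counts concrete, here is a minimal Python sketch. It assumes that "removing the priority check" means rotating unconditionally at every level on the way back out of the recursive insert; the original tree's code is not shown in this thread, so `Node`, `insert_always_rotate`, and `rotations_per_insert` are hypothetical names used only for illustration.

```python
import random


class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def rotate_left(node):
    # Promote the right child above `node`.
    pivot = node.right
    node.right = pivot.left
    pivot.left = node
    return pivot


def rotate_right(node):
    # Promote the left child above `node`.
    pivot = node.left
    node.left = pivot.right
    pivot.right = node
    return pivot


def insert_always_rotate(node, key, counter):
    """Insert `key` and rotate at every level on the way back up.

    Assumption: this models the tree with its priority check removed.
    `counter` is a one-element list used as a mutable rotation count.
    """
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert_always_rotate(node.left, key, counter)
        node = rotate_right(node)
    else:
        node.right = insert_always_rotate(node.right, key, counter)
        node = rotate_left(node)
    counter[0] += 1
    return node


def rotations_per_insert(keys):
    """Return a list with the number of rotations each insert performed."""
    root = None
    counts = []
    for key in keys:
        counter = [0]
        root = insert_always_rotate(root, key, counter)
        counts.append(counter[0])
    return counts


sorted_counts = rotations_per_insert(range(50))
random_keys = random.Random(1).sample(range(50), 50)
random_counts = rotations_per_insert(random_keys)
print("sorted, max rotations per insert:", max(sorted_counts))
print("random, max rotations per insert:", max(random_counts))
```

Under this assumption, every sorted insert after the first performs exactly one rotation, because the new key always lands in the root's empty right slot. On shuffled keys the count equals the depth at which the key was inserted, so it is at least one and usually more.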

    With priority checking turned back on, not only are rebalances typically less expensive, because more nodes are pushed into the left subtree (which they never leave, since no insertions happen there), but they are also less likely to occur. Why? Because in the treap, high-priority nodes float to the top, and the constant left-rotations (not accompanied by right-rotations) start to push all the high-priority nodes into the left subtree as well. The result is that rebalances happen less frequently due to the uneven distribution of probability.

    If you instrument the rebalancing code you'll see that this is true; for both the sorted and random input, you end up with almost identical numbers of left-rotations, but the random input also gives the same number of right-rotations, which makes for twice as many in all. This shouldn't be surprising - Gaussian input should result in a Gaussian distribution of rotations. You'll also see that there are only about 60-70% as many top-level rebalances for the sorted input, which perhaps is surprising, and again, that's due to the sorted input messing with the natural distribution of priorities.
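
One way to instrument the rotations is sketched below in Python. This is a generic max-heap treap with uniformly random priorities, not the tree from the question; `CountingTreap` and its counter attributes are hypothetical names, and the seeds and `N` are arbitrary.

```python
import random


class Node:
    def __init__(self, key, priority):
        self.key = key
        self.priority = priority
        self.left = None
        self.right = None


class CountingTreap:
    """Treap (higher priority nearer the root) that counts its rotations."""

    def __init__(self, seed=0):
        self.root = None
        self.rng = random.Random(seed)
        self.left_rotations = 0
        self.right_rotations = 0

    def _rotate_left(self, node):
        pivot = node.right
        node.right = pivot.left
        pivot.left = node
        self.left_rotations += 1
        return pivot

    def _rotate_right(self, node):
        pivot = node.left
        node.left = pivot.right
        pivot.right = node
        self.right_rotations += 1
        return pivot

    def insert(self, key):
        self.root = self._insert(self.root, key)

    def _insert(self, node, key):
        if node is None:
            return Node(key, self.rng.random())
        if key < node.key:
            node.left = self._insert(node.left, key)
            # Priority check: rotate only if the child outranks its parent.
            if node.left.priority > node.priority:
                node = self._rotate_right(node)
        else:
            node.right = self._insert(node.right, key)
            if node.right.priority > node.priority:
                node = self._rotate_left(node)
        return node

    def keys_in_order(self):
        # Iterative in-order walk, used only to sanity-check the tree.
        out, stack, node = [], [], self.root
        while stack or node is not None:
            while node is not None:
                stack.append(node)
                node = node.left
            node = stack.pop()
            out.append(node.key)
            node = node.right
        return out


N = 1000
sorted_treap = CountingTreap(seed=42)
for key in range(N):
    sorted_treap.insert(key)

random_keys = random.Random(7).sample(range(N), N)
random_treap = CountingTreap(seed=42)
for key in random_keys:
    random_treap.insert(key)

print("sorted: left", sorted_treap.left_rotations, "right", sorted_treap.right_rotations)
print("random: left", random_treap.left_rotations, "right", random_treap.right_rotations)
```

With ascending keys, every insert descends the right spine, so only left rotations can occur and the right-rotation counter stays at zero. With shuffled keys, inserts descend in both directions, so both counters can grow. The exact counts depend on the seeds and on how a given implementation assigns priorities.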

    You can also verify this by inspecting the full tree at the end of an insertion loop. With the random input, priorities tend to decrease fairly linearly by level; with the sorted input, priorities tend to stay very high until you get to one or two levels from the bottom.
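
A small helper for that kind of inspection might look like the following Python sketch. It assumes each node exposes `priority`, `left`, and `right` attributes (hypothetical names), and the three-node tree uses made-up integer priorities purely to show the shape of the output. What you see on a real tree will depend on how that implementation assigns priorities.

```python
from collections import deque


class Node:
    def __init__(self, key, priority, left=None, right=None):
        self.key = key
        self.priority = priority
        self.left = left
        self.right = right


def mean_priority_by_level(root):
    """Return the average node priority for each depth, root level first."""
    means = []
    level = deque([root] if root is not None else [])
    while level:
        total = 0
        count = len(level)
        # Consume exactly one level, queueing its children for the next pass.
        for _ in range(count):
            node = level.popleft()
            total += node.priority
            for child in (node.left, node.right):
                if child is not None:
                    level.append(child)
        means.append(total / count)
    return means


# Hand-built example tree; the priorities are illustrative, not measured.
root = Node(2, 8, left=Node(1, 4), right=Node(3, 2))
print(mean_priority_by_level(root))  # [8.0, 3.0]
```

Running the helper on the tree built from random input and again on the tree built from sorted input gives two lists that can be compared level by level.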

    Hopefully I've done a decent job explaining this... let me know if any of it is too vague.
