Balancing KD-Tree: Which approach is more efficient?

I'm trying to balance a set of (Million +) 3D points using a KD-tree and I have two ways of doing it.

Way 1:

Use an O(n) algorithm to find the arraysize/2-th largest element along a given axis and store it at the current node
Iterate over all the elements in the vector and for each, compare them to the element I just found and put those smaller in newArray1, and those larger in newArray2
Recurse

Way 2:

Use quicksort O(nlogn) to sort all the elements in the array along a given axis, take the element at position arraysize/2 and store it in the current node.
Then put all the elements from index 0 to arraysize/2-1 in newArray1, and those from arraysize/2 to arraysize-1 in newArray2
Recurse

Way 2 seems more "elegant" but way 1 seems faster since the median search and the iterating are both O(n) so I get O(2n) which just reduces to O(n). But then at the same time, even though way 2 is O(nlogn) time to sort, splitting up the array into 2 can be done in constant time, but does it make up for the O(nlogn) time for sorting?

What should I do? Or is there an even better way to do this that I'm not even seeing?

How about Way 3:

Use an O(n) algorithm such as QuickSelect to ensure that the element at position length/2 is the correct element, all elements before are less, and all afterwards are larger than it (without sorting them completely!) - this is probably the algorithm you used in your Way 1 step 1 anyway...
Recurse into each half (except middle element) and repeat with next axis.

Note that you actually do not need to make "node" objects. You can actually keep the tree in a large array. When searching, start at length/2 with the first axis.

I've seen this trick being used by ELKI. It uses very little memory and code, which makes the tree quite fast.

Another way:

Sort for each of the dimensions: O(K N log N). This will be performed only once, we will utilize the sorted list on the dimensions.

For the current dimension, find the median in O(1) time, split for the median in O(N) time, split also the sorted arrays for each of the dimensions in O(KN) time, and recurse for the next dimension.

In that way, you will perform sorts at the beginning. And perform (K+1) splits/filterings for each subtree, for a known value. For small K, this approach should be faster than the other approaches.

Note: The additional space needed for the algorithm can be decreased by the tricks pointed out by Anony-Mousse.

Notice that if the query hyper-rectangle contains many points (all of them for example) it does not matter if the tree is balanced or not. A balanced tree is useful if the query hyper-rects are small.

来源：https://stackoverflow.com/questions/17021379/balancing-kd-tree-which-approach-is-more-efficient

标签

performance

algorithm

sorting

tree

kdtree