I'm trying to understand a formula for when we should use quicksort (i.e., sort once and then binary search) instead of a plain linear search. For instance, we have an array with N = 1_000_000 elements. If we will search only once, a linear search is clearly cheaper; but if we will search many times, at some point it pays to sort first. How many searches does it take before sorting becomes worthwhile?
You want to solve an inequality that can roughly be described as

t * n > C * n * log(n) + t * log(n)

where t is the number of searches and C is some constant for the sort implementation (to be determined experimentally). Once you have estimated this constant, you can solve the inequality numerically (with some uncertainty, of course).
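For illustration, here is a minimal sketch of that numeric approach, assuming base-2 logarithms and a made-up constant C (you would measure C for your actual sort):

import math

def search_threshold(n, C=1.5):
    # Rearranging t*n > C*n*log(n) + t*log(n) gives
    # t > C*n*log(n) / (n - log(n)).
    log_n = math.log2(n)
    return C * n * log_n / (n - log_n)

print(search_threshold(1_000_000))  # sort once you expect more searches than this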
Assuming n elements and m searches, with crude approximations:

- the cost of the sort will be C0·n·log n,
- the cost of the m binary searches will be C1·m·log n,
- the cost of the m linear searches will be C2·m·n,

with C2 ~ C1 < C0.
Now you compare

C0·n·log n + C1·m·log n vs. C2·m·n

or

C0·n·log n / (C2·n - C1·log n) vs. m

For reasonably large n, the breakeven point is about C0·log n / C2. For instance, taking C0 / C2 = 5 and n = 1000000, this gives m = 100.
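As a quick sanity check, a small sketch that evaluates both the exact expression and the large-n approximation; the constants C0, C1, C2 are placeholders, not measured values:

import math

# Placeholder constants; in practice these come from benchmarks.
C0, C1, C2 = 5.0, 1.0, 1.0
n = 1_000_000

log_n = math.log2(n)
exact_breakeven = C0 * n * log_n / (C2 * n - C1 * log_n)
approx_breakeven = C0 * log_n / C2

print(exact_breakeven)   # ~100 searches
print(approx_breakeven)  # ~100 searches (large-n approximation)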
This actually turned into an interesting question for me as I looked into the expected runtime of a quicksort-like algorithm when the expected split at each level is not 50/50.
The first question I wanted to answer was: for random data, what is the average split at each level? It surely must be greater than 50% (for the larger subdivision). Well, given an array of size N of random values, the smallest value gives a subdivision of (1, N-1), the second smallest value gives a subdivision of (2, N-2), and so on. I put this in a quick script:
split = 0.0
for x in range(10000):
    # accumulate the fraction taken by the larger side of each possible split
    split += max(x, 10000 - x) / 10000
split /= 10000
print(split)
And got exactly 0.75 as an answer. I'm sure I could show that this is always the exact answer, but I wanted to move on to the harder part.
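(For what it's worth, a quick closed-form check backs that up: for even N,

sum_{x=0}^{N-1} max(x, N-x) = sum_{x=0}^{N/2-1} (N-x) + sum_{x=N/2}^{N-1} x = (3N^2/8 + N/4) + (3N^2/8 - N/4) = 3N^2/4

and dividing by N twice, as the script does, gives exactly 3/4.)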
Now, let's assume that even a 25/75 split follows an n·log(n) progression for some unknown logarithm base. That means num_comparisons(n) = n * log_b(n), and the question is to find b via statistical means (since I don't expect that model to be exact at every step). We can do this with a clever application of least-squares fitting after we use a logarithm identity to get:

C(n) = n * log(n) / log(b)

where now the logarithm can have any base, as long as log(n) and log(b) use the same base. This is a linear equation just waiting for some data! So I wrote another script that generated an array of xs filled with n*log(n), an array of ys filled with C(n), and used numpy to tell me the slope of the least-squares fit, which I expect to equal 1 / log(b). I ran the script and got b inside of [2.16, 2.3] depending on how high I set n (I varied n from 100 to 100'000'000). The fact that b seems to vary with n shows that my model isn't exact, but I think that's okay for this example.
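For the record, a minimal sketch of what such a fit could look like; here the comparison counts come from the idealized recurrence C(n) = n + C(n/4) + C(3n/4) (an assumption standing in for whatever the original script measured), so the fitted b won't necessarily land in the range quoted above:

import math
import numpy as np
from functools import lru_cache

@lru_cache(maxsize=None)
def comparisons(n):
    # Assumed cost model: n comparisons to partition, then recurse
    # on an idealized 25/75 split. Not a real quicksort measurement.
    if n <= 1:
        return 0
    smaller = max(1, n // 4)
    return n + comparisons(smaller) + comparisons(n - smaller)

sizes = [10 ** k for k in range(2, 8)]                        # n from 100 to 10,000,000
xs = np.array([n * math.log(n) for n in sizes])               # x = n * log(n)
ys = np.array([comparisons(n) for n in sizes], dtype=float)   # y = C(n)

# Least-squares fit of y = slope * x (a line through the origin).
slope = np.linalg.lstsq(xs.reshape(-1, 1), ys, rcond=None)[0][0]
b = math.exp(1.0 / slope)                                     # slope = 1 / ln(b)
print(b)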
To actually answer your question now, with these assumptions, we can solve for the cutoff point of when

N * n/2 = n*log_2.3(n) + N * log_2.3(n)

I'm just assuming that the binary search will have the same logarithm base as the sorting method for a 25/75 split. Isolating N you get:

N = n*log_2.3(n) / (n/2 - log_2.3(n))

If your number of searches N exceeds the quantity on the RHS (where n is the size of the array in question), then it will be more efficient to sort once and use binary searches on that sorted array.
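A tiny sketch of that cutoff, using log base 2.3 as assumed above (the base itself is the empirical estimate from the fit, not an exact constant):

import math

def searches_needed_to_justify_sort(n, b=2.3):
    # Cutoff from N = n*log_b(n) / (n/2 - log_b(n)).
    log_b_n = math.log(n, b)
    return n * log_b_n / (n / 2 - log_b_n)

print(searches_needed_to_justify_sort(1_000_000))  # roughly 2 * log_2.3(n) for large n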
You should plot the complexities of both operations.
Linear search: O(n)
Sort and binary search: O(n log n + log n)
In the plot, you will see for which values of n
it makes sense to choose the one approach over the other.
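If it helps, a rough sketch of such a plot, treating the constant factors as 1 (a simplification):

import numpy as np
import matplotlib.pyplot as plt

n = np.linspace(2, 1000, 500)
plt.plot(n, n, label="linear search: n")
plt.plot(n, n * np.log2(n) + np.log2(n), label="sort + binary search: n log n + log n")
plt.xlabel("n")
plt.ylabel("operations (constant factors ignored)")
plt.legend()
plt.show()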
Like you already pointed out, it depends on the number of searches you want to do. A good threshold can come out of the following statement:
n*log[b](n) + x*log[2](n) <= x*n/2
x is the number of searches; n the input size; b the base of the logarithm for the sort, depending on the partitioning you use.
When this statement evaluates to true, you should switch methods from linear search to sort and search.
Generally speaking, a linear search through an unordered array will take n/2 steps on average, though this average will only play a big role once x approaches n. If you want to stick with big Omicron or big Theta notation, you can omit the /2 in the above.
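As a small illustration, that statement can be checked directly; b is whatever base your partitioning gives you (2.3 is used here only as an example value):

import math

def should_sort_then_search(n, x, b=2.3):
    # True when n*log_b(n) + x*log_2(n) <= x*n/2
    return n * math.log(n, b) + x * math.log2(n) <= x * n / 2

print(should_sort_then_search(1_000_000, 10))   # few searches: keep linear search
print(should_sort_then_search(1_000_000, 100))  # many searches: sort first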