I know that it is possible to calculate the mean of a list of numbers in O(n). But what about the median? Is there any better algorithm than sorting (O(n log n)) and looking up the middle element?
Try a randomized algorithm: the sample size (e.g. 2000) is independent of the data size n, yet it can still achieve sufficiently high (99%) accuracy. If you need higher accuracy, just increase the sample size. A Chernoff bound can be used to prove the success probability for a given sample size. I've written some JavaScript code to implement the algorithm, feel free to take it. http://www.sfu.ca/~wpa10
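To make the sampling idea concrete, here is a minimal Python sketch (not the JavaScript code linked above, which I have not reproduced). The function name approximate_median and the rng parameter are my own assumptions, and the result is only an estimate: it is the exact median of the sample, not of the full data.

```python
import random
import statistics


def approximate_median(values, sample_size=2000, rng=None):
    """Estimate the median of `values` from a fixed-size random sample.

    Assumption: `values` is an indexable sequence (e.g. a list).
    The work done depends on `sample_size`, not on len(values).
    """
    if rng is None:
        rng = random.Random()
    n = len(values)
    if n <= sample_size:
        # Small input: the exact median is just as cheap.
        return statistics.median(values)
    # Sample uniformly with replacement, then take the sample's median.
    sample = [values[rng.randrange(n)] for _ in range(sample_size)]
    return statistics.median(sample)
```

With a sample of 2000 the rank of the returned value is typically within a couple of percent of the true middle; a larger sample_size tightens that at proportional cost.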
What you're talking about is a selection algorithm, where k = n/2. There is a method based on the same partitioning function used in quicksort. It is called, not surprisingly, quickselect. While it can, like quicksort, have an O(n^2) worst case, this can be brought down to linear time with proper pivot selection.
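A rough Python sketch of quickselect, for illustration. It uses a random pivot and builds new lists rather than partitioning in place, so it is simpler but uses more memory than the textbook version; expected time is O(n), worst case is still O(n^2). The helper name median and the choice of the lower middle element for even-length input are my assumptions.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element of `values` (0-based)."""
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    items = list(values)
    while True:
        pivot = random.choice(items)
        lows = [x for x in items if x < pivot]
        equal = sum(1 for x in items if x == pivot)
        if k < len(lows):
            # The answer is among the elements smaller than the pivot.
            items = lows
        elif k < len(lows) + equal:
            # The answer is the pivot itself (or a duplicate of it).
            return pivot
        else:
            # Skip everything <= pivot and continue in the larger part.
            k -= len(lows) + equal
            items = [x for x in items if x > pivot]


def median(values):
    """Lower median: the element at sorted position (n - 1) // 2."""
    return quickselect(values, (len(values) - 1) // 2)
```

Only one side of each partition is kept, which is what brings the expected cost down from O(n log n) to O(n).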
This link on calculating the median has popped up recently: http://matpalm.com/median/question.html .
In general I think you can't go beyond O(n log n) time, but I don't have any proof on that :). No matter how much you make it parallel, aggregating the results into a single value takes at least log n levels of execution.
Just for fun (and who knows, it may be faster) there's another randomized median algorithm, explained technically in Mitzenmacher and Upfal's book. Basically, you choose a polynomially smaller subset of the list such that (with some fancy bookwork) it probably contains the real median, and then use it to find the real median. The book is on google books, and here's a link. Note: I was able to read the pages of the algorithm, so assuming that google books reveals the same pages to everyone, you can read them too.
It is a randomized algorithm s.t. if it finds the answer, it is 100% certain that it is the correct answer (this is called Las Vegas style). The randomness arises from the runtime --- occasionally (with probability 1/(sqrt(n)), I think) it FAILS to find the median, and must be re-run.
Asymptotically, it is exactly linear when you take into account the chance of failure. That is to say, a single run is a wee bit less than linear, exactly such that when you take into account the number of times you may need to re-run it, it becomes linear.
Note: I'm not saying this is better or worse --- I certainly haven't done a real-life runtime comparison between these algorithms! I'm simply presenting an additional algorithm that has linear runtime, but works in a significantly different way.
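Here is my reading of that algorithm as a Python sketch, written from memory of the book rather than copied from it, so treat the constants (a sample of about n^(3/4) elements, a window of sqrt(n) around the sample's middle, a cap of 4 n^(3/4) candidates) as assumptions. The book assumes distinct values; the max_attempts parameter and the fallback to a full sort are my own additions so that inputs with many duplicates still terminate.

```python
import math
import random


def randomized_median(values, rng=None, max_attempts=10):
    """Las Vegas style median: any value returned is the exact (lower) median."""
    if rng is None:
        rng = random.Random()
    n = len(values)
    if n == 0:
        raise ValueError("median of empty sequence")
    k = (n - 1) // 2                      # 0-based rank of the lower median
    sample_size = math.ceil(n ** 0.75)
    for _ in range(max_attempts):
        # Sample with replacement and sort only the small sample.
        sample = sorted(rng.choice(values) for _ in range(sample_size))
        lo_idx = max(0, math.floor(sample_size / 2 - math.sqrt(n)))
        hi_idx = min(sample_size - 1, math.ceil(sample_size / 2 + math.sqrt(n)))
        d, u = sample[lo_idx], sample[hi_idx]
        # One linear pass: count elements below d, collect those in [d, u].
        below = sum(1 for x in values if x < d)
        candidates = [x for x in values if d <= x <= u]
        if not (below <= k < below + len(candidates)):
            continue                      # FAIL: median is outside [d, u]
        if len(candidates) > 4 * sample_size:
            continue                      # FAIL: too many candidates to sort cheaply
        candidates.sort()
        return candidates[k - below]
    # Assumption (not in the book): give up and sort after repeated failures.
    return sorted(values)[k]
```

The two sorts touch only about n^(3/4) elements each, so the linear passes over the input dominate the cost of a successful attempt.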
If the numbers are discrete (e.g. integers) and there is a manageable number of distinct values, you can use a "bucket sort" which is O(N), then iterate over the buckets to figure out which bucket holds the median. The complete calculation is O(N) in time and O(B) in space.
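A small Python sketch of the bucket idea, assuming the values are integers within a known range [min_value, max_value] (those parameter names and the function name are mine). It uses one bucket per distinct possible value and returns the lower middle element for even-length input.

```python
def bucket_median(values, min_value, max_value):
    """Median of integers in [min_value, max_value] via counting buckets.

    O(N) time for N values, O(B) space for B = max_value - min_value + 1.
    """
    counts = [0] * (max_value - min_value + 1)
    for v in values:
        counts[v - min_value] += 1
    k = (len(values) - 1) // 2            # 0-based rank of the lower median
    seen = 0
    # Walk the buckets in order until the running count passes rank k.
    for offset, count in enumerate(counts):
        seen += count
        if seen > k:
            return min_value + offset
    raise ValueError("median of empty sequence")
```

For example, bucket_median([5, 1, 3, 3, 9], 0, 10) walks the counts for 0, 1, 2, 3 and stops at 3, where the running count first exceeds rank 2.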
Yes. You can do it (deterministically) in O(n).
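The answer above does not name its method; one well-known deterministic linear-time approach is the "median of medians" selection algorithm, so the sketch below is my assumption about what is meant. It picks the pivot as the median of the medians of groups of five, which guarantees a constant fraction of elements is discarded each round. The function name is hypothetical, and this list-based version favors clarity over constant factors.

```python
def median_of_medians_select(values, k):
    """Return the k-th smallest element (0-based), deterministically."""
    items = list(values)
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    while len(items) > 5:
        # Median of each group of (up to) five elements.
        groups = [sorted(items[i:i + 5]) for i in range(0, len(items), 5)]
        medians = [g[len(g) // 2] for g in groups]
        # Recursively pick the median of those medians as the pivot.
        pivot = median_of_medians_select(medians, len(medians) // 2)
        lows = [x for x in items if x < pivot]
        equal = sum(1 for x in items if x == pivot)
        if k < len(lows):
            items = lows
        elif k < len(lows) + equal:
            return pivot
        else:
            k -= len(lows) + equal
            items = [x for x in items if x > pivot]
    # Five or fewer elements left: sorting is constant work.
    return sorted(items)[k]
```

Calling median_of_medians_select(data, (len(data) - 1) // 2) gives the lower median. In practice the randomized variants are usually faster, since the guaranteed pivot costs extra passes.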