How to calculate or approximate the median of a list without storing the list


I'm trying to calculate the median of a set of values, but I don't want to store all the values as that could blow memory requirements. Is there a way of calculating or ap


10 answers

南方客

2020-11-28 22:32

If the input is known to fall within a certain range, say 1 to 1 million, it's easy to keep an array of counts, one per possible value. See the code for "quantile" and "ibucket" here: http://code.google.com/p/ea-utils/source/browse/trunk/clipper/sam-stats.cpp

This solution can be generalized as an approximation by coercing the input into an integer within some range, using a function that you then reverse on the way out, e.g. foo.push((int) input/1000000) and quantile(foo)*1000000.
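As a rough illustration of the array-of-counts idea, here is a minimal Python sketch. It is not the linked sam-stats.cpp code; the function name `median_from_counts` and the parameters `lo` and `hi` are made up for this example, and it assumes every value is an integer inside [lo, hi].

```python
def median_from_counts(values, lo, hi):
    """Lower median of integers in [lo, hi], using one counter per possible value."""
    counts = [0] * (hi - lo + 1)   # memory depends on the range, not on the input size
    n = 0
    for v in values:               # single pass; `values` can be any iterable or stream
        counts[v - lo] += 1
        n += 1
    if n == 0:
        raise ValueError("no values")
    target = (n + 1) // 2          # 1-based rank of the lower median
    seen = 0
    for offset, c in enumerate(counts):
        seen += c
        if seen >= target:
            return lo + offset
```

For an even number of items this returns the lower of the two middle values; averaging the two middle values instead would need one more lookup in the same array.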

If your input is an arbitrary double precision number, then you've got to autoscale your histogram as values come in that are out of range (see above).

Or you can use the median-triplets method described in this paper: http://web.cs.wpi.edu/~hofri/medsel.pdf

慢半拍i

2020-11-28 22:33

If the values are discrete and the number of distinct values isn't too high, you could just accumulate the number of times each value occurs in a histogram, then find the median from the histogram counts (just add up counts from the top and bottom of the histogram until you reach the middle). Or if they're continuous values, you could distribute them into bins - that wouldn't tell you the exact median but it would give you a range, and if you need to know more precisely you could iterate over the list again, examining only the elements in the central bin.
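The two-pass binning idea for continuous values might look like the sketch below. It assumes the data can be read twice (the hypothetical `make_iter` callable returns a fresh iterator each time), that all values lie in [lo, hi] with hi > lo, and that the central bin is small enough to sort in memory.

```python
def binned_median(make_iter, lo, hi, bins=1000):
    """Exact lower median via a histogram pass plus a pass over the central bin only."""
    width = (hi - lo) / bins

    def bin_of(x):
        # Clamp so that x == hi falls into the last bin.
        return min(int((x - lo) / width), bins - 1)

    # Pass 1: count how many values land in each bin.
    counts = [0] * bins
    n = 0
    for x in make_iter():
        counts[bin_of(x)] += 1
        n += 1
    if n == 0:
        raise ValueError("no values")

    # Walk the counts to find the bin that contains the median rank.
    target = (n + 1) // 2
    seen = 0
    for b, c in enumerate(counts):
        if seen + c >= target:
            break
        seen += c

    # Pass 2: keep only the elements of that bin and pick the right one.
    central = sorted(x for x in make_iter() if bin_of(x) == b)
    return central[target - seen - 1]
```

Stopping after the first pass gives the range answer described above: the median lies between `lo + b * width` and `lo + (b + 1) * width`.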

抹茶落季

2020-11-28 22:35
I don't think it is possible to compute the exact median without having the list in memory. You can, however, approximate it with:
- the average, if you know that the data is symmetrically distributed
- or the proper median of a small subset of the data (one that fits in memory), if you know that the data has the same distribution throughout the list (e.g. that the first item has the same distribution as the last one)

滥情空心

2020-11-28 22:36
Here is a crazy approach that you might try. This is a classical problem in streaming algorithms. The rules are:
1. You have limited memory, say O(log n), where n is the number of items in the stream.
2. You can look at each item only once and must decide then and there what to do with it: if you store it, it costs memory; if you throw it away, it is gone forever.
The idea for finding the median is simple. Sample O(1 / a^2 * log(1 / p)) * log(n) elements from the list at random; you can do this via reservoir sampling (see a previous question). Then simply return the median of your sampled elements, using a classical method.

The guarantee is that the returned item will lie at a fraction (1 +/- a) / 2 of the way through the sorted list, with probability at least 1-p. So there is a probability p of failing, which you can lower by sampling more elements. Note that it won't necessarily return the median, nor guarantee that the value of the returned item is numerically close to the median; it only guarantees that when you sort the list, the returned item will be close to the middle of the list.

This algorithm uses O(log n) additional space and runs in linear time.
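A minimal sketch of the sampling step, using the standard reservoir sampling scheme (Algorithm R). The function name `reservoir_median` and the sample size `k` are assumptions for illustration; choosing `k` from a and p according to the bound above is left to the caller.

```python
import random

def reservoir_median(stream, k, rng=random):
    """Approximate median: keep a uniform random sample of k items, return its median."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)  # item i survives with probability k / (i + 1)
            if j < k:
                sample[j] = x
    if not sample:
        raise ValueError("no values")
    sample.sort()
    return sample[(len(sample) - 1) // 2]  # lower median of the sample
```

If the stream holds no more than `k` items, the sample is the whole stream and the result is the exact lower median. Passing a seeded `random.Random` instance as `rng` makes runs reproducible.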

悲&欢浪女

2020-11-28 22:41
David's suggestion seems like the most sensible approach for approximating the median.

A running mean for the same problem is much easier to calculate:

M_n = M_{n-1} + ((V_n - M_{n-1}) / n)

Where M_n is the mean of the first n values, M_{n-1} is the previous mean, and V_n is the new value.

In other words, the new mean is the existing mean plus the difference between the new value and the mean, divided by the number of values.

In code this would look something like:
```
new_mean = prev_mean + ((value - prev_mean) / count)
```
though obviously you may want to consider language-specific stuff like floating-point rounding errors etc.
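Wrapped into something runnable, the update could look like this Python sketch. The generator name `running_mean` is made up for the example; it keeps only the current mean and a counter.

```python
def running_mean(stream):
    """Yield the mean of the values seen so far, one result per incoming value."""
    mean = 0.0
    for count, value in enumerate(stream, start=1):
        mean += (value - mean) / count   # M_n = M_{n-1} + (V_n - M_{n-1}) / n
        yield mean
```

For example, `list(running_mean([2, 4, 9]))` gives `[2.0, 3.0, 5.0]`.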

长发绾君心

2020-11-28 22:48
Find the Min and Max of the list containing N items through a linear scan and name them LowValue and HighValue. Let MedianIndex = (N+1)/2.

1st Order Binary Search:

Repeat the following 4 steps while LowValue < HighValue:
1. Get MedianValue approximately = ( HighValue + LowValue ) / 2
2. Get NumberOfItemsWhichAreLessThanorEqualToMedianValue = K
3. If K = MedianIndex, then return MedianValue
4. If K > MedianIndex, then HighValue = MedianValue, else LowValue = MedianValue
It needs no extra memory, though each repetition makes one pass over the list to count K.
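The first-order search above can be sketched in Python as follows. This is a variant rather than a literal transcription: it assumes the data can be re-read on every iteration (the hypothetical `make_iter` callable returns a fresh iterator), it narrows the range until it is smaller than an assumed tolerance `tol` instead of returning early on K = MedianIndex, and it stops if the floating-point range can no longer be split.

```python
def bisect_median(make_iter, tol=1e-9):
    """Approximate the lower median by bisecting on value, re-reading the data each step."""
    # One pass to find LowValue, HighValue and N.
    lo = hi = None
    n = 0
    for x in make_iter():
        lo = x if lo is None or x < lo else lo
        hi = x if hi is None or x > hi else hi
        n += 1
    if n == 0:
        raise ValueError("no values")

    target = (n + 1) // 2              # MedianIndex
    while hi - lo > tol:
        mid = (lo + hi) / 2            # MedianValue
        if mid <= lo or mid >= hi:     # range too narrow for floats to split further
            break
        k = sum(1 for x in make_iter() if x <= mid)   # K, one pass over the data
        if k >= target:
            hi = mid                   # the median is at or below mid
        else:
            lo = mid                   # the median is above mid
    return hi
```

The result is within `tol` of the lower median, and the number of passes grows with the logarithm of (HighValue - LowValue) / tol rather than with N.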

2nd Order Binary Search:

LowIndex=1 HighIndex=N

Repeat the following 5 steps while LowIndex < HighIndex:
1. Get Approximate DistributionPerUnit = (HighValue-LowValue)/(HighIndex-LowIndex)
2. Get Approximate MedianValue = LowValue + (MedianIndex-LowIndex) * DistributionPerUnit
3. Get NumberOfItemsWhichAreLessThanorEqualToMedianValue = K
4. If K = MedianIndex, then return MedianValue
5. If K > MedianIndex, then HighIndex=K and HighValue=MedianValue, else LowIndex=K and LowValue=MedianValue
It should converge faster than the 1st order search, again without consuming extra memory.

We can also think of fitting HighValue, LowValue and MedianValue with HighIndex, LowIndex and MedianIndex to a parabola, which gives a 3rd Order Binary Search that should be faster than the 2nd order one without consuming memory, and so on...
