I am currently working on an algorithm to implement a rolling median filter (analogous to a rolling mean filter) in C. From my search of the literature, there appear to be t
It may be worth pointing out a special case that has a simple exact solution: when all the values in the stream are integers within a (relatively) small defined range. For instance, assume they must all lie between 0 and 1023. In this case just define an array of 1024 bins and a count, and clear all of them. For each value in the stream, increment the corresponding bin and the count. After the stream ends, find the bin that contains the value of rank count/2, which is easily done by summing successive bins starting from 0 until the running total exceeds count/2. The value at any other rank can be found the same way. (There is a minor complication if bin saturation has to be detected and the storage bins "upgraded" to a larger type during a run.)
This special case may seem artificial, but in practice it is very common. It can also be applied as an approximation for real numbers if they lie within a range and a "good enough" level of precision is known. This would hold for pretty much any set of measurements on a group of "real world" objects. For instance, the heights or weights of a group of people. Not a big enough set? It would work just as well for the lengths or weights of all the (individual) bacteria on the planet - assuming somebody could supply the data!
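A minimal sketch of the counting approach in C, assuming every value is an integer in 0..RANGE-1. The names StreamHist, hist_add, hist_rank and hist_median are made up for illustration and do not come from any library.

```c
#include <assert.h>

#define RANGE 1024              /* assumed: values lie in 0..RANGE-1 */

typedef struct {
    unsigned long bins[RANGE];  /* bins[v] = how many times v was seen */
    unsigned long count;        /* total number of values seen */
} StreamHist;

/* Record one value from the stream. */
void hist_add(StreamHist *h, unsigned v)
{
    assert(v < RANGE);
    h->bins[v]++;
    h->count++;
}

/* Value at 0-based rank r in sorted order; r must be < count. */
unsigned hist_rank(const StreamHist *h, unsigned long r)
{
    unsigned long seen = 0;
    assert(r < h->count);
    for (unsigned v = 0; v < RANGE; v++) {
        seen += h->bins[v];     /* running total of values <= v */
        if (seen > r)
            return v;
    }
    return RANGE - 1;           /* not reached when r < count */
}

/* Median as the value at rank count/2 (the upper median for even counts). */
unsigned hist_median(const StreamHist *h)
{
    return hist_rank(h, h->count / 2);
}
```

With a zero-initialized StreamHist, feeding 5, 1, 9, 3, 7 through hist_add and then calling hist_median gives 5. Real-valued measurements could be mapped to bin indices first (say, heights rounded to millimetres), at the cost of that rounding.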
It looks like I misread the original, which seems to want a sliding window median instead of just the median of a very long stream. This approach still works for that. Load the first N stream values for the initial window, then for the (N+1)th stream value increment the corresponding bin while decrementing the bin corresponding to the oldest value in the window (the 0th stream value). It is necessary in this case to retain the last N values to allow the decrement, which can be done efficiently by cyclically addressing an array of size N. Since the position of the median can only change by -2,-1,0,1,2 on each step of the sliding window, it isn't necessary to sum all the bins up to the median on each step; just adjust the "median pointer" depending upon which side(s) bins were modified. For instance, if both the new value and the one being removed fall below the current median then it doesn't change (offset = 0). The method breaks down when N becomes too large to hold conveniently in memory.
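The sliding-window variant might look roughly like this in C. This is a sketch under assumptions: values are integers in 0..RANGE-1, the window length is the compile-time constant WINDOW, and the names SlidingMedian, sm_init and sm_push are invented here. Instead of a fixed offset table, it keeps a count of samples below the median bin and walks the median pointer until that count is consistent again, which may step across several empty bins.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RANGE  1024   /* assumed: values lie in 0..RANGE-1 */
#define WINDOW 5      /* assumed window length N */

typedef struct {
    unsigned hist[RANGE];   /* histogram of the samples currently in the window */
    unsigned ring[WINDOW];  /* last N samples, addressed cyclically */
    size_t   head;          /* next write slot; holds the oldest sample once full */
    size_t   count;         /* samples held so far (at most WINDOW) */
    unsigned median;        /* the "median pointer": current median bin */
    size_t   below;         /* number of held samples strictly below median */
} SlidingMedian;

void sm_init(SlidingMedian *s)
{
    memset(s, 0, sizeof *s);
}

/* Push one sample and return the value at sorted index count/2 of the window. */
unsigned sm_push(SlidingMedian *s, unsigned v)
{
    size_t k;
    assert(v < RANGE);
    if (s->count == WINDOW) {           /* window full: drop the oldest sample */
        unsigned old = s->ring[s->head];
        s->hist[old]--;
        if (old < s->median)
            s->below--;
    } else {
        s->count++;
    }
    s->ring[s->head] = v;
    s->head = (s->head + 1) % WINDOW;
    s->hist[v]++;
    if (v < s->median)
        s->below++;

    /* Restore the invariant: below <= k < below + hist[median]. */
    k = s->count / 2;
    while (s->below > k) {                              /* median moved down */
        s->median--;
        s->below -= s->hist[s->median];
    }
    while (s->below + s->hist[s->median] <= k) {        /* median moved up */
        s->below += s->hist[s->median];
        s->median++;
    }
    return s->median;
}
```

After sm_init, pushing 5, 1, 9, 3, 7 yields a median of 5; pushing 2 next evicts the 5 and the median becomes 3. While the window is still filling, the return value is the median of the samples seen so far.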
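Here is one that can be used when exact output is not important (for display purposes etc.). You need totalcount and lastmedian, plus the newvalue.
{
/* keep lastmedian and newmedian in floating point: with C-style integer division, lastmedian/totalcount truncates to 0 once totalcount exceeds lastmedian, and the estimate stops moving */
totalcount++;
newmedian=lastmedian+(newvalue>lastmedian?1:-1)*(lastmedian==0?newvalue: lastmedian/totalcount*2);
}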
Produces quite exact results for things like page_display_time.
Rules: the input stream needs to be smooth on the order of page display time, big in count (>30 etc), and have a non zero median.
Example: page load time, 800 items, 10ms...3000ms, average 90ms, real median:11ms
After 30 inputs, the median's error is generally <=20% (9ms..12ms), and it keeps shrinking. After 800 inputs, the error is +-2%.
Another thinker with a similar solution is here: Median Filter Super efficient implementation
I use this incremental median estimator:
median += eta * sgn(sample - median)
which has the same form as the more common mean estimator:
mean += eta * (sample - mean)
Here eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta like this if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources use something like eta = 1 / n to converge, where n is the number of samples seen so far.)
Also, I modified the median estimator to make it work for arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1 - p. The following estimates this value incrementally:
quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)
The value p should be within [0, 1]. This essentially shifts the sgn() function's symmetrical output {-1, 0, 1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1 - p of the data are less than/greater than the quantile estimate, respectively). Note that for p = 0.5, this reduces to the median estimator.
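Since the question asks for C, the update rule above could be written along these lines. This is only a sketch: the function name quantile_update and its parameter layout are my own choices, and the caller is assumed to keep the running estimate between calls.

```c
#include <assert.h>

/* Signum: returns -1, 0 or 1. */
static int sgn(double x)
{
    return (x > 0.0) - (x < 0.0);
}

/*
 * One incremental update of a quantile estimate.
 *   q       current estimate, updated in place
 *   sample  newest value from the stream
 *   eta     learning rate (step size), e.g. 0.001
 *   p       target quantile in [0, 1]; p = 0.5 gives the median estimator
 */
void quantile_update(double *q, double sample, double eta, double p)
{
    assert(p >= 0.0 && p <= 1.0);
    *q += eta * (sgn(sample - *q) + 2.0 * p - 1.0);
}
```

A caller would start from some initial guess (0.0, or the first sample) and call quantile_update once per incoming value. With a constant eta the estimate keeps jittering by roughly eta around the true quantile, so the step size trades tracking speed against noise.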
Based on @mathog's thoughts, this is a C# implementation of a running median over an array of bytes with a known range of values. It can be extended to other integer types.
using System.Runtime.CompilerServices; // needed for [MethodImpl]
/// <summary>
/// Median estimation by histogram, avoids multiple sorting operations for a running median
/// </summary>
public class MedianEstimator
{
private readonly int m_size2;
// Note: byte counters limit the window to 255 samples; switch to int[] for longer windows
private readonly byte[] m_counts;
/// <summary>
/// Estimated median, available right after calling <see cref="Init"/> or <see cref="Update"/>.
/// </summary>
public byte Median { get; private set; }
/// <summary>
/// Ctor
/// </summary>
/// <param name="size">Median size in samples</param>
/// <param name="maxValue">Maximum expected value in input data</param>
public MedianEstimator(
int size,
byte maxValue)
{
m_size2 = size / 2;
m_counts = new byte[maxValue + 1];
}
/// <summary>
/// Initializes the internal histogram with the passed sample values
/// </summary>
/// <param name="values">Array of values, usually the start of the array for a running median</param>
public void Init(byte[] values)
{
for (var i = 0; i < values.Length; i++)
m_counts[values[i]]++;
UpdateMedian();
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private void UpdateMedian()
{
// The median is the first value up to which counts add to size / 2
var sum = 0;
Median = 0;
for (var i = 0; i < m_counts.Length; i++)
{
sum += m_counts[i];
Median = (byte) i;
if (sum > m_size2) break;
}
}
/// <summary>
/// Updates the median estimation by removing <paramref name="last"/> and adding <paramref name="next"/>. These
/// values must be updated as the running median is applied. If the median length is <i>N</i>, at the sample
/// <i>i</i>, <paramref name="last"/> is sample at index <i>i</i>-<i>N</i>/2 and <paramref name="next"/> is sample
/// at index <i>i</i>+<i>N</i>/2+1.
/// </summary>
/// <param name="last">Sample at the start of the moving window that is to be removed</param>
/// <param name="next">Sample at the end of the moving window + 1 that is to be added</param>
public void Update(byte last, byte next)
{
m_counts[last]--;
m_counts[next]++;
// The conditions below do not change median value so there is no need to update it
if (last == next ||
last < Median && next < Median || // both below median
last > Median && next > Median) // both above median
return;
UpdateMedian();
}
}
Testing against a running median, with timing:
private void TestMedianEstimator()
{
var r = new Random();
const int SIZE = 15;
const byte MAX_VAL = 80;
var values = new byte[100000];
for (var i = 0; i < values.Length; i++)
values[i] = (byte) (MAX_VAL * r.NextDouble());
var timer = Stopwatch.StartNew();
// Running median
var window = new byte[2 * SIZE + 1];
var medians = new byte[values.Length];
for (var i = SIZE; i < values.Length - SIZE - 1; i++)
{
for (int j = i - SIZE, k = 0; j <= i + SIZE; j++, k++)
window[k] = values[j];
Array.Sort(window);
medians[i] = window[SIZE];
}
timer.Stop();
var elapsed1 = timer.Elapsed;
timer.Restart();
var me = new MedianEstimator(2 * SIZE + 1, MAX_VAL);
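// Slice is assumed to be a helper extension that copies a sub-array; with System.Linq, values.Take(2 * SIZE + 1).ToArray() does the same
me.Init(values.Slice(0, 2 * SIZE + 1));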
var meMedians = new byte[values.Length];
for (var i = SIZE; i < values.Length - SIZE - 1; i++)
{
meMedians[i] = me.Median;
me.Update(values[i - SIZE], values[i + SIZE + 1]);
}
timer.Stop();
var elapsed2 = timer.Elapsed;
WriteLineToLog($"{elapsed1.TotalMilliseconds / elapsed2.TotalMilliseconds:0.00}");
var diff = 0;
for (var i = 0; i < meMedians.Length; i++)
diff += Math.Abs(meMedians[i] - medians[i]);
WriteLineToLog($"Diff: {diff}");
}
For those who need a running median in Java...PriorityQueue is your friend. O(log N) insert, O(1) current median, and O(N) remove. If you know the distribution of your data you can do a lot better than this.
import java.util.Comparator;
import java.util.PriorityQueue;

public class RunningMedian {
// Two priority queues, one of reversed order.
PriorityQueue<Integer> lower = new PriorityQueue<Integer>(10,
new Comparator<Integer>() {
public int compare(Integer arg0, Integer arg1) {
return arg1.compareTo(arg0); // descending order; == on boxed Integers would compare references
}
}), higher = new PriorityQueue<Integer>();
public void insert(Integer n) {
// lower can be empty while higher is not (after remove()), so guard peek() against null
if (lower.isEmpty() ? (higher.isEmpty() || n <= higher.peek()) : n <= lower.peek())
lower.add(n);
else
higher.add(n);
rebalance();
}
void rebalance() {
if (lower.size() < higher.size() - 1)
lower.add(higher.remove());
else if (higher.size() < lower.size() - 1)
higher.add(lower.remove());
}
public Integer getMedian() {
if (lower.isEmpty() && higher.isEmpty())
return null;
else if (lower.size() == higher.size())
return (lower.peek() + higher.peek()) / 2;
else
return (lower.size() < higher.size()) ? higher.peek() : lower.peek();
}
public void remove(Integer n) {
if (lower.remove(n) || higher.remove(n))
rebalance();
}
}
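Here is a Java implementation. Note that it keeps the values in TreeSets, so duplicate values are dropped, and it treats 0 as an "empty" marker, so it is only reliable for streams of distinct, non-zero integers.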
package MedianOfIntegerStream;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;
public class MedianOfIntegerStream {
public Set<Integer> rightMinSet;
public Set<Integer> leftMaxSet;
public int numOfElements;
public MedianOfIntegerStream() {
rightMinSet = new TreeSet<Integer>();
leftMaxSet = new TreeSet<Integer>(new DescendingComparator());
numOfElements = 0;
}
public void addNumberToStream(Integer num) {
leftMaxSet.add(num);
Iterator<Integer> iterMax = leftMaxSet.iterator();
Iterator<Integer> iterMin = rightMinSet.iterator();
int maxEl = iterMax.next();
int minEl = 0;
if (iterMin.hasNext()) {
minEl = iterMin.next();
}
if (numOfElements % 2 == 0) {
if (numOfElements == 0) {
numOfElements++;
return;
} else if (maxEl > minEl) {
iterMax.remove();
if (minEl != 0) {
iterMin.remove();
}
leftMaxSet.add(minEl);
rightMinSet.add(maxEl);
}
} else {
if (maxEl != 0) {
iterMax.remove();
}
rightMinSet.add(maxEl);
}
numOfElements++;
}
public Double getMedian() {
if (numOfElements % 2 != 0)
return leftMaxSet.iterator().next().doubleValue(); // new Double(...) is deprecated since Java 9
else
return (leftMaxSet.iterator().next() + rightMinSet.iterator().next()) / 2.0;
}
private class DescendingComparator implements Comparator<Integer> {
@Override
public int compare(Integer o1, Integer o2) {
return o2.compareTo(o1); // descending order without the overflow risk of o2 - o1
}
}
public static void main(String[] args) {
MedianOfIntegerStream streamMedian = new MedianOfIntegerStream();
streamMedian.addNumberToStream(1);
System.out.println(streamMedian.getMedian()); // should be 1
streamMedian.addNumberToStream(5);
streamMedian.addNumberToStream(10);
streamMedian.addNumberToStream(12);
streamMedian.addNumberToStream(2);
System.out.println(streamMedian.getMedian()); // should be 5
streamMedian.addNumberToStream(3);
streamMedian.addNumberToStream(8);
streamMedian.addNumberToStream(9);
System.out.println(streamMedian.getMedian()); // should be 6.5
}
}