Given a sequence of data (it may have duplicates), a fixed-sized moving window, move the window at each iteration from the start of the data sequence, such that (1) the oldes
I gave this answer for the "rolling median in C" question
I couldn't find a modern implementation of a c++ data structure with order-statistic so ended up implementing both ideas in top coders link ( Match Editorial: scroll down to FloatingMedian).
The first idea partitions the data into two data structures (heaps, multisets etc) with O(ln N) per insert/delete does not allow the quantile to be changed dynamically without a large cost. I.e. we can have a rolling median, or a rolling 75% but not both at the same time.
The second idea uses a segment tree which is O(ln N) for insert/deletions/queries but is more flexible. Best of all the "N" is the size of your data range. So if your rolling median has a window of a million items, but your data varies from 1..65536, then only 16 operations are required per movement of the rolling window of 1 million!! (And you only need 65536 * sizeof(counting_type) bytes, e.g. 65536*4).
Just before giving up, I found that stdlibc++ contains order statistic trees!!!
These have two critical operations:
iter = tree.find_by_order(value)
order = tree.order_of_key(value)
See libstdc++ manual policy_based_data_structures_test (search for "split and join").
I have wrapped the tree for use in a convenience header for compilers supporting c++0x/c++11 style partial typedefs:
#if !defined(GNU_ORDER_STATISTIC_SET_H)
#define GNU_ORDER_STATISTIC_SET_H
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
// A red-black tree table storing ints and their order
// statistics. Note that since the tree uses
// tree_order_statistics_node_update as its update policy, then it
// includes its methods by_order and order_of_key.
template <typename T>
using t_order_statistic_set = __gnu_pbds::tree<
T,
__gnu_pbds::null_type,
std::less<T>,
__gnu_pbds::rb_tree_tag,
// This policy updates nodes' metadata for order statistics.
__gnu_pbds::tree_order_statistics_node_update>;
#endif //GNU_ORDER_STATISTIC_SET_H
O(n*lg m) is easy:
Just maintain your window as two std::set
s, one for the lower half, one for the upper half. Insertion of a new element costs O(lg m), finding and removal of an old element costs the same. Determining the median using the method you described in your question costs O(1).
As you slide the window over your sequence, in each iteration you remove the item falling out of the window (O(lg m)), insert the new item (O(lg m)) and compute the median (O(1)), resulting in a total of O(n lg m).
This solution uses space O(m), of course but I don't think you can get away without storing the window's contents.
I have implemented almost exactly the algorithm you describe here: http://ideone.com/8VVEa, and described it here: Rolling median in C - Turlach implementation
The way to get around the "find oldest" problem is to keep the values in a circular buffer, so you always have a pointer to the oldest. What you store in the heap are buffer indexes. So the space requirement is 2M, and each update is O(lg M).
I attach my segment tree (see my other post) which allows the frequency distribution of counts to be queried very efficiently.
This implements the following data structure:
|-------------------------------|
|---------------|---------------|
|-------|-------|-------|-------|
|---|---|---|---|---|---|---|---|
0 1 2 3 4 5 6 7
Each segment keeps the number of counts items in the range it covers. I use 2N segments for a range of value from 1..N. These are placed in a single rolled out vector rather than the tree format show figuratively above.
So if you are calculating rolling medians over a set of integers which vary from 1..65536, then you only need 128kb to store them, and can insert/delete/query using O(ln N) where N = the size of the range, i.e. 2**16 operations.
This is a big win if the data range is much smaller than your rolling window.
#if !defined(SEGMENT_TREE_H)
#define SEGMENT_TREE_H
#include <cassert>
#include <array>
#include <algorithm>
#include <set>
#ifndef NDEBUG
#include <set>
#endif
template<typename COUNTS, unsigned BITS>
class t_segment_tree
{
static const unsigned cnt_elements = (1 << BITS);
static const unsigned cnt_storage = cnt_elements << 1;
std::array<COUNTS, cnt_elements * 2 - 1> counts;
unsigned count;
#ifndef NDEBUG
std::multiset<unsigned> elements;
#endif
public:
//____________________________________________________________________________________
// constructor
//____________________________________________________________________________________
t_segment_tree(): count(0)
{
std::fill_n(counts.begin(), counts.size(), 0);
}
//~t_segment_tree();
//____________________________________________________________________________________
// size
//____________________________________________________________________________________
unsigned size() const { return count; }
//____________________________________________________________________________________
// constructor
//____________________________________________________________________________________
void insert(unsigned x)
{
#ifndef NDEBUG
elements.insert(x);
assert("...............This element is too large for the number of BITs!!..............." && cnt_elements > x);
#endif
unsigned ii = x + cnt_elements;
while (ii)
{
++counts[ii - 1];
ii >>= 1;
}
++count;
}
//____________________________________________________________________________________
// erase
// assumes erase is in the set
//____________________________________________________________________________________
void erase(unsigned x)
{
#ifndef NDEBUG
// if the assertion failed here, it means that x was never "insert"-ed in the first place
assert("...............This element was not 'insert'-ed before it is being 'erase'-ed!!..............." && elements.count(x));
elements.erase(elements.find(x));
#endif
unsigned ii = x + cnt_elements;
while (ii)
{
--counts[ii - 1];
ii >>= 1;
}
--count;
}
//
//____________________________________________________________________________________
// kth element
//____________________________________________________________________________________
unsigned operator[](unsigned k)
{
assert("...............The kth element: k needs to be smaller than the number of elements!!..............." && k < size());
unsigned ii = 1;
while (ii < cnt_storage)
{
if (counts[ii - 1] <= k)
k -= counts[ii++ - 1];
ii <<= 1;
}
return (ii >> 1) - cnt_elements;
}
};
#endif
Same answer as hc_ but instead of using a stock BST use a version where every node has the count of elements in that sub-tree. This way finding median is O(log(m)).