find median in a fixed-size moving window along a long sequence of data

前端未结

关注

 5  1173

Given a sequence of data (it may have duplicates), a fixed-sized moving window, move the window at each iteration from the start of the data sequence, such that (1) the oldes

相关标签:

5条回答

孤街浪徒

2021-02-04 10:06
I gave this answer for the "rolling median in C" question

I couldn't find a modern implementation of a c++ data structure with order-statistic so ended up implementing both ideas in top coders link ( Match Editorial: scroll down to FloatingMedian).

Two multisets

The first idea partitions the data into two data structures (heaps, multisets etc) with O(ln N) per insert/delete does not allow the quantile to be changed dynamically without a large cost. I.e. we can have a rolling median, or a rolling 75% but not both at the same time.

Segment tree

The second idea uses a segment tree which is O(ln N) for insert/deletions/queries but is more flexible. Best of all the "N" is the size of your data range. So if your rolling median has a window of a million items, but your data varies from 1..65536, then only 16 operations are required per movement of the rolling window of 1 million!! (And you only need 65536 * sizeof(counting_type) bytes, e.g. 65536*4).

GNU Order Statistic Trees

Just before giving up, I found that stdlibc++ contains order statistic trees!!!

These have two critical operations:
```
iter = tree.find_by_order(value)
order = tree.order_of_key(value)
```
See libstdc++ manual policy_based_data_structures_test (search for "split and join").

I have wrapped the tree for use in a convenience header for compilers supporting c++0x/c++11 style partial typedefs:
```
#if !defined(GNU_ORDER_STATISTIC_SET_H)
#define GNU_ORDER_STATISTIC_SET_H
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A red-black tree table storing ints and their order
// statistics. Note that since the tree uses
// tree_order_statistics_node_update as its update policy, then it
// includes its methods by_order and order_of_key.
template <typename T>
using t_order_statistic_set = __gnu_pbds::tree<
                                  T,
                                  __gnu_pbds::null_type,
                                  std::less<T>,
                                  __gnu_pbds::rb_tree_tag,
                                  // This policy updates nodes'  metadata for order statistics.
                                  __gnu_pbds::tree_order_statistics_node_update>;

#endif //GNU_ORDER_STATISTIC_SET_H
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
小鲜肉

2021-02-04 10:14

O(n*lg m) is easy:

Just maintain your window as two std::sets, one for the lower half, one for the upper half. Insertion of a new element costs O(lg m), finding and removal of an old element costs the same. Determining the median using the method you described in your question costs O(1).

As you slide the window over your sequence, in each iteration you remove the item falling out of the window (O(lg m)), insert the new item (O(lg m)) and compute the median (O(1)), resulting in a total of O(n lg m).

This solution uses space O(m), of course but I don't think you can get away without storing the window's contents.

0 讨论(0)
发布评论:

提交评论
- 加载中...
日久生厌

2021-02-04 10:17

I have implemented almost exactly the algorithm you describe here: http://ideone.com/8VVEa, and described it here: Rolling median in C - Turlach implementation

The way to get around the "find oldest" problem is to keep the values in a circular buffer, so you always have a pointer to the oldest. What you store in the heap are buffer indexes. So the space requirement is 2M, and each update is O(lg M).

0 讨论(0)
发布评论:

提交评论
- 加载中...

不思量自难忘°

2021-02-04 10:17

I attach my segment tree (see my other post) which allows the frequency distribution of counts to be queried very efficiently.

This implements the following data structure:

|-------------------------------|
|---------------|---------------|
|-------|-------|-------|-------|
|---|---|---|---|---|---|---|---|
  0   1   2   3   4   5   6   7

Each segment keeps the number of counts items in the range it covers. I use 2N segments for a range of value from 1..N. These are placed in a single rolled out vector rather than the tree format show figuratively above.

So if you are calculating rolling medians over a set of integers which vary from 1..65536, then you only need 128kb to store them, and can insert/delete/query using O(ln N) where N = the size of the range, i.e. 2**16 operations.

This is a big win if the data range is much smaller than your rolling window.

#if !defined(SEGMENT_TREE_H)
#define SEGMENT_TREE_H
#include <cassert>
#include <array>
#include <algorithm>
#include <set>

#ifndef NDEBUG
#include <set>
#endif

template<typename COUNTS, unsigned BITS>
class t_segment_tree
{
    static const unsigned                       cnt_elements    = (1 << BITS);
    static const unsigned                       cnt_storage     = cnt_elements << 1;
    std::array<COUNTS, cnt_elements * 2 - 1>    counts;
    unsigned                                    count;

#ifndef NDEBUG
    std::multiset<unsigned>                     elements;
#endif
    public:

    //____________________________________________________________________________________

    //  constructor

    //____________________________________________________________________________________
    t_segment_tree(): count(0)
    {
        std::fill_n(counts.begin(), counts.size(),  0);
    }
    //~t_segment_tree();

    //____________________________________________________________________________________

    //  size

    //____________________________________________________________________________________
    unsigned size() const  { return count; }

    //____________________________________________________________________________________

    //  constructor

    //____________________________________________________________________________________
    void insert(unsigned x)
    {
#ifndef NDEBUG
        elements.insert(x);
        assert("...............This element is too large for the number of BITs!!..............." && cnt_elements > x);
#endif
        unsigned ii = x + cnt_elements;
        while (ii)
        {
            ++counts[ii - 1];
            ii >>= 1;
        }
        ++count;
    }

    //____________________________________________________________________________________

    //  erase 

    //      assumes erase is in the set
    //____________________________________________________________________________________
    void erase(unsigned x)
    {
#ifndef NDEBUG
        // if the assertion failed here, it means that x was never "insert"-ed in the first place
        assert("...............This element was not 'insert'-ed before it is being 'erase'-ed!!..............." && elements.count(x));
        elements.erase(elements.find(x));
#endif
        unsigned ii = x + cnt_elements;
        while (ii)
        {
            --counts[ii - 1];
            ii >>= 1;
        }
        --count;
    }

    // 
    //____________________________________________________________________________________

    //  kth element

    //____________________________________________________________________________________
    unsigned operator[](unsigned k)
    {
        assert("...............The kth element: k needs to be smaller than the number of elements!!..............." && k < size());
        unsigned ii = 1;
        while (ii < cnt_storage)
        {
            if (counts[ii - 1] <= k)
               k -= counts[ii++ - 1];
            ii <<= 1;
        }
        return (ii >> 1) - cnt_elements;
    }

};
#endif

0 讨论(0)

遥遥无期

2021-02-04 10:26

Same answer as hc_ but instead of using a stock BST use a version where every node has the count of elements in that sub-tree. This way finding median is O(log(m)).

0 讨论(0)
发布评论:

提交评论
- 加载中...

find median in a fixed-size moving window along a long sequence of data

Two multisets

Segment tree

GNU Order Statistic Trees