find median in a fixed-size moving window along a long sequence of data

前端 未结 5 1173
挽巷
挽巷 2021-02-04 09:41

Given a sequence of data (it may have duplicates), a fixed-sized moving window, move the window at each iteration from the start of the data sequence, such that (1) the oldes

相关标签:
5条回答
  • 2021-02-04 10:06

    I gave this answer for the "rolling median in C" question

    I couldn't find a modern implementation of a c++ data structure with order-statistic so ended up implementing both ideas in top coders link ( Match Editorial: scroll down to FloatingMedian).

    Two multisets

    The first idea partitions the data into two data structures (heaps, multisets etc) with O(ln N) per insert/delete does not allow the quantile to be changed dynamically without a large cost. I.e. we can have a rolling median, or a rolling 75% but not both at the same time.

    Segment tree

    The second idea uses a segment tree which is O(ln N) for insert/deletions/queries but is more flexible. Best of all the "N" is the size of your data range. So if your rolling median has a window of a million items, but your data varies from 1..65536, then only 16 operations are required per movement of the rolling window of 1 million!! (And you only need 65536 * sizeof(counting_type) bytes, e.g. 65536*4).

    GNU Order Statistic Trees

    Just before giving up, I found that stdlibc++ contains order statistic trees!!!

    These have two critical operations:

    iter = tree.find_by_order(value)
    order = tree.order_of_key(value)
    

    See libstdc++ manual policy_based_data_structures_test (search for "split and join").

    I have wrapped the tree for use in a convenience header for compilers supporting c++0x/c++11 style partial typedefs:

    #if !defined(GNU_ORDER_STATISTIC_SET_H)
    #define GNU_ORDER_STATISTIC_SET_H
    #include <ext/pb_ds/assoc_container.hpp>
    #include <ext/pb_ds/tree_policy.hpp>
    
    // A red-black tree table storing ints and their order
    // statistics. Note that since the tree uses
    // tree_order_statistics_node_update as its update policy, then it
    // includes its methods by_order and order_of_key.
    template <typename T>
    using t_order_statistic_set = __gnu_pbds::tree<
                                      T,
                                      __gnu_pbds::null_type,
                                      std::less<T>,
                                      __gnu_pbds::rb_tree_tag,
                                      // This policy updates nodes'  metadata for order statistics.
                                      __gnu_pbds::tree_order_statistics_node_update>;
    
    #endif //GNU_ORDER_STATISTIC_SET_H
    
    0 讨论(0)
  • 2021-02-04 10:14

    O(n*lg m) is easy:

    Just maintain your window as two std::sets, one for the lower half, one for the upper half. Insertion of a new element costs O(lg m), finding and removal of an old element costs the same. Determining the median using the method you described in your question costs O(1).

    As you slide the window over your sequence, in each iteration you remove the item falling out of the window (O(lg m)), insert the new item (O(lg m)) and compute the median (O(1)), resulting in a total of O(n lg m).

    This solution uses space O(m), of course but I don't think you can get away without storing the window's contents.

    0 讨论(0)
  • 2021-02-04 10:17

    I have implemented almost exactly the algorithm you describe here: http://ideone.com/8VVEa, and described it here: Rolling median in C - Turlach implementation

    The way to get around the "find oldest" problem is to keep the values in a circular buffer, so you always have a pointer to the oldest. What you store in the heap are buffer indexes. So the space requirement is 2M, and each update is O(lg M).

    0 讨论(0)
  • 2021-02-04 10:17

    I attach my segment tree (see my other post) which allows the frequency distribution of counts to be queried very efficiently.

    This implements the following data structure:

    |-------------------------------|
    |---------------|---------------|
    |-------|-------|-------|-------|
    |---|---|---|---|---|---|---|---|
      0   1   2   3   4   5   6   7  
    

    Each segment keeps the number of counts items in the range it covers. I use 2N segments for a range of value from 1..N. These are placed in a single rolled out vector rather than the tree format show figuratively above.

    So if you are calculating rolling medians over a set of integers which vary from 1..65536, then you only need 128kb to store them, and can insert/delete/query using O(ln N) where N = the size of the range, i.e. 2**16 operations.

    This is a big win if the data range is much smaller than your rolling window.

    #if !defined(SEGMENT_TREE_H)
    #define SEGMENT_TREE_H
    #include <cassert>
    #include <array>
    #include <algorithm>
    #include <set>
    
    #ifndef NDEBUG
    #include <set>
    #endif
    
    template<typename COUNTS, unsigned BITS>
    class t_segment_tree
    {
        static const unsigned                       cnt_elements    = (1 << BITS);
        static const unsigned                       cnt_storage     = cnt_elements << 1;
        std::array<COUNTS, cnt_elements * 2 - 1>    counts;
        unsigned                                    count;
    
    #ifndef NDEBUG
        std::multiset<unsigned>                     elements;
    #endif
        public:
    
        //____________________________________________________________________________________
    
        //  constructor
    
        //____________________________________________________________________________________
        t_segment_tree(): count(0)
        {
            std::fill_n(counts.begin(), counts.size(),  0);
        }
        //~t_segment_tree();
    
        //____________________________________________________________________________________
    
        //  size
    
        //____________________________________________________________________________________
        unsigned size() const  { return count; }
    
        //____________________________________________________________________________________
    
        //  constructor
    
        //____________________________________________________________________________________
        void insert(unsigned x)
        {
    #ifndef NDEBUG
            elements.insert(x);
            assert("...............This element is too large for the number of BITs!!..............." && cnt_elements > x);
    #endif
            unsigned ii = x + cnt_elements;
            while (ii)
            {
                ++counts[ii - 1];
                ii >>= 1;
            }
            ++count;
        }
    
        //____________________________________________________________________________________
    
        //  erase 
    
        //      assumes erase is in the set
        //____________________________________________________________________________________
        void erase(unsigned x)
        {
    #ifndef NDEBUG
            // if the assertion failed here, it means that x was never "insert"-ed in the first place
            assert("...............This element was not 'insert'-ed before it is being 'erase'-ed!!..............." && elements.count(x));
            elements.erase(elements.find(x));
    #endif
            unsigned ii = x + cnt_elements;
            while (ii)
            {
                --counts[ii - 1];
                ii >>= 1;
            }
            --count;
        }
    
        // 
        //____________________________________________________________________________________
    
        //  kth element
    
        //____________________________________________________________________________________
        unsigned operator[](unsigned k)
        {
            assert("...............The kth element: k needs to be smaller than the number of elements!!..............." && k < size());
            unsigned ii = 1;
            while (ii < cnt_storage)
            {
                if (counts[ii - 1] <= k)
                   k -= counts[ii++ - 1];
                ii <<= 1;
            }
            return (ii >> 1) - cnt_elements;
        }
    
    };
    #endif
    
    0 讨论(0)
  • 2021-02-04 10:26

    Same answer as hc_ but instead of using a stock BST use a version where every node has the count of elements in that sub-tree. This way finding median is O(log(m)).

    0 讨论(0)
提交回复
热议问题