How to find the minimal-length subsequence that contains all elements of a sequence

无人共我 2021-02-01 07:29

Given a sequence such as S = {1,8,2,1,4,1,2,9,1,8,4}, I need to find the minimal-length subsequence that contains all elements of S (no duplicates, order does n

7 answers
  • 2021-02-01 07:59

    Here is an algorithm that requires O(N) time and O(N) space. It is similar to the one by Grigor Gevorgyan and also uses an auxiliary O(N) array of flags. The algorithm finds the longest contiguous subsequence whose elements are all distinct. If bestLength < numUnique, then no duplicate-free subsequence contains every unique element. The algorithm assumes that the elements are non-negative integers and that the largest element is less than the length of the sequence.

    #include <iostream>
    using namespace std;
    
    bool findLongestSequence() {
        // Data (adapt as needed)
        const int N = 13;
        char flags[N];
        int a[] = {1,8,2,1,4,1,2,9,1,8,1,4,1};
    
        // Number of unique elements
        int numUnique = 0;
        for (int n = 0; n < N; ++n) flags[n] = 0; // clear flags
        for (int n = 0; n < N; ++n) {
            if (a[n] < 0 || a[n] >= N) return false; // assumptions violated 
            if (flags[a[n]] == 0) {
                ++numUnique;
                flags[a[n]] = 1;
            }
        }
    
        // Find the longest sequence ("best")
        for (int n = 0; n < N; ++n) flags[n] = 0; // clear flags
        int bestBegin = 0, bestLength = 0;
        int begin = 0, end = 0, currLength = 0;
        for (; begin < N; ++begin) {
            while (end < N) {
                if (flags[a[end]] == 0) {
                    ++currLength;
                    flags[a[end]] = 1;
                    ++end;
                }
                else {
                    break; // end-loop
                }
            }
            if (currLength > bestLength) {
                bestLength = currLength;
                bestBegin = begin;
            }
            if (bestLength >= numUnique) {
                break; // begin-loop
            }
            flags[a[begin]] = 0; // reset
            --currLength;
        }
    
        cout << "numUnique = " << numUnique << endl;
        cout << "bestBegin = " << bestBegin << endl;
        cout << "bestLength = " << bestLength << endl;
        return true; // longest subsequence found
    }
    
  • 2021-02-01 08:09

    I've got an O(N*M) algorithm, where N is the length of S and M is the number of distinct elements. It tends to work better for small values of M; if there are very few duplicates, M approaches N and the cost may become quadratic. Edit: in practice it seems to be much closer to O(N). You only get O(N*M) in worst-case scenarios.

    Start by going through the sequence and record all the elements of S. Let's call this set E.

    We're going to work with a dynamic subsequence of S. Create an empty map M (from here on, M names this map, not the element count above) that associates each element with the number of times it is present in the subsequence.

    For example, if subSequence = {1,8,2,1,4}, and E = {1, 2, 4, 8, 9}

    • M[9]==0
    • M[2]==M[4]==M[8]==1
    • M[1]==2

    You'll need two indices, each pointing to an element of S. One is called L because it marks the left end of the subsequence delimited by the two indices. The other is called R because it marks the right end.

    Begin by initializing L=0,R=0 and M[S[0]]++

    The algorithm is :

    Repeat
    {
      While(M does not contain all the elements of E)
      {
        if(R is the last index of S)
          stop (the shortest subsequence recorded so far is the answer)
        R++
        M[S[R]]++
      }
      While(M contains all the elements of E)
      {
        if(the subsequence S[L->R] is the shortest one seen so far)
          Record it
        M[S[L]]--
        L++
      }
    }
    

    To check whether M contains all the elements of E, you can keep a vector of booleans V, where V[i]==true if M[E[i]]>0 and V[i]==false if M[E[i]]==0. Start with every value of V set to false. Each time you do M[S[R]]++, set V for that element to true; each time you do M[S[L]]-- and M[S[L]] drops to 0, set V for that element to false.
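    The two loops above can be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the function name `shortest_covering_window` and the counter `present` are made up for this sketch, and `present` stands in for scanning the boolean vector V, so the "contains all the elements" test is a single comparison.

```python
from collections import Counter


def shortest_covering_window(S):
    """Return inclusive indices (L, R) of a shortest window of S that
    contains every distinct value of S, or None if S is empty."""
    if not S:
        return None
    needed = len(set(S))   # size of E, the set of distinct values
    M = Counter()          # occurrences of each value inside S[L..R]
    present = 0            # number of values with M[value] > 0 (replaces V)
    best = None
    L = 0
    for R, x in enumerate(S):
        # Grow the window to the right.
        if M[x] == 0:
            present += 1
        M[x] += 1
        # Shrink from the left while the window still covers E.
        while present == needed:
            if best is None or R - L < best[1] - best[0]:
                best = (L, R)
            M[S[L]] -= 1
            if M[S[L]] == 0:
                present -= 1
            L += 1
    return best
```

    Since L and R each only move forward, this sketch does O(N) map operations in total, which matches the "much closer to O(N)" observation in the edit above.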

  • 2021-02-01 08:11

    Algorithm:

    First, determine the quantity of different elements in the array - this can be easily done in linear time. Let there be k different elements.

    Allocate an array cur of size 10^5, where each entry shows how many times the corresponding element occurs in the current subsequence (see later).

    Keep a variable cnt holding the number of different elements currently in the considered subsequence. Now take two indexes, begin and end, and move them through the array as follows:

    1. Initialize cnt and begin to 0 and end to -1 (so that it becomes 0 after the first increment). Then repeat the following while possible:
    2. If cnt != k:

      2.1. Increment end. If end has reached the end of the array, break. If cur[array[end]] is zero, increment cnt. Increment cur[array[end]].

      Else:

      2.2. Try to advance begin: while cur[array[begin]] > 1, decrement cur[array[begin]] and increment begin (cur[array[begin]] > 1 means the same element occurs again later in the current subsequence). Then compare the [begin, end] interval with the current answer and store it if it is better. Finally, drop array[begin] from the interval (decrement cur[array[begin]], increment begin, decrement cnt) so that end starts advancing again.

    Once no further step is possible, you have the answer. The complexity is O(n): the two iterators each pass through the array once.

    Implementation in C++:

    #include <cstring>   // memset
    #include <iostream>
    
    using namespace std;
    
    const int MAXSIZE = 10000; // n must not exceed this, and every element must lie in [0, MAXSIZE)
    
    int arr[ MAXSIZE ];
    int cur[ MAXSIZE ];
    
    int main ()
    {
       int n; // the size of array
       // read n and the array
    
       cin >> n;
       for( int i = 0; i < n; ++i )
          cin >> arr[ i ];
    
       int k = 0;
       for( int i = 0; i < n; ++i )
       {
          if( cur[ arr[ i ] ] == 0 )
             ++k;
          ++cur[ arr[ i ] ];
       }
    
       // now k is the number of distinct elements
    
       memset( cur, 0, sizeof( cur )); // we need this array anew
       int begin = 0, end = -1; // to make it 0 after first increment
       int best = -1; // best answer currently found
       int ansbegin = 0, ansend = -1; // interval of the best answer currently found (empty until one is found)
       int cnt = 0; // distinct elements in current subsequence
    
       while(1)
       {
          if( cnt < k )
          {
             ++end;
             if( end == n )
                break;
             if( cur[ arr[ end ]] == 0 )
                ++cnt; // this element wasn't present in the current subsequence
             ++cur[ arr[ end ]];
             continue;
          }
          // if we're here it means that [begin, end] interval contains all distinct elements
          // try to shrink it from behind
          while( cur[ arr[ begin ]] > 1 ) // we have another such element later in the subsequence
          {
             --cur[ arr[ begin ]];
             ++begin;
          }
          // now, compare [begin, end] with the best answer found yet
          if( best == -1 || end - begin < best )
          {
             best = end - begin;
             ansbegin = begin;
             ansend = end;
          }
          // now increment the begin iterator to make cnt < k and start increasing the end iterator again
          --cur[ arr[ begin]];
          ++begin;
          --cnt;
       }
    
       // output the [ansbegin, ansend] interval as it's the answer to the problem
    
       cout << ansbegin << ' ' << ansend << endl;
       for( int i = ansbegin; i <= ansend; ++i )
          cout << arr[ i ] << ' ';
       cout << endl;
    
       return 0;
    }
    
  • 2021-02-01 08:11

    The solution above is correct; here is a Java version of the same code.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;
    
    public class MinSequence {
    
        public static void main(String[] args)
        {
            // sample input (adapt as needed)
            final List<Integer> arr=new ArrayList<Integer>(4);
            Map<Integer, Integer> cur = new TreeMap<Integer, Integer>();
            arr.add(1);
            arr.add(2);
            arr.add(1);
            arr.add(3);
            int distinctcount=0;
            for (final Integer integer : arr)
            {
                if(cur.get(integer)==null)
                {
                    cur.put(integer, 1);
                    ++distinctcount;
                }else
                {
                    cur.put(integer,cur.get(integer)+1);
                }
            }
    
            // now distinctcount is the number of distinct elements
            cur=new TreeMap<Integer,Integer>(); // start again with empty counts for the window
            int begin = 0, end = -1; // to make it 0 after first increment
            int best = -1; // best answer currently found
            int ansbegin = 0, ansend = 0; // interval of the best answer currently found
            int cnt = 0; // distinct elements in current subsequence
            final int inpsize = arr.size();
            while(true)
            {
                if( cnt < distinctcount )
                {
                    ++end;
                    if (end == inpsize) {
                        break;
                    }
                    if( cur.get(arr.get(end)) == null ) {
                        ++cnt;
                        cur.put(arr.get(end), 1);
                    } // this elements wasn't present in current subsequence;
                    else
                    {
                        cur.put(arr.get(end),cur.get(arr.get(end))+1);
                    }
                    continue;
                }
                // if we're here it means that [begin, end] interval contains all distinct elements
                // try to shrink it from behind
                while (cur.get(arr.get(begin)) != null && cur.get(arr.get(begin)) > 1) // we have another such element later in the subsequence
                {
                    cur.put(arr.get(begin),cur.get(arr.get(begin))-1);
                    ++begin;
                }
                // now, compare [begin, end] with the best answer found yet
                if( best == -1 || end - begin < best )
                {
                    best = end - begin;
                    ansbegin = begin;
                    ansend = end;
                }
                // now advance begin so that cnt < distinctcount and end starts growing again;
                // the count of arr[begin] is exactly 1 here, so remove the key entirely:
                // the "== null" test above treats only absent keys as "not in the window"
                cur.remove(arr.get(begin));
                ++begin;
                --cnt;
            }
    
            // output the [ansbegin, ansend] interval as it's the answer to the problem
            System.out.println(ansbegin+"--->"+ansend);
            for( int i = ansbegin; i <= ansend; ++i ) {
                System.out.println(arr.get(i));
            }
        }
    }
    
  • 2021-02-01 08:12

    If you need to do this quite often for the same sequence and different sets, you can use inverted lists. You prepare the inverted lists for the sequence once and then, for each query, collect all the offsets of the requested elements. Then scan the results from the inverted lists for a sequence of m sequential numbers.

    With n the length of the sequence and m the size of the query, the preparation takes O(n). The response time for a query would be O(m^2), if I am not miscalculating the merge step.

    If you need more details have a look at the paper by Clausen/Kurth from 2004 on algebraic databases ("Content-Based Information Retrieval by Group Theoretical Methods"). This sketches out a general database framework that can be adapted to your task.
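    As a rough illustration of the inverted-list idea, here is a small Python sketch. It is not taken from the cited paper: the function names `build_inverted_lists` and `shortest_window_for` are hypothetical, and the merge step is just one possible choice (a heap-based merge over the selected position lists), so its cost is not the O(m^2) estimated above but O(P log m), where P is the total number of positions in the lists used by the query.

```python
import heapq
from collections import defaultdict


def build_inverted_lists(S):
    """Map each value to the increasing list of positions where it occurs."""
    index = defaultdict(list)
    for pos, value in enumerate(S):
        index[value].append(pos)
    return index


def shortest_window_for(index, query):
    """Return inclusive positions (lo, hi) of a shortest window containing
    every value in `query`, or None if some value never occurs."""
    lists = []
    for value in set(query):
        if value not in index:
            return None
        lists.append(index[value])
    if not lists:
        return None
    # Heap entries: (position, which list, offset inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists)]
    heapq.heapify(heap)
    hi = max(lst[0] for lst in lists)
    best = None
    while True:
        lo, i, j = heapq.heappop(heap)   # smallest current position
        if best is None or hi - lo < best[1] - best[0]:
            best = (lo, hi)
        if j + 1 == len(lists[i]):       # this value has no later occurrence
            return best
        nxt = lists[i][j + 1]
        hi = max(hi, nxt)
        heapq.heappush(heap, (nxt, i, j + 1))
```

    The index is built once with `build_inverted_lists(S)` and can then be reused for any number of query sets.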

  • 2021-02-01 08:15

    This can be solved by dynamic programming.

    At each step k, we'll compute the shortest subsequence that ends at the k-th position of S and that satisfies the requirement of containing all the unique elements of S.

    Given the solution to step k (hereinafter "the sequence"), computing the solution to step k+1 is easy: append the (k+1)-th element of S to the sequence and then remove, one by one, all elements at the start of the sequence that are contained in the extended sequence more than once.

    The solution to the overall problem is the shortest sequence found in any of the steps.

    The initialization of the algorithm consists of two stages:

    1. Scan S once, building the alphabet of unique values.
    2. Find the shortest valid sequence whose first element is the first element of S; the last position of this sequence will be the initial value of k.

    All of the above can be done in O(n log n) worst-case time (let me know if this requires clarification).

    Here is a complete implementation of the above algorithm in Python:

    import collections
    
    S = [1,8,2,1,4,1,2,9,1,8,4,2,4]
    
    # initialization: stage 1
    alphabet = set(S)                         # the unique values ("symbols") in S
    count = collections.defaultdict(int)      # how many times each symbol appears in the sequence
    
    # initialization: stage 2
    start = 0
    for end in range(len(S)):
      count[S[end]] += 1
      if len(count) == len(alphabet):         # seen all the symbols yet?
        break
    end += 1
    while count[S[start]] > 1:                # trim leading symbols that recur later in the window
      count[S[start]] -= 1
      start += 1
    
    best_start = start
    best_end = end
    
    # the induction
    while end < len(S):
      count[S[end]] += 1
      while count[S[start]] > 1:
        count[S[start]] -= 1
        start += 1
      end += 1
      if end - start < best_end - best_start: # new shortest sequence?
        best_start = start
        best_end = end
    
    print(S[best_start:best_end])
    

    Notes:

    1. the data structures I use (dictionaries and sets) are based on hash tables; they have good average-case performance but can degrade to O(n) per operation in the worst case. If it's the worst case that you care about, replacing them with tree-based structures will give the overall O(n log n) I've promised above;
    2. as pointed out by @biziclop, the first scan of S can be eliminated, making the algorithm suitable for streaming data;
    3. if the elements of S are small non-negative integers, as your comments indicate, then count can be flattened out into an integer array, bringing the overall complexity down to O(n).
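    To illustrate note 2, here is one way the first scan might be dropped. This is a sketch under my own assumptions rather than the exact variant @biziclop had in mind: the function name `shortest_window_streaming` is made up, and the rule it relies on is that whenever a symbol appears for the first time, every earlier window is missing it, so the best answer is reset to the current window.

```python
from collections import defaultdict, deque


def shortest_window_streaming(stream):
    """Consume `stream` once and return half-open positions (start, end) of a
    shortest window containing every symbol seen, or None for empty input."""
    window = deque()            # the symbols currently inside the window
    count = defaultdict(int)    # occurrences of each symbol inside the window
    start = 0                   # absolute position of window[0]
    end = 0                     # absolute position just past the window
    best = None
    for x in stream:
        is_new = count[x] == 0  # first time this symbol shows up at all
        window.append(x)
        count[x] += 1
        end += 1
        # Drop leading symbols that occur again later in the window.
        while count[window[0]] > 1:
            count[window.popleft()] -= 1
            start += 1
        # A new symbol invalidates all earlier answers; otherwise keep the shorter.
        if is_new or end - start < best[1] - best[0]:
            best = (start, end)
    return best
```

    Only the current window is kept in memory, so the input can be any iterable that is read once, for example `shortest_window_streaming(iter(S))`.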