Given a sequence such as S = {1,8,2,1,4,1,2,9,1,8,4}, I need to find the minimal-length subsequence that contains all element of S (no duplicates, order does n
This can be solved by dynamic programming.
At each step k
, we'll compute the shortest subsequence that ends at the k
-th position of S
and that satisfies the requirement of containing all the unique elements of S
.
Given the solution to step k
(hereinafter "the sequence"), computing the solution to step k+1
is easy: append the (k+1)
-th element of S to the sequence and then remove, one by one, all elements at the start of the sequence that are contained in the extended sequence more than once.
The solution to the overall problem is the shortest sequence found in any of the steps.
The initialization of the algorithm consists of two stages:
S
once, building the alphabet of unique values.S
; the last position of this sequence will be the initial value of k
.All of the above can be done in O(n logn)
worst-case time (let me know if this requires clarification).
Here is a complete implementation of the above algorithm in Python:
import collections
S = [1,8,2,1,4,1,2,9,1,8,4,2,4]
# initialization: stage 1
alphabet = set(S) # the unique values ("symbols") in S
count = collections.defaultdict(int) # how many times each symbol appears in the sequence
# initialization: stage 2
start = 0
for end in xrange(len(S)):
count[S[end]] += 1
if len(count) == len(alphabet): # seen all the symbols yet?
break
end += 1
best_start = start
best_end = end
# the induction
while end < len(S):
count[S[end]] += 1
while count[S[start]] > 1:
count[S[start]] -= 1
start += 1
end += 1
if end - start < best_end - best_start: # new shortest sequence?
best_start = start
best_end = end
print S[best_start:best_end]
Notes:
O(n)
in the worst case. If it's the worst case that you care about, replacing them with tree-based structures will give the overall O(n logn)
I've promised above;S
can be eliminated, making the algorithm suitable for streaming data;S
are small non-negative integers, as your comments indicate, then count
can be flattened out into an integer array, bringing the overall complexity down to O(n)
.