I came across this question on an interview questions thread. Here is the question:
Given two integer arrays A [1..n] and B[1..m], find the smallest<
The following is provably optimal up to a logarithmic factor. (I believe the log factor cannot be removed, in which case it is optimal.)
Variant 1 is just a special case of variant 2 with all the multiplicities being 1, after removing duplicates from B. So it's enough to handle the latter variant; if you want variant 1, just remove duplicates in O(m log m) time. In the following, let m denote the number of distinct elements in B. We assume m ≤ n, because if m > n no window can exist and we can just return -1 in constant time.
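The preprocessing above can be sketched as follows. This is a minimal illustration, not part of the original answer: the function names `required_counts` and `trivially_impossible` are made up for this example, and it uses a hash-based `Counter` (expected O(m) time) rather than the sort-based O(m log m) deduplication mentioned above.

```python
from collections import Counter

def required_counts(B, with_multiplicity=True):
    """Map each distinct element of B to the count a window must contain.

    Hypothetical helper: with_multiplicity=False corresponds to variant 1,
    with_multiplicity=True to variant 2.
    """
    counts = dict(Counter(B))
    if not with_multiplicity:
        # Variant 1: duplicates in B are ignored, so every requirement is 1.
        counts = {x: 1 for x in counts}
    return counts

def trivially_impossible(A, need):
    # A window of A cannot hold more elements than A itself has,
    # so the answer is "no window" without scanning A.
    return sum(need.values()) > len(A)
```

For example, `required_counts([2, 2, 3])` gives `{2: 2, 3: 1}`, while `required_counts([2, 2, 3], with_multiplicity=False)` gives `{2: 1, 3: 1}`.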
For each index i in A, we will find the smallest index s[i] such that A[i..s[i]] contains B[1..m], with the right multiplicities. The crucial observation is that s[i] is non-decreasing (any window starting at i+1 that contains B can be extended one step to the left and still contains B, so s[i] ≤ s[i+1]), and this is what allows us to do it in amortised linear time.
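To see the monotonicity concretely, here is a deliberately slow brute-force sketch that computes every s[i] directly from the definition. It is only meant for checking small inputs; the name `brute_force_s` is made up for this example, indices are 0-based (unlike the 1-based text), and `None` marks positions from which no window exists.

```python
from collections import Counter

def brute_force_s(A, B):
    """s[i] = smallest j such that A[i..j] covers B with multiplicity, else None."""
    need = Counter(B)
    n = len(A)
    s = []
    for i in range(n):
        have = Counter()
        end = None
        for j in range(i, n):
            have[A[j]] += 1
            # The window covers B once every required count is reached.
            if all(have[x] >= k for x, k in need.items()):
                end = j
                break
        s.append(end)
    return s

s = brute_force_s([1, 2, 1, 3, 2, 3, 1], [1, 2, 3])
defined = [j for j in s if j is not None]
# The defined values never decrease as i grows.
assert defined == sorted(defined)
```

Once some s[i] is undefined, it stays undefined for every later i, which is why the faster method can stop as soon as the right end runs off the array.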
Start with i=j=1. We will keep a tuple (c[1], c[2], ..., c[m]) of the number of times each element of B occurs in the current window A[i..j]. We will also keep a set S of indices (a subset of 1..m) for which the count is "right", meaning the window holds enough copies (i.e., k for which c[k] ≥ 1 in variant 1, or c[k] is at least the required multiplicity of the k-th element in variant 2).
So, for i=1, starting with j=1, increment c[A[j]] (if A[j] is an element of B; here c[A[j]] denotes the count c[k] for the index k of A[j] among the distinct elements of B), check whether c[A[j]] is now "right", and add k to S or remove it from S accordingly. Stop when S has size m. You've now found s[1], in at most O(n log m) time. (There are O(n) values of j, and each set operation takes O(log m) time.)
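Now for computing successive s[i]s, do the following. Decrement c[A[i]] (the element leaving the window), increment i, update S accordingly, and, if necessary, increment j (updating the counts and S as before) until S has size m again. That gives you s[i] for each i; if j reaches n before S has size m again, no window starts at this i or at any later one, and you can stop. At the end, report the (i, s[i]) for which s[i]-i was smallest.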
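The two-pointer procedure described above might look like this in Python. Treat it as a sketch under a few assumptions rather than a definitive implementation: the name `smallest_window` is made up, indices are 0-based and inclusive, B is assumed non-empty, the set S stores the elements themselves rather than their indices in B, and an element counts as "right" when the window holds at least the required number of copies. Python's `dict` and `set` are hash-based, so each operation is expected O(1) rather than the O(log m) of a balanced tree.

```python
from collections import Counter

def smallest_window(A, B):
    """Return (i, j) of a shortest window A[i..j] covering B, or None."""
    need = Counter(B)        # required count per distinct element of B
    if not need:
        return None          # assumption: an empty B is not a valid query
    have = Counter()         # the counts c[...] for the current window
    satisfied = set()        # the set S of elements whose count is "right"
    best = None
    n = len(A)
    j = -1                   # the window A[i..j] starts out empty
    for i in range(n):
        # Move the right end forward until the window covers B.
        while len(satisfied) < len(need) and j + 1 < n:
            j += 1
            x = A[j]
            if x in need:
                have[x] += 1
                if have[x] >= need[x]:
                    satisfied.add(x)
        if len(satisfied) < len(need):
            break            # j hit the end: no window starts at i or later
        # Here j equals s[i]; keep the shortest window seen so far.
        if best is None or j - i < best[1] - best[0]:
            best = (i, j)
        # Drop A[i] before the left end moves to i + 1.
        x = A[i]
        if x in need:
            have[x] -= 1
            if have[x] < need[x]:
                satisfied.discard(x)
    return best
```

For instance, `smallest_window([1, 2, 1, 3, 2, 3, 1], [1, 2, 3])` returns `(1, 3)`, the window `[2, 1, 3]`. Ties are broken in favour of the leftmost window because a later window only replaces `best` when it is strictly shorter.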
Note that although it seems that you might be performing up to O(n) steps (incrementing j) for each i, the second pointer j only moves to the right, so the total number of times you can increment j is at most n. (This is amortised analysis.) Each time you increment i or j, you might perform a set operation that takes O(log m) time, so the total time is O(n log m). The space required is for the tuple of counts, the set of elements of B, the set S, and a constant number of other variables, so O(m) in all.
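Written out as a sum, the amortised bound is just the two pointers' total movement times the cost of one set operation (T(n, m) is notation introduced here for the running time, not something from the original answer):

```latex
\[
T(n, m) \;\le\;
\underbrace{n}_{\text{increments of } i} \cdot O(\log m)
\;+\;
\underbrace{n}_{\text{increments of } j} \cdot O(\log m)
\;=\; O(n \log m).
\]
```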
There is an obvious Ω(m+n) lower bound, because you need to examine all the elements. So the only question is whether we can prove the log factor is necessary; I believe it is.