What is the complexity of the algorithm that is used to find the smallest snippet that contains all the search keywords?
Here's a solution using Java 8.
import java.util.AbstractMap;
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

static Map.Entry<Integer, Integer> documentSearch(Collection<String> document, Collection<String> query) {
    Queue<KeywordIndexPair> queue = new ArrayDeque<>(query.size());
    HashSet<String> words = new HashSet<>(query);
    AtomicInteger idx = new AtomicInteger();
    IndexPair interval = new IndexPair(0, Integer.MAX_VALUE);
    AtomicInteger size = new AtomicInteger();
    document.stream()
        .map(w -> new KeywordIndexPair(w, idx.getAndIncrement()))
        .filter(pair -> words.contains(pair.word)) // Queue.contains is O(n) so we trade space for efficiency
        .forEach(pair -> {
            // only the first and last elements are useful to the algorithm, so we don't bother removing
            // an element from any other index. note that removing an element using equality
            // from an ArrayDeque is O(n)
            KeywordIndexPair first = queue.peek();
            if (pair.equals(first)) {
                queue.remove();
            }
            queue.add(pair);
            first = queue.peek();
            int diff = pair.index - first.index;
            if (size.incrementAndGet() == words.size() && diff < interval.interval()) {
                interval.begin = first.index;
                interval.end = pair.index;
                size.set(0);
            }
        });
    return new AbstractMap.SimpleImmutableEntry<>(interval.begin, interval.end);
}
There are 2 static nested classes, KeywordIndexPair and IndexPair, the implementation of which should be apparent from the names. Using a smarter programming language that supports tuples, those classes wouldn't be necessary.
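Since the two helper classes are not shown, here is a minimal sketch of what they might look like. Everything in it is an assumption inferred from how the method uses them: the wrapper class name SnippetTypes is hypothetical, the field names come from the accesses pair.word, pair.index, interval.begin and interval.end, and equality on the word alone is a guess that makes pair.equals(first) meaningful.

```java
// Hypothetical holder class; the answer only says the two types are static nested classes.
class SnippetTypes {

    /** A keyword together with the position at which it occurs in the document. */
    static final class KeywordIndexPair {
        final String word;
        final int index;

        KeywordIndexPair(String word, int index) {
            this.word = word;
            this.index = index;
        }

        // Assumption: equality looks at the word only, so pair.equals(first) is true
        // when the head of the queue holds the same keyword at an older position.
        @Override
        public boolean equals(Object o) {
            return o instanceof KeywordIndexPair
                    && word.equals(((KeywordIndexPair) o).word);
        }

        @Override
        public int hashCode() {
            return word.hashCode();
        }
    }

    /** A mutable pair of document positions, begin and end. */
    static final class IndexPair {
        int begin;
        int end;

        IndexPair(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        /** Distance between the two positions. */
        int interval() {
            return end - begin;
        }
    }
}
```

With these definitions the initial IndexPair(0, Integer.MAX_VALUE) has the largest possible interval, so the first complete match always replaces it.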
Test:
Document: apple, banana, apple, apple, dog, cat, apple, dog, banana, apple, cat, dog
Query: banana, cat
Interval: 8, 10
For all the words, maintain the min and max index in case there is more than one entry; if there is only one, the min and max index will be the same.
import edu.princeton.cs.algs4.ST;

public class DicMN {
    ST<String, Words> st = new ST<>();

    public class Words {
        int min;
        int max;

        public Words(int index) {
            min = index;
            max = index;
        }
    }

    public int findMinInterval(String[] sw) {
        int begin = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        for (int i = 0; i < sw.length; i++) {
            if (st.contains(sw[i])) {
                Words w = st.get(sw[i]);
                begin = Math.min(begin, w.min);
                end = Math.max(end, w.max);
            }
        }
        if (begin != Integer.MAX_VALUE) {
            return (end - begin) + 1;
        }
        return 0;
    }

    public void put(String[] dw) {
        for (int i = 0; i < dw.length; i++) {
            if (!st.contains(dw[i])) {
                st.put(dw[i], new Words(i));
            }
            else {
                Words w = st.get(dw[i]);
                w.min = Math.min(w.min, i);
                w.max = Math.max(w.max, i);
            }
        }
    }

    public static void main(String[] args) {
        DicMN dic = new DicMN();
        String[] arr1 = { "one", "two", "three", "four", "five", "six", "seven", "eight" };
        dic.put(arr1);
        String[] arr2 = { "two", "five" };
        System.out.print("Interval:" + dic.findMinInterval(arr2));
    }
}
As stated, the problem is solved by a rather simple algorithm:
Just look through the input text sequentially from the very beginning and check each word: whether it is in the search key or not. If the word is in the key, add it to the end of the structure that we will call The Current Block. The Current Block is just a linear sequence of words, each word accompanied by a position at which it was found in the text. The Current Block must maintain the following Property: the very first word in The Current Block must be present in The Current Block once and only once. If you add the new word to the end of The Current Block, and the above property becomes violated, you have to remove the very first word from the block. This process is called normalization of The Current Block. Normalization is a potentially iterative process, since once you remove the very first word from the block, the new first word might also violate The Property, so you'll have to remove it as well. And so on.
So, basically The Current Block is a FIFO sequence: the new words arrive at the right end, and get removed by normalization process from the left end.
All you have to do to solve the problem is look through the text, maintain The Current Block, normalizing it when necessary so that it satisfies The Property. The shortest block with all the keywords in it you ever build is the answer to the problem.
For example, consider the text
CxxxAxxxBxxAxxCxBAxxxC
with keywords A, B and C. Looking through the text you'll build the following sequence of blocks
C
CA
CAB - all words, length 9 (CxxxAxxxB...)
CABA - all words, length 12 (CxxxAxxxBxxA...)
CABAC - violates The Property, remove first C
ABAC - violates The Property, remove first A
BAC - all words, length 7 (...BxxAxxC...)
BACB - violates The Property, remove first B
ACB - all words, length 6 (...AxxCxB...)
ACBA - violates The Property, remove first A
CBA - all words, length 4 (...CxBA...)
CBAC - violates The Property, remove first C
BAC - all words, length 6 (...BAxxxC)
The best block we built has length 4, which is the answer in this case
CxxxAxxxBxxAxx CxBA xxxC
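The procedure traced above can be sketched in Java roughly as follows. This is only one possible reading of the description, not a definitive implementation: the class and method names are made up for illustration, and the per-keyword occurrence count is an added assumption used to test The Property without rescanning the block.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CurrentBlock {

    /**
     * Returns {begin, end}, the 0-based inclusive positions of the shortest span
     * of text containing every keyword, or null if some keyword never occurs.
     */
    static int[] shortestBlock(List<String> text, Set<String> keywords) {
        Deque<Integer> block = new ArrayDeque<>();     // positions of keyword hits, oldest first
        Map<String, Integer> count = new HashMap<>();  // occurrences of each keyword inside the block
        int[] best = null;
        for (int pos = 0; pos < text.size(); pos++) {
            String word = text.get(pos);
            if (!keywords.contains(word)) {
                continue;                               // not a keyword, skip it
            }
            block.addLast(pos);
            count.merge(word, 1, Integer::sum);
            // Normalization: drop the first word while it occurs more than once in the block.
            while (count.get(text.get(block.peekFirst())) > 1) {
                count.merge(text.get(block.pollFirst()), -1, Integer::sum);
            }
            // Counts never fall to zero, so count.size() is the number of distinct keywords seen.
            if (count.size() == keywords.size()) {
                int begin = block.peekFirst();
                if (best == null || pos - begin < best[1] - best[0]) {
                    best = new int[] {begin, pos};
                }
            }
        }
        return best;
    }
}
```

For the example text above with keywords A, B and C, split into one-character words, this returns positions 14 and 17 (0-based), which is the length-4 block CxBA.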
The exact complexity of this algorithm depends on the input, since it dictates how many iterations the normalization process will make, but ignoring the normalization the complexity would trivially be O(N * log M), where N is the number of words in the text, M is the number of keywords, and O(log M) is the complexity of checking whether the current word belongs to the keyword set.
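One way to bound the cost of normalization, under an assumption the answer does not state: suppose the block keeps an occurrence count per keyword, so that checking The Property costs a single O(log M) lookup. Each keyword occurrence enters The Current Block once and leaves it at most once, so the number of removals over the whole run cannot exceed N. A rough accounting would then be:

```latex
\begin{aligned}
T(N, M) &\le \underbrace{N \cdot O(\log M)}_{\text{membership checks}}
          + \underbrace{N \cdot O(\log M)}_{\text{insertions}}
          + \underbrace{N \cdot O(\log M)}_{\text{removals during normalization}} \\
        &= O(N \log M)
\end{aligned}
```

If the keyword set and the counts are stored in hash tables instead, each step takes expected O(1) time and the total becomes expected O(N).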
Now, having said that, I have to admit that I suspect that this might not be what you need. Since you mentioned Google in the caption, it might be that the statement of the problem you gave in your post is not complete. Maybe in your case the text is indexed? (With indexing the above algorithm is still applicable, just becomes more efficient). Maybe there's some tricky database that describes the text and allows for a more efficient solution (like without looking through the entire text)? I can only guess and you are not saying...
This is an interesting question. To restate it more formally: Given a list L (the web page) of length n and a set S (the query) of size k, find the smallest sublist of L that contains all the elements of S.
I'll start with a brute-force solution in hopes of inspiring others to beat it. Note that set membership can be done in constant time, after one pass through the set. See this question. Also note that this assumes all the elements of S are in fact in L, otherwise it will just return the sublist from 1 to n.
best = (1,n)
For i from 1 to n-k+1:
    Create/reset a hash found[] mapping each element of S to False, and let counter = 0.
    For j from i to n or until counter == k:
        If L[j] is in S and not found[L[j]] then counter++ and let found[L[j]] = True;
    If counter == k and j-i < best[2]-best[1] then let best = (i,j), with j the position at which counter reached k.
Time complexity is O((n+k)(n-k)), i.e., n^2-ish.
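A minimal Java sketch of this brute-force search, for illustration only. The class and method names are hypothetical, indices are 0-based rather than 1-based, a HashSet stands in for the found[] hash, and, unlike the pseudocode, it returns null instead of the whole list when some element of S is missing from L.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BruteForceSnippet {

    /**
     * Returns {i, j}, the 0-based inclusive bounds of the shortest sublist of L
     * that contains every element of S, or null if no such sublist exists.
     */
    static int[] smallestSublist(List<String> L, Set<String> S) {
        int n = L.size();
        int[] best = null;
        for (int i = 0; i < n; i++) {
            Set<String> found = new HashSet<>();   // reset for every start position
            for (int j = i; j < n; j++) {
                String w = L.get(j);
                if (S.contains(w)) {
                    found.add(w);                  // constant expected time per lookup
                }
                if (found.size() == S.size()) {
                    // First j that completes the set is the shortest window starting at i.
                    if (best == null || j - i < best[1] - best[0]) {
                        best = new int[] {i, j};
                    }
                    break;
                }
            }
        }
        return best;
    }
}
```

The inner loop stops at the first j that completes the set because extending further can only make the window starting at i longer.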
I think the solution proposed by AndreyT assumes no duplicates exist in the keywords/search terms. Also, the current block can get as big as the text itself if the text contains a lot of duplicate keywords. For example: Text: 'ABBBBBBBBBB' Keyword text: 'AB' Current Block: 'ABBBBBBBBBB'
Anyway, I have implemented it in C# and did some basic testing; it would be nice to get some feedback on whether it works or not :)
using System;
using System.Collections.Generic;
using System.Linq;

static string FindMinWindow(string text, string searchTerms)
{
    // Maps each search character to whether it has been seen in the text yet.
    Dictionary<char, bool> searchIndex = new Dictionary<char, bool>();
    foreach (var item in searchTerms)
    {
        // Indexer assignment instead of Add, so duplicate search terms do not throw.
        searchIndex[item] = false;
    }
    Queue<Tuple<char, int>> currentBlock = new Queue<Tuple<char, int>>();
    int noOfMatches = 0;
    int minLength = Int32.MaxValue;
    int startIndex = 0;
    for (int i = 0; i < text.Length; i++)
    {
        char item = text[i];
        if (searchIndex.ContainsKey(item))
        {
            if (!searchIndex[item])
            {
                noOfMatches++;
            }
            searchIndex[item] = true;
            var newEntry = new Tuple<char, int>(item, i);
            currentBlock.Enqueue(newEntry);
            // Normalization step.
            while (currentBlock.Count(o => o.Item1.Equals(currentBlock.First().Item1)) > 1)
            {
                currentBlock.Dequeue();
            }
            // Figuring out minimum length. Compare against the number of distinct terms.
            if (noOfMatches == searchIndex.Count)
            {
                var length = currentBlock.Last().Item2 - currentBlock.First().Item2 + 1;
                if (length < minLength)
                {
                    startIndex = currentBlock.First().Item2;
                    minLength = length;
                }
            }
        }
    }
    return noOfMatches == searchIndex.Count ? text.Substring(startIndex, minLength) : String.Empty;
}