Algorithm to find the smallest snippet from searching a document?

前端未结

关注

 7  1434

太阳男子 2021-01-30 07:57

I\'ve been going through Skiena\'s excellent \"The Algorithm Design Manual\" and got hung up on one of the exercises.

The question is: \"Given a search string of three w

7条回答

被撕碎了的回忆 (楼主)

2021-01-30 08:07
From the question, it seems that you're given the index locations for each of your n “search words” (word1, word2, word3, ..., word n) in the document. Using a sorting algorithm, the n independent arrays associated with search words can readily be represented as a single array of all the index locations in ascending numerical order and a word label associated with each index in the array (the index array).

The Basic Algorithm:

(Designed to work whether or not the poster of this question intended to allow two different search words to coexist at the same index number.)

First, we define a simple function for measuring the length of a snippet that contains all n labels given a starting point in the index array. (It is obvious from the definition of our array that any starting point on the array will necessarily be the indexed location of one of the n search labels.) The function simply keeps track of the unique search labels seen as the function iterates through the elements in the array until all n labels have been observed. The length of the snippet is defined as the difference between the index of the last unique label found and the index of the starting point in the index array (the first unique label found). If all n labels aren't observed before the end of the array the function returns a null value.

Now, the snippet length function can be run for each element in your array to associate a snippet size containing all n search words starting from each element in the array. The smallest non-Null value returned by the snippet length function over the whole index array is the snippet in your document that you're looking for.

Necessary Optimizations:
1. Keep track of the value of the current shortest snippet length so that the value will be know immediately after iterating once through the index array.
2. When iterating through your array terminate the snippet length function if the current snippet under inspection ever surpasses the length of the shortest snippet length previously seen.
3. When the snippet length function returns null for not locating all n search words in the remaining index array elements, associate a null snippet length to all successive elements in the index array.
4. If the snippet length function is applied to a word label and the label immediately following it is identical to the starting label, assign a null value to the starting label and move on to the next label.
Computational Complexity:

Obviously the sorting part of the algorithm can be arranged in O(n log n).

Here's how I would work out the time complexity of the second part of the algorithm (any critiques and corrections would be greatly appreciated).

In the best case scenario, the algorithm only applies the snippet length function to the first element in the index array and finds that no snippet containing all the search words exists. This scenario would be computed in just n calculations where n is the size of the index array. Slightly worse than that is if the smallest snippet turns out to be equal to the size of the whole array. In this case the computational complexity will be a little less than 2 n (once through the array to find the smallest snippet length, a second time to demonstrate that no other snippets exist). The shorter the average computed snippet length, the more times the snippet length function will need to be applied over the index array. We can assume that our worse case scenario will be the case where the snippet length function needs to be applied to every element in the index array. To develop a case where the function will be applied to every element in the index array we need to design an index array where the average snippet length over the whole index array is negligible in comparison to the size of the index array as a whole. Using this case we can write out our computational complexity as O(C n) where C is some constant that is significantly smaller then n. Giving a final computational complexity of:

O(n log n + C n)

Where:

C << n

Edit:

AndreyT correctly points out that instead of sorting the word indicies in n log n time, one might just as well merge them (since the sub arrays are already sorted) in n log m time where m is the amount of search word arrays to be merged. This will obviously speed up the algorithm is cases where m < n.
0 讨论(0)

查看其它7个回答
发布评论:

提交评论
- 加载中...