Longest Non-Overlapping Repeated Substring using Suffix Tree/Array (Algorithm Only)

遇见更好的自我 2021-02-13 16:30

I need to find the longest non-overlapping repeated substring in a String. I have the suffix tree and suffix array of the string available.

When overlapping is allowed,

8 Answers
  • 2021-02-13 17:04

    In a suffix tree, all suffixes sharing a prefix P are descendants of a common ancestor. If each node stores the minimum and maximum start index of the suffixes in its subtree, that node guarantees a repeated non-overlapping substring of length min(depth, max - min), where depth is the length of the common prefix and max - min is the distance between the two occurrences that are farthest apart. The answer is the largest such value over all nodes.

  • 2021-02-13 17:04

    The simplest solution is something of a brute-force attack. You already have an algorithm that finds the longest repeated substring when overlaps are allowed. Use it and check whether that answer overlaps itself; if it does, find the second longest, check it for overlaps, and so on. That reduces the problem to your existing search algorithm plus a regex count operation.
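
    A brute-force baseline in that spirit can be sketched as follows. This is only an illustration: it does not use the suffix tree or suffix array at all, it simply tries candidate lengths from longest to shortest, and the function name is made up for this example. It runs in roughly cubic time, so it is mainly useful for checking faster implementations on small inputs.

```python
def longest_nonoverlapping_repeat_naive(s: str) -> str:
    """Return a longest substring of s that occurs at least twice without overlap.

    Hypothetical helper for illustration; returns "" if no such substring exists.
    """
    n = len(s)
    # Two non-overlapping copies need 2 * length <= n, so start there and shrink.
    for length in range(n // 2, 0, -1):
        for start in range(n - 2 * length + 1):
            candidate = s[start:start + length]
            # Search for a second occurrence that begins after the first one ends.
            if s.find(candidate, start + length) != -1:
                return candidate
    return ""


print(longest_nonoverlapping_repeat_naive("banana"))  # prints an
```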

  • 2021-02-13 17:04

    Since I had a hard time finding a clear description of a working algorithm to obtain the longest non-overlapping repeated substrings using a suffix tree, I'd like to share the version I gathered from various sources.

    Algorithm

    1. Construct the suffix tree for the input string S (terminated with a special character that doesn't occur in S). Each leaf node corresponds to a suffix Si of S and is assigned the corresponding start position i of Si in S.
    2. Assign each node in the tree a pair (imin, imax) that indicates the minimum and maximum suffix indices at that subtree.
      1. For each leaf imin = imax = i.
      2. For each inner node imin is the minimum, imax is the maximum index of all descendant nodes' indices.
    3. For all inner nodes v, let Pv denote the string obtained by concatenating all edge labels (prefixes) on the path from the root to v. Collect all such Pv that satisfy imin + length(Pv) ≤ imax for the corresponding imin and imax at v.
    4. The longest of those Pv is the longest non-overlapping substring of S that occurs at least twice.
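
    The steps above can be sketched in Python. To keep the example short, it assumes a naive suffix trie (one node per character, quadratic time and space) in place of a linear-time suffix tree; the class and function names are made up for this illustration. Because every prefix of every suffix gets its own node in a trie, no terminator character is needed here, and the check imin + length(Pv) ≤ imax is applied at each node exactly as in step 3.

```python
class TrieNode:
    """One node of a naive suffix trie (illustrative, not a compressed suffix tree)."""

    def __init__(self, index: int):
        self.children = {}
        self.imin = index  # smallest suffix start index passing through this node
        self.imax = index  # largest suffix start index passing through this node


def longest_nonoverlapping_repeat_trie(s: str) -> str:
    # Step 1: insert every suffix of s, in increasing order of start index.
    root = TrieNode(0)
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            child = node.children.get(ch)
            if child is None:
                child = TrieNode(i)  # step 2: first visit sets imin = imax = i
                node.children[ch] = child
            else:
                child.imax = i       # step 2: later suffixes only raise imax
            node = child

    # Steps 3 and 4: keep the deepest node with imin + depth <= imax.
    best_start, best_len = 0, 0
    stack = [(root, 0)]              # explicit stack avoids deep recursion
    while stack:
        node, depth = stack.pop()
        if depth > best_len and node.imin + depth <= node.imax:
            best_start, best_len = node.imin, depth
        for child in node.children.values():
            stack.append((child, depth + 1))
    return s[best_start:best_start + best_len]


print(len(longest_nonoverlapping_repeat_trie("banana")))  # prints 2
```

    Which of several equally long answers is returned depends on the traversal order, so only the length is shown in the usage line.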

    Explanation

    If a substring of S occurs at least twice in S, it is the common prefix P of two suffixes Si and Sj, where i and j denote their respective start position in S. Hence, there exists an inner node v in the suffix tree for S that has two descendant leaves that correspond to i and j such that the concatenation of all edge labels of the path from the root to v is equal to P.

    The deepest such node v (in terms of the length of its corresponding prefix) marks the longest, possibly overlapping repeated substring in S. To make sure no overlapping substrings are considered, we have to make sure that P is no longer than the distance between i and j.

    We therefore calculate the minimum and maximum indices imin and imax for each node, which correspond to the positions of the leftmost and rightmost suffixes of S that share a common prefix. The minimum and maximum indices at a node can easily be obtained from the values of its descendants. (The index calculation would be more complicated if we were looking for the longest substrings that occur at least k times, because then the distances between all descendants' indices would have to be considered, not just the two that are farthest apart.) By considering only prefixes P that satisfy imin + length(P) ≤ imax, we make sure the P starting at Si is short enough not to overlap with the suffix Sj.

    Additional notes

    • A suffix tree for a string of length n can be constructed in Θ(n) time and space. The modifications for this algorithm don't worsen the asymptotic behavior, so the overall running time is still in Θ(n).
    • This algorithm doesn't find all possible solutions. If there are several non-overlapping longest substrings, only the substring starting at the largest position is found.
    • It should be possible to modify the algorithm to also count the number of repetitions of the longest non-overlapping substring, or to find only solutions with at least k repetitions. For this, not only the minimum and maximum indices have to be considered, but the indices of all subtrees at a node. The above range condition then has to hold for each pair of adjacent indices.
  • 2021-02-13 17:20

    This could be solved using results given in "Computing Longest Previous non-overlapping Factors" (see http://dx.doi.org/10.1016/j.ipl.2010.12.005 )

  • 2021-02-13 17:24

    Unfortunately, the solution proposed by Perkins will not work. We can't brute-force our way through the candidates to find the longest repeated non-overlapping substring. Consider the suffix tree for banana: http://en.wikipedia.org/wiki/Suffix_tree. The "NA" branching node with "A" as its parent will be considered first, since it is the branching node with the greatest string length. But its constructed string "ANA" overlaps itself, so it will be rejected. The next node to consider will be "NA", which yields a non-overlapping length of 2, but the substring "AN" will never be considered, since it was represented only within the "ANA" string that was already rejected. So if you're searching for all repeated non-overlapping substrings, or you want the alphabetically first one when there's a tie, you're out of luck.

    Apparently there is an approach involving suffix trees that works, but the simpler approach is laid out here: http://rubyquiz.com/quiz153.html

    Hope this helps!

  • 2021-02-13 17:24

    We use the longest common prefix (LCP) array and suffix array to solve this problem in O(n log n) time.

    The LCP array gives us the longest common prefix between two consecutive suffixes in the suffix array.

    After constructing the LCP array and the suffix array, we can binary search for the length of the answer.

    Suppose the string is "acaca$". Its suffix array and LCP array are shown in the table below.

    <table border="1">
    <tr><th>Suffix array entry (start index)</th><th>LCP with previous suffix</th><th>Suffix (implicit)</th></tr>
    <tr><td>5</td><td>-1</td><td>$</td></tr>
    <tr><td>4</td><td>0</td><td>a$</td></tr>
    <tr><td>2</td><td>1</td><td>aca$</td></tr>
    <tr><td>0</td><td>3</td><td>acaca$</td></tr>
    <tr><td>3</td><td>0</td><td>ca$</td></tr>
    <tr><td>1</td><td>2</td><td>caca$</td></tr>
    </table>
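
    The table can be reproduced with a few lines of Python. This is a minimal sketch that sorts the suffixes directly and compares neighbours character by character, which is fine for a small example but is not the efficient construction the O(n log n) bound relies on; the function names are made up for this illustration.

```python
def suffix_array_naive(s: str) -> list:
    """Start indices of all suffixes of s, in lexicographic order of the suffixes."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array_naive(s: str, sa: list) -> list:
    """lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k]; lcp[0] = -1."""
    lcp = [-1] if sa else []
    for prev, cur in zip(sa, sa[1:]):
        length = 0
        while (prev + length < len(s) and cur + length < len(s)
               and s[prev + length] == s[cur + length]):
            length += 1
        lcp.append(length)
    return lcp


text = "acaca$"
sa = suffix_array_naive(text)
lcp = lcp_array_naive(text, sa)
for start, common in zip(sa, lcp):
    print(start, common, text[start:])
# prints:
# 5 -1 $
# 4 0 a$
# 2 1 aca$
# 0 3 acaca$
# 3 0 ca$
# 1 2 caca$
```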

    Let's binary search for the length of the answer.

    Suppose an answer of a certain length exists. Its two occurrences are prefixes of two suffixes.

    There is no guarantee that these suffixes are consecutive in the suffix array. However, if we know the length of the substring, we can see that every entry in the LCP table between the two suffixes is at least that number. Also, the difference between the start indices of the two suffixes must be at least that number, otherwise the occurrences would overlap.

    Guessing that the length of the substring is a certain amount, we can consider consecutive runs of LCP array entries which are at least that amount. In each consecutive run, find the suffix with the largest and smallest index.

    How do we know our guess is a lower bound?

    If, in some run of consecutive LCP array entries that are all at least our guess, the distance between the largest and smallest suffix index is at least our guess, then our guess is attainable and therefore a lower bound on the answer.

    How do we know our guess is too big?

    If, in every run of consecutive LCP array entries that are all at least our guess, the distance between the largest and smallest suffix index is smaller than our guess, then our guess is too big.

    How do we find the answer given the length of the answer?

    For each run of consecutive LCP array entries that are at least the answer, find the lowest and highest suffix indices. If they differ by at least the answer, then the longest non-overlapping repeated substring begins at both of these indices.

    In your example, "acaca$", we can find that the length of the answer is 2.

    The qualifying runs are:

    "aca$", "acaca$" (indices 2 and 0): the distance between the lower and higher indices is 2, resulting in the repeated substring "ac".

    "ca$", "caca$" (indices 3 and 1): the distance between the lower and higher indices is 2, resulting in the repeated substring "ca".

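    The binary search over runs of LCP entries can be sketched as below. It is a sketch under assumptions: the suffix array and LCP array are built naively here to keep the block self-contained (so this version does not reach O(n log n)), and all function names are made up for this illustration. The search relies on the fact that if a length is attainable, every shorter length is attainable too.

```python
def build_sa_lcp(s: str):
    """Naive suffix array and LCP array (lcp[0] = -1); for illustration only."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    lcp = [-1] if sa else []
    for prev, cur in zip(sa, sa[1:]):
        k = 0
        while prev + k < len(s) and cur + k < len(s) and s[prev + k] == s[cur + k]:
            k += 1
        lcp.append(k)
    return sa, lcp


def find_pair(length: int, sa: list, lcp: list):
    """Return (low, high) start indices at least `length` apart, or None."""
    low = high = 0
    for k, start in enumerate(sa):
        if k == 0 or lcp[k] < length:
            low = high = start            # a new run begins at this suffix
        else:
            low, high = min(low, start), max(high, start)
            if high - low >= length:      # far enough apart: no overlap
                return low, high
    return None


def longest_nonoverlapping_repeat(s: str) -> str:
    sa, lcp = build_sa_lcp(s)
    lo, hi = 0, len(s) // 2               # length 0 is always attainable
    best = None
    while lo < hi:
        mid = (lo + hi + 1) // 2
        pair = find_pair(mid, sa, lcp)
        if pair is not None:
            lo, best = mid, pair          # guess attainable: raise the lower bound
        else:
            hi = mid - 1                  # guess too big
    return s[best[0]:best[0] + lo] if best else ""


print(longest_nonoverlapping_repeat("acaca$"))  # prints ac
```

    Swapping build_sa_lcp for an efficient suffix array and LCP construction leaves the rest unchanged: each feasibility check is a single linear pass, and the binary search makes O(log n) such passes.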