Longest Non-Overlapping Repeated Substring using Suffix Tree/Array (Algorithm Only)

遇见更好的自我 2021-02-13 16:30

I need to find the longest non-overlapping repeated substring in a String. I have the suffix tree and suffix array of the string available.

When overlapping is allowed,

8 answers
  •  灰色年华
    2021-02-13 17:04

    Since I had a hard time finding a clear description of a working algorithm to obtain the longest non-overlapping repeated substrings using a suffix tree, I'd like to share the version I gathered from various sources.

    Algorithm

    1. Construct the suffix tree for the input string S (terminated with a special character that doesn't occur in S). Each leaf node corresponds to a suffix Si of S and is assigned the corresponding start position i of Si in S.
    2. Assign each node in the tree a pair (imin, imax) that indicates the minimum and maximum suffix indices at that subtree.
      1. For each leaf imin = imax = i.
      2. For each inner node imin is the minimum, imax is the maximum index of all descendant nodes' indices.
    3. For all inner nodes v, let Pv denote the string obtained by concatenating all edge labels (prefixes) on the path from the root to v. Collect all such Pv that satisfy imin + length(Pv) ≤ imax for the corresponding imin and imax at v.
    4. The longest of those Pv is the longest non-overlapping substring of S that occurs at least twice.
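The four steps above can be sketched in Python as follows. This is only an illustration under some assumptions: the tree is built by naive suffix insertion in O(n²) time rather than by a linear-time construction, the terminator `"\0"` is assumed not to occur in the input, and the names `Node`, `build_suffix_tree` and `longest_nonoverlapping_repeat` are hypothetical helpers, not part of any library.

```python
class Node:
    def __init__(self):
        # first character of edge label -> (start, end, child);
        # the edge label is s[start:end]
        self.children = {}
        self.imin = None
        self.imax = None


def build_suffix_tree(s):
    """Step 1: naive O(n^2) suffix tree; s must end with a unique terminator."""
    root, n = Node(), len(s)
    for i in range(n):
        node, pos = root, i
        while True:
            c = s[pos]
            if c not in node.children:
                leaf = Node()
                leaf.imin = leaf.imax = i  # step 2.1: leaf stores its suffix index
                node.children[c] = (pos, n, leaf)
                break
            start, end, child = node.children[c]
            k = 0
            while start + k < end and pos + k < n and s[start + k] == s[pos + k]:
                k += 1
            if start + k == end:
                # whole edge matched, continue below the child
                node, pos = child, pos + k
                continue
            # mismatch inside the edge: split it with a new inner node
            mid = Node()
            mid.children[s[start + k]] = (start + k, end, child)
            node.children[c] = (start, start + k, mid)
            node, pos = mid, pos + k
    return root


def longest_nonoverlapping_repeat(text, terminator="\0"):
    s = text + terminator  # assumption: terminator does not occur in text
    root = build_suffix_tree(s)
    best = ""

    def visit(node, depth):
        # depth is length(Pv); recursion is fine for short inputs only
        nonlocal best
        if not node.children:
            return node.imin, node.imax
        lo, hi = len(s), -1
        for start, end, child in node.children.values():
            c_lo, c_hi = visit(child, depth + (end - start))
            lo, hi = min(lo, c_lo), max(hi, c_hi)
        node.imin, node.imax = lo, hi  # step 2.2
        # step 3 and 4: keep the longest Pv with imin + length(Pv) <= imax
        if lo + depth <= hi and depth > len(best):
            best = s[lo:lo + depth]
        return lo, hi

    visit(root, 0)
    return best
```

For example, `longest_nonoverlapping_repeat("abcabc")` yields `"abc"`, and for `"aaaa"` the node for `"aaa"` is rejected (0 + 3 > 1) while the node for `"aa"` passes (0 + 2 ≤ 2).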

    Explanation

    If a substring of S occurs at least twice in S, it is the common prefix P of two suffixes Si and Sj, where i and j denote their respective start position in S. Hence, there exists an inner node v in the suffix tree for S that has two descendant leaves that correspond to i and j such that the concatenation of all edge labels of the path from the root to v is equal to P.

    The deepest such node v (in terms of the length of its corresponding prefix) marks the longest, possibly overlapping repeated substring in S. To make sure no overlapping substrings are considered, we have to make sure that P is no longer than the distance between i and j.

    We therefore calculate the minimum and the maximum indices imin and imax for each node, which correspond to the positions of the leftmost and the rightmost suffixes of S that share a common prefix. The minimum and maximum indices at a node can be easily obtained from the values of its descendants. (The index calculation would be more complicated if we were looking for the longest substrings that occur at least k times, because then the distances between all descendants' indices would have to be considered, not just the two that are farthest apart.) By considering only prefixes P that satisfy imin + length(P) ≤ imax, we make sure the P starting at Si is short enough not to overlap with the suffix Sj.
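The overlap condition can also be checked directly on pairs of suffixes, without any tree. The brute-force sketch below is not the algorithm described here, just a slow reference (roughly cubic time) that may help when testing a suffix-tree implementation on small inputs. The function name `brute_force_longest` is hypothetical.

```python
def brute_force_longest(text):
    """Reference answer: try every pair of start positions i < j."""
    n = len(text)
    best = ""
    for i in range(n):
        for j in range(i + 1, n):
            # length of the common prefix of the suffixes starting at i and j
            k = 0
            while j + k < n and text[i + k] == text[j + k]:
                k += 1
            # cap the length so the occurrence at i ends before j:
            # i + length <= j
            k = min(k, j - i)
            if k > len(best):
                best = text[i:i + k]
    return best
```

When several substrings share the maximum length, this reference may return a different one than a tree-based version, so comparing lengths is the safer check.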

    Additional notes

    • A suffix tree for a string of length n can be constructed in Θ(n) time and space. The modifications for this algorithm don't worsen the asymptotic behavior, so the overall running time is still in Θ(n).
    • This algorithm doesn't find all possible solutions. If there are several non-overlapping longest substrings, only the substring starting at the largest position is found.
    • It should be possible to modify the algorithm to also count the number of repetitions of the longest non-overlapping substring, or to find only solutions with at least k repetitions. For this, not only the minimum and maximum indices have to be considered, but the indices of all subtrees at a node. The above range condition then has to hold for each pair of adjacent indices.
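One possible reading of the last note is sketched below. It assumes that all start positions below a node have already been collected, and counts greedily how many of them can be chosen so that consecutive chosen occurrences do not overlap. The function name `count_nonoverlapping` is hypothetical, and collecting every position at every node would no longer be linear in general.

```python
def count_nonoverlapping(positions, length):
    """Greedy count of pairwise non-overlapping occurrences.

    positions: non-negative start indices of one substring in S
    length:    length of that substring
    """
    count = 0
    next_free = 0  # earliest start position that does not overlap the last pick
    for p in sorted(positions):
        if p >= next_free:
            count += 1
            next_free = p + length
    return count
```

A node would then qualify for "at least k repetitions" when `count_nonoverlapping(positions, length(Pv)) >= k`. For positions `[0, 1, 2, 3]` and length 2, the greedy pass picks 0 and 2, giving a count of 2.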
