Longest Non-Overlapping Repeated Substring using Suffix Tree/Array (Algorithm Only)


遇见更好的自我 2021-02-13 16:30

I need to find the longest non-overlapping repeated substring in a String. I have the suffix tree and suffix array of the string available.

When overlapping is allowed,

8 answers

灰色年华 (original poster)

2021-02-13 17:04
Since I had a hard time finding a clear description of a working algorithm to obtain the longest non-overlapping repeated substrings using a suffix tree, I'd like to share the version I gathered from various sources.

Algorithm
1. Construct the suffix tree for the input string S (terminated with a special character that doesn't occur in S). Each leaf node corresponds to a suffix S_i of S and is assigned the corresponding start position i of S_i in S.
2. Assign each node in the tree a pair (i_min, i_max) holding the minimum and maximum suffix indices found in that node's subtree.
  1. For each leaf, i_min = i_max = i.
  2. For each inner node, i_min is the minimum and i_max is the maximum over the indices of all its descendant leaves.
3. For all inner nodes v, let P_v denote the string obtained by concatenating all edge labels (prefixes) on the path from the root to v. Collect all such P_v that satisfy i_min + length(P_v) ≤ i_max for the corresponding i_min and i_max at v.
4. The longest of those P_v is the longest substring of S that occurs at least twice without overlapping.
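The four steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the linear-time version: it swaps the suffix tree for a naive suffix trie built in O(n²) time and space, and treats trie nodes with at least two children as the inner nodes of the suffix tree. The names `Node` and `longest_nonoverlapping_repeat` and the default terminator `"\0"` are hypothetical choices made for this sketch.

```python
class Node:
    """One trie node; depth equals length(P_v) for the path from the root."""
    __slots__ = ("children", "depth", "i_min", "i_max")

    def __init__(self, depth, start):
        self.children = {}
        self.depth = depth
        self.i_min = start  # smallest suffix index passing through this node
        self.i_max = start  # largest suffix index passing through this node


def longest_nonoverlapping_repeat(s, terminator="\0"):
    if terminator in s:
        raise ValueError("terminator must not occur in s")
    text = s + terminator

    # Step 1 (naive stand-in): insert every suffix into a trie.
    root = Node(0, 0)
    all_nodes = []
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            child = node.children.get(ch)
            if child is None:
                child = Node(node.depth + 1, i)
                node.children[ch] = child
                all_nodes.append(child)
            else:
                # Step 2: suffixes arrive in increasing order of i, so i_min
                # was fixed at creation and i is always the new maximum.
                child.i_max = i
            node = child

    # Steps 3 and 4: among branching nodes, keep the deepest one whose
    # leftmost and rightmost occurrences do not overlap.
    best_len, best_start = 0, 0
    for v in all_nodes:
        is_inner = len(v.children) >= 2
        if is_inner and v.i_min + v.depth <= v.i_max and v.depth > best_len:
            best_len, best_start = v.depth, v.i_min
    return s[best_start:best_start + best_len]


print(longest_nonoverlapping_repeat("banana"))  # prints "na"
```

For "banana" the node for "ana" has (i_min, i_max) = (1, 3) and fails the test because 1 + 3 > 3, while the node for "na" has (2, 4) and passes because 2 + 2 ≤ 4. Updating i_min and i_max during insertion gives the same values as a bottom-up pass over the finished tree, since the suffixes passing through a node are exactly the leaves below it.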
Explanation

If a substring of S occurs at least twice in S, it is the common prefix P of two suffixes S_i and S_j, where i and j denote their respective start position in S. Hence, there exists an inner node v in the suffix tree for S that has two descendant leaves that correspond to i and j such that the concatenation of all edge labels of the path from the root to v is equal to P.

The deepest such node v (in terms of the length of its corresponding prefix) marks the longest, possibly overlapping repeated substring in S. To make sure no overlapping substrings are considered, we have to make sure that P is no longer than the distance between i and j.

We therefore calculate the minimum and maximum indices i_min and i_max for each node, which correspond to the start positions of the leftmost and the rightmost suffixes of S that share the node's prefix. The minimum and maximum indices at a node can be obtained easily from the values of its children. (The index calculation would be more complicated if we were looking for the longest substrings that occur at least k times, because then the distances between all descendants' indices would have to be considered, not just the two that are farthest apart.) By considering only prefixes P that satisfy i_min + length(P) ≤ i_max, we make sure that the occurrence of P starting at i_min ends before the occurrence starting at i_max begins.
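The non-overlap condition can also be checked directly, without any tree, which is useful for testing a suffix-tree implementation on small inputs. The following brute-force reference is a sketch written for that purpose; the function name is hypothetical and its running time is far from linear. It applies the same test as above: a second occurrence must start at or after i + length(P).

```python
def brute_force_longest_nonoverlapping_repeat(s):
    n = len(s)
    # Two non-overlapping copies need 2 * length <= n, so start at n // 2.
    for length in range(n // 2, 0, -1):
        for i in range(n - 2 * length + 1):
            candidate = s[i:i + length]
            # Search only from i + length onwards, so any match found
            # cannot overlap the occurrence at position i.
            if s.find(candidate, i + length) != -1:
                return candidate
    return ""


print(brute_force_longest_nonoverlapping_repeat("banana"))  # prints "an"
```

Several substrings of the same maximal length may exist, so two correct implementations can return different strings (here "an" rather than "na"). When comparing results, compare the lengths.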

Additional notes
- A suffix tree for a string of length n can be constructed in Θ(n) time and space. The modifications for this algorithm don't worsen the asymptotic behavior, so the overall running time is still in Θ(n).
- This algorithm doesn't find all possible solutions. If there are several non-overlapping longest substrings, only the substring starting at the largest position is found.
- It should be possible to modify the algorithm to also count the number of repetitions of the longest non-overlapping substring, or to find only solutions with at least k repetitions. For this, not only the minimum and maximum indices have to be considered, but the indices of all subtrees at a node. The above range condition then has to hold for each pair of adjacent indices.
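As a rough illustration of the last note, the helper below counts how many occurrences of a substring can be chosen without overlap, given all start positions found below a node. It is an assumption of this sketch, not part of the algorithm above: instead of requiring the range condition for every adjacent pair, it uses a greedy left-to-right scan that skips occurrences overlapping the previously chosen one. The name `count_nonoverlapping` is hypothetical.

```python
def count_nonoverlapping(positions, length):
    """Greedily count occurrences of a length-`length` substring that
    can be picked from `positions` without any two of them overlapping."""
    count = 0
    next_free = 0  # first index not covered by the last chosen occurrence
    for p in sorted(positions):
        if p >= next_free:
            count += 1
            next_free = p + length
    return count


# Occurrences at 0, 2 and 4 of a length-3 substring: 0 and 4 fit, 2 overlaps.
print(count_nonoverlapping([0, 2, 4], 3))  # prints 2
```

A node would then qualify for "at least k repetitions" when this count is at least k for length(P_v). Collecting all positions per node costs more than tracking only i_min and i_max, so the Θ(n) bound no longer follows automatically.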