Longest Non-Overlapping Repeated Substring using Suffix Tree/Array (Algorithm Only)


遇见更好的自我 2021-02-13 16:30

I need to find the longest non-overlapping repeated substring in a String. I have the suffix tree and suffix array of the string available.

When overlapping is allowed,

8 answers

囚心锁ツ (original poster)

2021-02-13 17:24
We use the longest common prefix (LCP) array and suffix array to solve this problem in O(n log n) time.

The LCP array gives us the length of the longest common prefix between each pair of consecutive suffixes in the suffix array.

After constructing the LCP array and the suffix array, we can binary search for the length of the answer.

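Suppose the string is "acaca$". Its suffix array and LCP array are shown in the table below, where each LCP value is the length of the prefix shared with the suffix on the previous row (-1 for the first row, which has no predecessor).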
```
Suffix Array index  LCP  Suffix (implicit)
5                   -1   $
4                   0    a$
2                   1    aca$
0                   3    acaca$
3                   0    ca$
1                   2    caca$
```
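The table can be reproduced with a short sketch. This uses a naive construction (sorting the suffixes directly), which is fine for illustration but slower than the O(n log n) bound mentioned above; the function names `build_suffix_array` and `build_lcp_array` are made up for this example.

```python
def build_suffix_array(s):
    # Naive construction: sort the start positions by the suffix text itself.
    return sorted(range(len(s)), key=lambda i: s[i:])


def build_lcp_array(s, sa):
    # lcp[r] = length of the common prefix of the suffixes at ranks r-1 and r.
    # lcp[0] is -1 because the first suffix has no predecessor.
    lcp = [-1] * len(sa)
    for r in range(1, len(sa)):
        a, b = sa[r - 1], sa[r]
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        lcp[r] = k
    return lcp


s = "acaca$"
sa = build_suffix_array(s)
lcp = build_lcp_array(s, sa)
for pos, length in zip(sa, lcp):
    print(pos, length, s[pos:])
```

Since you already have the suffix array available, only the LCP step would be needed in practice.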
Let's binary search for the length of the answer.

Suppose an answer exists. Its two occurrences are prefixes of two suffixes of the string.

There is no guarantee that these suffixes are consecutive in the suffix array. However, if we know the length of the substring, every entry in the LCP table between the two suffixes must be at least that length. Also, for the two occurrences not to overlap, the starting positions of the two suffixes in the string must differ by at least that length.
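The first claim rests on a standard property: the common prefix of two suffixes that are not adjacent in the suffix array is the minimum of the LCP entries between them. A small sketch that checks this on the example, with the arrays copied from the table (the helper names are hypothetical):

```python
def common_prefix_len(s, a, b):
    # Directly count matching characters of the suffixes starting at a and b.
    k = 0
    while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
        k += 1
    return k


def lcp_between_ranks(lcp, i, j):
    # Common prefix length of the suffixes at ranks i < j in the suffix array,
    # computed as the minimum of the LCP entries strictly after i, up to j.
    return min(lcp[i + 1 : j + 1])


s = "acaca$"
sa = [5, 4, 2, 0, 3, 1]
lcp = [-1, 0, 1, 3, 0, 2]

# Every pair of ranks agrees with the direct character comparison.
for i in range(len(sa)):
    for j in range(i + 1, len(sa)):
        assert common_prefix_len(s, sa[i], sa[j]) == lcp_between_ranks(lcp, i, j)
```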

For a guessed length, consider each maximal run of consecutive LCP array entries that are all at least the guess. Each run covers a group of suffixes (including the suffix just before the run's first entry) that share a prefix of at least that length. In each group, find the smallest and largest starting positions.

How do we know our guess is a lower bound?

If, in at least one run of consecutive LCP entries that are all at least our guess, the largest and smallest starting positions differ by at least our guess, then the guess is attainable, so it is a lower bound on the answer.

How do we know our guess is too big?

If, in every run of consecutive LCP entries that are all at least our guess, the largest and smallest starting positions differ by less than our guess, then the guess is too big.
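The two tests above can be sketched as a single check. This assumes the LCP convention of the table (entry r compares ranks r-1 and r, first entry -1); the function name `has_non_overlapping_repeat` is made up for this example.

```python
def has_non_overlapping_repeat(sa, lcp, length):
    # True if some run of LCP entries >= length contains two start
    # positions that are at least `length` apart.
    n = len(sa)
    r = 1
    while r < n:
        if lcp[r] < length:
            r += 1
            continue
        # A run starts at rank r; it also includes the suffix at rank r-1.
        lo = hi = sa[r - 1]
        while r < n and lcp[r] >= length:
            lo = min(lo, sa[r])
            hi = max(hi, sa[r])
            r += 1
        if hi - lo >= length:
            return True
    return False


sa = [5, 4, 2, 0, 3, 1]
lcp = [-1, 0, 1, 3, 0, 2]
print(has_non_overlapping_repeat(sa, lcp, 2))  # guess is attainable
print(has_non_overlapping_repeat(sa, lcp, 3))  # guess is too big
```

Each call is a single pass over the arrays, so the binary search adds only a logarithmic factor on top.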

How do we find the answer given the length of the answer?

For each run of consecutive LCP entries that are all at least the answer, find the lowest and highest starting positions. If they differ by at least the answer, the longest non-overlapping repeated substring occurs at those two positions.
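Putting the pieces together, here is a minimal end-to-end sketch. It builds the arrays naively so that it runs on its own (swap in your existing suffix array and an efficient LCP construction to get the stated bound), and all function names are hypothetical.

```python
def suffix_and_lcp(s):
    # Naive suffix array plus LCP array (lcp[0] = -1, as in the table).
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    lcp = [-1] * len(sa)
    for r in range(1, len(sa)):
        a, b = sa[r - 1], sa[r]
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        lcp[r] = k
    return sa, lcp


def find_pair(sa, lcp, length):
    # Return (lo, hi) start positions of two non-overlapping occurrences
    # of a common substring of the given length, or None if there are none.
    n = len(sa)
    r = 1
    while r < n:
        if lcp[r] < length:
            r += 1
            continue
        lo = hi = sa[r - 1]
        while r < n and lcp[r] >= length:
            lo = min(lo, sa[r])
            hi = max(hi, sa[r])
            r += 1
        if hi - lo >= length:
            return lo, hi
    return None


def longest_non_overlapping_repeat(s):
    sa, lcp = suffix_and_lcp(s)
    best = None
    low, high = 1, len(s) // 2  # two disjoint copies need 2 * length <= len(s)
    while low <= high:
        mid = (low + high) // 2
        pair = find_pair(sa, lcp, mid)
        if pair is not None:
            best = (mid, pair)  # attainable: try longer
            low = mid + 1
        else:
            high = mid - 1      # too big: try shorter
    if best is None:
        return ""
    length, (lo, _hi) = best
    return s[lo : lo + length]


print(longest_non_overlapping_repeat("acaca$"))  # prints "ac"
```

When several substrings tie for the longest length, this sketch returns whichever run it meets first in suffix array order.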

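In the example, "acaca$", we can find that the length of the answer is 2. A guess of 3 fails because its only run, "aca$" and "acaca$", has a distance of just 2.

With a guess of 2, there are two runs:

"aca$" and "acaca$" (starting positions 2 and 0): the distance between the lower and higher positions is 2, resulting in the repeated substring "ac".

"ca$" and "caca$" (starting positions 3 and 1): the distance between the lower and higher positions is 2, resulting in the repeated substring "ca".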