How does LCP help in finding the number of occurrences of a pattern?

断了今生、忘了曾经 提交于 2019-11-28 17:57:28

I do not know any way of using the LCP array instead of carrying out a binary search, but I believe what you refer to is the technique described by Udi Manber and Gene Myers in Suffix arrays: a new method for on-line string searches.

The idea is this: In order to find the number of occurrences of a given string P (length m) in a text T (length N),

  • You use binary search against the suffix array of T (just like you suggested)
  • But you speed it up using the LCP array as auxiliary data structure. More specifically, you generate a special version of the LCP array (I will call it LCP-LR below) and use that.

The issue with using standard binary search (without the LCP information) is that in each of the O(log N) comparisons you need to make, you compare P to the current entry of the suffix array, which means a full string comparison of up to m characters. So the complexity is O(m*log N).

The LCP-LR array helps improve this to O(m+log N), in the following way:

  • At any point during the binary search algorithm, you consider, as usual, a range (L,...,R) of the suffix array and its central point M, and decide whether you continue your search in the left sub-range (L,...,M) or in the right sub-range (M,...,R).
  • In order to make the decision, you compare P to the string at M. If P is identical to M, you are done, but if not, you will have compared the first k characters of P and then decided whether P is lexicographically smaller or larger than M. Let's assume the outcome is that P is larger than M.
  • So, in the next step, you consider (M,...,R) and a new central point M' in the middle:

                  M ...... M' ...... R
                  |
           we know:
              lcp(P,M)==k
    

    The trick now is that LCP-LR is precomputed such that a O(1)-lookup tells you the longest common prefix of M and M', lcp(M,M').

    You know already (from the previous step) that M itself has a prefix of k characters in common with P: lcp(P,M)=k. Now there are three possibilities:

    • Case 1: k < lcp(M,M'), i.e. P has fewer prefix characters in common with M than M has in common with M'. This means the (k+1)-th character of M' is the same as that of M, and since P is lexicographically larger than M, it must be lexicographically larger than M', too. So we continue in the right half (M',...,R).
    • Case 2: k > lcp(M,M'), i.e. P has more prefix characters in common with M than M has in common with M'. Consequently, if we were to compare P to M', the common prefix would be smaller than k, and M' would be lexicographically larger than P, so, without actually making the comparison, we continue in the left half (M,...,M').
    • Case 3: k == lcp(M,M'). So M and M' are both identical with P in the first k characters. To decide whether we continue in the left or right half, it suffices to compare P to M' starting from the (k+1)-th character.
  • We continue recursively.

The overall effect is that no character of P is compared to any character of the text more than once. The total number of character comparisons is bounded by m, so the total complexity is indeed O(m+log N).

Obviously, the key remaining question is how did we precompute LCP-LR so it is able to tell us in O(1) time the lcp between any two entries of the suffix array? As you said, the standard LCP array tells you the lcp of consecutive entries only, i.e. lcp(x-1,x) for any x. But M and M' in the description above are not necessarily consecutive entries, so how is that done?

The key to this is to realize that only certain ranges (L,...,R) will ever occur during the binary search: It always starts with (0,...,N) and divides that at the center, and then continues either left or right and divide that half again and so forth. If you think of it: Every entry of the suffix array occurs as central point of exactly one possible range during binary search. So there are exactly N distinct ranges (L...M...R) that can possibly play a role during binary search, and it suffices to precompute lcp(L,M) and lcp(M,R) for those N possible ranges. So that is 2*N distinct precomputed values, hence LCP-LR is O(N) in size.

Moreover, there is a straight-forward recursive algorithm to compute the 2*N values of LCP-LR in O(N) time from the standard LCP array – I'd suggest posting a separate question if you need a detailed description of that.

To sum up:

  • It is possible to compute LCP-LR in O(N) time and O(2*N)=O(N) space from LCP
  • Using LCP-LR during binary search helps accelerate the search procedure from O(M*log N) to O(M+log N)
  • As you suggested, you can use two binary searches to determine the left and right end of the match range for P, and the length of the match range corresponds with the number of occurrences for P.

The Longest Common Prefix (LCP) is the Lowest Common Ancestor (LCA) in a suffix tree. Once you have the Lowest Common Ancestor, you can count the number of nodes that branch out from the LCA. This will give you the number of occurrences of a pattern in the suffix tree. This is the relationship between the LCP and LCA.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!