Understanding the algorithm for pattern matching using an LCP array

£可爱£侵袭症+ 提交于 2019-12-20 03:52:02

问题


Foreword: My question is mainly an algorithmic question, so even if you are not familiar with suffix and LCP arrays you can probably help me.

In this paper it is described how to efficiently use suffix and LCP arrays for string pattern matching.

I understood SA and LCP work and how the algorithm's runtime can be improved from O(P*log(N)) (where P is the length of the pattern and N is length of the string) to O(P+log(N)) (Thanks to Chris Eelmaa's answer here and jogojapans answer here).

I was trying to go through the algorithm in figure 4 which explains the usage of LLcp and RLcp. But I have problems understanding how it works.

The algorithm (taken from the source):

Explanation of the used variable names:

lcp(v,w) : Length of the longest common prefix of v and w
W = w0..wP-1 : pattern of length P
A = a0..aN-1 : the text (length N)
Pos[0..N-1] : suffix array
L_W : index (in A) of first occurrence of the matched pattern
M : middle index of current substring
L : lower bound
R : upper bound
Lcp : array of size N-2 such that Lcp[M] = lcp(A_Pos[L_M], A_pos[M]) where L_M is the lower bound of the unique interval with M in the middle
Rcp : array of size N-2 such that Rcp[M] = lcp(A_Pos[R_M], A_pos[M]) where R_M is the upper bound of the unique interval with M in the middle

Now I want to try the algorithm using the following example (partly taken from here):

 SA | LCP | Suffix entry
 -----------------------
 5  | N/A | a  
 3  | 1   | ana  
 1  | 3   | anana  
 0  | 0   | banana  
 4  | 0   | na  
 2  | 2   | nana 

A = "banana" ; N = 6
W = "ban" ; P = 3

I want to try to match a string, say ban and would expect the algorithm to return 0 as L_W.

Here is how I would step through the algorithm:

l = lcp("a", "ban") = 0
r = lcp("nana", "ban") = 0
if 0 = 3 or 'b' =< 'a' then // which is NOT the case for both conditions
    L_W = 0
else if 0 < 3 or 'b' =< 'n' then // which is the case for both conditions 
    L_W = 6 // which means 'not found'
...
...

I feel like I am missing something but I can't find out what. Also I am wondering how the precomputed LCP array can be used instead of calling lcp(v,w).


回答1:


I believe there was an error.

First condition is fairly easy to understand. When LCP length == pattern length, it's done. When your pattern is even smaller than or equal to the smallest one, then only choice is the smallest one.

The second condition is wrong. We can prove it by contradiction. r < P || Wr <= a... means r >= P && Wr > a... If r >= P, then how can we have Lw = N(not found), since we already have r length common prefix?



来源:https://stackoverflow.com/questions/28444226/understanding-the-algorithm-for-pattern-matching-using-an-lcp-array

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!