Question
I understand how the bad-character heuristic works. When you find the mismatched letter x, just shift the pattern so that the rightmost x in the pattern is aligned with the x in the string. That is easy to implement in code.
I think I also understand how the good-suffix heuristic works. When we find a good suffix s, we find the same suffix at a different location in the pattern and shift the pattern so that this s is aligned with the s in the string. But I don't understand how to do that in code. How do we find out whether the same suffix exists elsewhere in the pattern? And how do we know its position? The code:
void bmPreprocess1()
{
    int i=m, j=m+1;
    f[i]=j;
    while (i>0)
    {
        while (j<=m && p[i-1]!=p[j-1])
        {
            if (s[j]==0) s[j]=j-i;
            j=f[j];
        }
        i--; j--;
        f[i]=j;
    }
}
from http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm doesn't make sense to me. Could someone write pseudocode for this task that is as simple as possible, or explain it some other way?
Answer 1:
First, I will use p[i] to denote a character in the pattern, m for the pattern length, and $ for the last character in the pattern, i.e., $ = p[m-1].
There are two situations to consider in case 1 of the good-suffix heuristic.
Situation 1
Consider the following example,
leading TEXT cXXXbXXXcXXXcXXX rest of the TEXT
     cXXXbXXXcXXXcXXX
                 ^
                 | mismatch here
So the substring XXX in the pattern cXXXbXXXcXXXcXXX is the good suffix. The mismatch occurs at the character c just before it. After the mismatch, should we shift 4 to the right or 8?
If we shift 4 as in the following, then the same mismatch will occur again (b mismatches c),
leading TEXT cXXXbXXXcXXXcXXX rest of the TEXT
         cXXXbXXXcXXXcXXX
                 ^
                 | mismatch occurs here again
So we can actually shift 8 characters to the right in this situation.
Situation 2
Let us look at another example
leading TEXT cXXXcXXXbXXXcXXX rest of the TEXT
         cXXXcXXXbXXXcXXX
                     ^
                     | mismatch happens here
Can we shift 4 or 8 or more here? Obviously, if we shift 8 or more, we will miss the opportunity to match the pattern with the text. So we can only shift 4 characters to the right in this situation.
So what is the difference between these two situations?
The difference is that in the first case, the mismatched character c plus the good suffix XXX, which is cXXX, is the same as the next occurrence of XXX (counting from the right) plus the character before it. In the second situation, cXXX is not the same as the next occurrence (counting from the right) plus the character before it.
So for any given GOOD SUFFIX (let us call it XXX) we need to figure out the shift in these two situations: (1) the character before the GOOD SUFFIX (let us call it c) plus the GOOD SUFFIX also matches the next occurrence of the good suffix (counting from the right) plus the character before it; (2) the character plus the GOOD SUFFIX does not match that occurrence plus its preceding character.
For situation (1), for example, if you have the pattern 0XXXcXXXcXXXcXXXcXXXcXXX and the first test of c fails, you can shift 20 characters to the right and align 0XXX with the text that has been tested.
This is how the number 20 is calculated,
          1         2
012345678901234567890123
0XXXcXXXcXXXcXXXcXXXcXXX
^                   ^
The mismatch occurs at position 20, so the shifted substring 0XXX will occupy positions 20 to 23. 0XXX starts at position 0, so 20 - 0 = 20.
For situation (2), as in the pattern 0XXXaXXXbXXXcXXX, if the first test of c fails, you can shift only 4 characters to the right and align bXXX with the text that has been tested.
This is how the number 4 is calculated,
0123456789012345
0XXXaXXXbXXXcXXX
The mismatch occurs at position 12, and the next substring to take that place is bXXX, which starts at position 8, so 12 - 8 = 4. If we set j = 12 and i = 8, then the shift is j - i, which corresponds to s[j] = j - i; in the code. (The code indexes both positions one to the right, 13 and 9, but the difference is the same.)
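The position calculations above can be cross-checked with a brute-force sketch. It is not the preprocessing from the linked page; the function name naive_case1_shift and its signature are made up for illustration. Given the pattern and the position of the mismatch, it scans leftwards for the nearest other occurrence of the good suffix that is preceded by a different character and returns the distance to it, or 0 when there is none (that situation belongs to the other case of the heuristic, which is not covered here).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, for illustration only: brute-force shift for a
 * mismatch at index `mismatch`, where p[mismatch+1 .. m-1] is the good suffix. */
int naive_case1_shift(const char *p, int mismatch)
{
    int m = (int)strlen(p);
    assert(mismatch >= 0 && mismatch < m);

    int start = mismatch + 1;   /* first index of the good suffix */
    int len = m - start;        /* its length (may be 0)          */

    /* Try candidate occurrences from right to left. k >= 1 so that the
     * occurrence has a preceding character to compare against. */
    for (int k = start - 1; k >= 1; k--) {
        if (strncmp(p + k, p + start, (size_t)len) == 0 &&
            p[k - 1] != p[mismatch])
            return start - k;   /* distance between the two occurrences */
    }
    return 0;                   /* no such occurrence */
}
```

With the patterns from the text, naive_case1_shift("0XXXaXXXbXXXcXXX", 12) gives 4 and naive_case1_shift("0XXXcXXXcXXXcXXXcXXXcXXX", 20) gives 20. It costs quadratic time in the pattern length, which is what the preprocessing with f and s avoids.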
Border
To consider all the good suffixes, we first need to understand what a so-called border is.
A border is a substring which is both a proper prefix and a proper suffix of a string. For example, for the string XXXcXXX, X is a border, XX is a border, and XXX is too, but XXXc is not. For each suffix of the pattern we need to identify the starting point of its widest border, and this information is saved in the array f: f[i] is the starting position of the widest border of the suffix beginning at position i.
How to determine f[i]?
Assume we know f[i] = j, as well as f[k] for all i < k < m, which means the widest border of the suffix starting at position i starts at position j. We want to find f[i-1] based on f[i].
For example, for the pattern aabbccaacc, at position i=4 the suffix is ccaacc, and its widest border is cc (p[8]p[9]), so j = f[i=4] = 8. Now we want to know f[i-1] = f[3] based on the information we have for f[4], f[5], ... For f[3], the suffix is now bccaacc. At position j-1=7 the character is a, which differs from p[4-1], which is b. So bcc is not a border.
We know that any border of bccaacc with width >= 1 has to begin with b, followed by a border of the suffix starting at position j = 8, which is cc in this example. cc has the widest border c at position j = f[8], which is 9. So we continue our search by comparing p[4-1] against p[j-1]. They are not equal again. Now the suffix is p[9] = c, and it has only a zero-length border at position 10, so now j = f[9], which is 10. We continue our search by comparing p[4-1] against p[j-1]; they are not equal, and that is the end of the string. So f[3] has only a zero-length border, which makes it equal to 10.
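The walk-through above can be replayed with a small sketch that fills only the array f and leaves the shift table out. The function name border_starts and the convention that the caller passes an int array with room for m + 1 entries are assumptions made for this illustration; the loop itself is the one from the question.

```c
#include <assert.h>
#include <string.h>

/* Illustrative only: f[i] receives the start of the widest border of the
 * suffix p[i .. m-1]. The caller must provide room for m + 1 ints. */
void border_starts(const char *p, int *f)
{
    assert(p != NULL && f != NULL);
    int m = (int)strlen(p);

    int i = m, j = m + 1;
    f[i] = j;                       /* the empty suffix has no border */
    while (i > 0) {
        /* follow narrower and narrower borders until the characters
         * in front of them agree, or no border is left (j > m) */
        while (j <= m && p[i - 1] != p[j - 1])
            j = f[j];
        i--; j--;
        f[i] = j;
    }
}
```

For aabbccaacc this yields f[4] = 8, f[8] = 9, f[9] = 10 and f[3] = 10, the values derived by hand above.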
To describe the process in a more general sense, f[i] = j means something like this,
Position: 012345     i-1 i     j-1 j     m-1
pattern:  abcdef ... @   x ... ?   x ... $
If the character @ at position i - 1 is the same as the character ? at position j - 1, we know that f[i - 1] = j - 1;, or --i; --j; f[i] = j;. The border is the suffix @x ... $ starting at position j-1.
But if the character @ at position i - 1 is different from the character ? at position j - 1, we have to continue our search to the right. We know two facts: (1) the border has to be narrower than the one starting at position j, i.e., narrower than x...$; (2) the border has to begin with @ and end with the character $, or it could be empty.
Based on these two facts, we continue our search within the substring x ... $ (from position j to m) for a border beginning with x. The next candidate border starts at f[j], i.e., j = f[j];. Then we compare the character @ with the character before the new x, which is at j-1. If they are equal, we have found the border; if not, we continue the process until j > m. This process is shown by the following code,
while (j<=m && p[i-1]!=p[j-1])
{
    j=f[j];
}
i--; j--;
f[i]=j;
Now look at the condition p[i-1] != p[j-1]. This is what we talked about in situation (2): the suffix starting at i begins with the border that starts at j, but p[i-1] != p[j-1], so we shift from i to j, that is, s[j] = j - i;.
Now the only thing left unexplained is the test if (s[j] == 0), which matters when a shorter suffix has the same border. For example, you have the pattern
012345678
addbddcdd
When you calculate f[i - 1] with i = 4, you will set s[7]. But when you calculate f[i-1] with i = 1, you would set s[7] again if you did not have the test if (s[j] == 0). The test keeps the first, smaller value: if you have a mismatch at position 6, you shift 3 to the right (aligning bdd with the positions cdd occupied), not 6 (which would align add with the positions cdd occupied).
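To see the effect of the test, here is a sketch of the same preprocessing with the guard made optional. The function name preprocess1, the keep_first flag, and passing p, f and s as parameters instead of globals are assumptions for illustration; the caller is assumed to supply arrays with room for m + 1 ints.

```c
#include <assert.h>
#include <string.h>

/* Illustrative variant of bmPreprocess1. keep_first != 0 reproduces the
 * original `if (s[j]==0)` guard; keep_first == 0 overwrites s[j] every time. */
void preprocess1(const char *p, int *f, int *s, int keep_first)
{
    assert(p != NULL && f != NULL && s != NULL);
    int m = (int)strlen(p);

    memset(s, 0, (size_t)(m + 1) * sizeof *s);   /* 0 means "not set yet" */

    int i = m, j = m + 1;
    f[i] = j;
    while (i > 0) {
        while (j <= m && p[i - 1] != p[j - 1]) {
            if (!keep_first || s[j] == 0)
                s[j] = j - i;       /* shift that moves suffix i onto border j */
            j = f[j];
        }
        i--; j--;
        f[i] = j;
    }
}
```

For addbddcdd, s[7] ends up as 3 with the guard and as 6 without it, so dropping the guard would let the later, larger shift replace the safe one.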
The comments for the code
void bmPreprocess1()
{
    // initial condition: f[m] = m+1 (the empty suffix has no border)
    int i=m, j=m+1;
    f[i]=j;
    // calculate f[i-1] and the shifts s[j], for i from m down to 1
    while (i>0)
    {
        // at this line, we know f[i], f[i+1], ... f[m]
        // search for a j with p[i-1]==p[j-1]; then f[i-1] is j-1
        while (j<=m && p[i-1]!=p[j-1])
        {
            if (s[j]==0) s[j]=j-i; // if s[j] is not set yet, set it
            j=f[j];                // start of the widest border of the suffix p[j] ... p[m-1]
        }
        // assign j-1 to f[i-1]
        i--; j--;
        f[i]=j;
    }
}
Source: https://stackoverflow.com/questions/19345263/boyer-moore-good-suffix-heuristics