Boyer-Moore good-suffix heuristics

前端 未结 1 1764
有刺的猬
有刺的猬 2021-02-01 06:41

I understand how the bad character heuristics work. When you find the mismatched letter x, just shift the pattern so the rightmost x in the pattern wou

相关标签:
1条回答
  • 2021-02-01 07:14

    First, I will use p[i] denote a character in the pattern, m the pattern lenght, $ the last character in the pattern, i.e., $ = p[m-1].

    There are two situations for good suffix heuristics case 1.

    Situation 1

    Consider the following example,

        leading TEXT cXXXbXXXcXXXcXXX rest of the TEXT
             cXXXbXXXcXXXcXXX
                         ^ 
                         | mismatch here
    

    So the sub string XXX in the pattern cXXXbXXXcXXXcXXX is the good suffix. The mismatch occurs at character c. So after the mismatch, should we shift 4 to the right or 8?

    If we shift 4 as in the following, then the same mismath will occur again (b mismathes c),

        leading TEXT cXXXbXXXcXXXcXXX rest of the TEXT
                 cXXXbXXXcXXXcXXX
                         ^ 
                         | mismatch occurs here again
    

    So we can actually shift 8 characters to the right in this situation.

    Situation 2

    Let us look at another example

        leading TEXT cXXXcXXXbXXXcXXX rest of the TEXT
                cXXXXcXXXbXXXcXXX
                             ^ 
                             | mismatch happens here
    

    Can we shift 4 or 8 or more here? Obviously we if we shift 8 or more, we will miss the opportunity to match the pattern with the text. So we can only shift 4 characters to the right in this situation.

    So what is the difference between these two situations?

    The difference is that in the first case, the mismatched character c plus the good suffix XXX, which is cXXX, is the same as the next (count from the right) match for XXX plus the character before that. While in the second situation, cXXX is not the same as the next match (count from the right) plus the character before that match.

    So for any given GOOD SUFFIX (let us call it XXX) we need to figure out the shift in these two situations, (1) the character (let us call it c) before the GOOD SUFFIX plus the GOOD SUFFIX, in the pattern is also match the next (count from the right) match of the good suffix + the character before it , (2) the character plus the GOOD SUFFIX does not match

    For situation (1), for example, if you have a pattern, 0XXXcXXXcXXXcXXXcXXXcXXX, if after the first test of c fails, you can shift 20 characters to the right, and align 0XXX with the text that been tested.

    This is how the number 20 is calculated,

                  1         2
        012345678901234567890123
        0XXXcXXXcXXXcXXXcXXXcXXX
        ^                   ^
    

    The position the mismatch occurs is 20, the shifted sub string 0XXX will take position from 20 to 23. And 0XXX starts with position 0, so 20 - 0 = 20.

    For situation (2), like in this example, 0XXXaXXXbXXXcXXX, if after the first test of c fails, you can shift only 4 characters to the right, and align bXXX with the text that been tested.

    This is how number 4 is calculated,

        0123456789012345
        0XXXaXXXbXXXcXXX
    

    The position where the mismatch occurs is 12, the next substring to take that place is bXXX which starts with position 8, 12 - 8 = 4. If we set j = 12, and i = 8, then the shift is j - i, which is s[j] = j - i; in the code.

    Border

    To consider all the good suffix, we first need to understand what is a so called border. A border is a substring which is both a proper prefix and a proper suffix of a string. For example, for a string XXXcXXX, X is a border, XX is a border, XXX too. But XXXc is not. We need to identify the starting point of the widest border of the suffix of the pattern and this info is saved in array f[i].

    How to determine f[i] ?

    Assume we know f[i] = j and for all other f[k]s with i < k < m, which means the widest border for the suffix starting from position i started at position j. We want to find f[i-1] based on f[i].

    For example, for a pattern aabbccaacc, at postion i=4, the suffix is ccaacc, and the widest border for that is cc (p[8]p[9]), so j = f[i=4] = 8. And now we want to know f[i-1] = f[3] based on the info we have for f[4], f[5], ... For f[3], the suffix now is bccaacc. At position, j-1=7, it is a != p[4-1] which is b. So bcc is not a border.

    We know any border with width >= 1 of bccaacc has to begin with b plus the border of the suffix starting from positin j = 8, which is cc in this example. cc has the widest border c at position j = f[8] which is 9. So we continue our search with comparing p[4-1] against p[j-1]. And they are not equal again. Now the suffix is p[9] = c and it has only zero length border at position 10. so now j = f[9] and it is 10. So we continue our search with comparing p[4-1] against p[j-1], they are not equal and that is the end of the string. Then f[3] has only zero length border which make it equal to 10.

    To describe the process in a more general sense

    Therefore, f[i] = j means something like this,

        Position:    012345     i-1  i     j - 1   j       m
        pattern:     abcdef ... @    x ... ?       x ...   $ 
    

    If character @ at position i - 1 is the same as character ? at position j - 1, we know that f[i - 1] = j - 1; , or --i; --j; f[i] = j;. The border is suffix @x ... $ starting from position j-1.

    But if character @ at position i - 1 is different from character ? at position j - 1, we have to continue our search to the right. We know two facts: (1) we know now the border width has to be smaller than the one started from position j, i.e, smaller than x...$. Second the border has to be begin with @... and ends with character $ or it could be empty.

    Based on these two facts, we continue our search within sub string x ... $ (from position j to m) for a border begin with x. Then the next border should be at j which is equal to f[j];, i.e. j = f[j];. Then we compare character @ with the character before x, which is at j-1. If they are equal, we found the border, if not, continue the process until j > m. This process is shown by the following code,

        while (j<=m && p[i-1]!=p[j-1])
        {
            j=f[j];
        }
    
        i--; j--;
        f[i]=j;
    

    Now look at condition p[i -1] != p[j-1], this is what we talked about in situation (2), p[i]matchesp[j], but p[i -1] != p[j-1], so we shift from i to j, that that is s[j] = j - i;.

    Now the only thing left not explained is the test if (s[j] == 0) which will occur when a shorter suffix has the same border. For example, you have a pattern

        012345678
        addbddcdd
    

    When you calculate f[i - 1] and i = 4, you will set s[7]. But when you calculate f[i-1] for i = 1, you will set s[7] again if you don't have the test if (s[j] == 0). This means if you have mismatch at position 6, you shift 3 to the right (align bdd to the positions cdd occupied) not 6 (not shift until add to the positions cdd occupied).

    The comments for the code

        void bmPreprocess1()
        {
            // initial condition, set f[m] = m+1;
            int i=m, j=m+1;
            
            f[i]=j;
    
            // calculate f[i], s[j] from i = m to 0.
            while (i>0)
            {
                // at this line, we know f[i], f[i+1], ... f[m].
                while (j<=m && p[i-1]!=p[j-1]) // calculate the value of f[i-1] and save it to j-1
                {
                    if (s[j]==0) s[j]=j-i; // if s[j] is not occupied, set it.
                    j=f[j]; // get the start position of the border of suffix p[j] ... p[m-1]
                }
                // assign j-1 to f[i-1]
                i--; j--;
                f[i]=j;
            }
        }
    
    0 讨论(0)
提交回复
热议问题