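The Boyer-Moore string search algorithm is a highly efficient string search algorithm. It was designed by Bob Boyer and J Strother Moore in 1977; the original definition was given as early as 1975, and the construction algorithm and the proofs followed later.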
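First, some definitions:
1. pattern is the pattern string, of length patLen;
2. Text is the string being searched, of length n;
3. the current mismatched character is at position j in pattern (0 ≤ j ≤ patLen - 1);
4. the number of characters already matched is m (0 ≤ m < patLen), so j + m = patLen - 1;
5. Δ(*) is the position in pattern of the rightmost occurrence of the mismatched Text character *, where * can be any character;
Many references use 1-based array positions when explaining the principle. Here all positions are 0-based, so that they map directly onto the code.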
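Let us first look at the bad character rule:
I. Bad character rule: align the mismatched Text character with its rightmost occurrence in pattern; if it does not occur in pattern at all, skip past it entirely;
>Case 1: the mismatched character does not occur in pattern. Then (see the figure):
Text pointer moves right by: patLen, which lines it up with the right end of the shifted pattern;
Pattern moves right by: patLen - m;
>Case 2: the mismatched character does occur in pattern. There are two sub-cases:
a>. The rightmost occurrence of the character in pattern is to the left of the current mismatch position (see the figure). For example, with the pattern "AT-THAT" (patLen = 7, Δ('-') = 2) and m = 2:
Text pointer moves right by: j - Δ('-') + m = (j + m) - Δ('-') = (patLen - 1) - Δ('-') = (7 - 1) - 2 = 4
Pattern moves right by: text pointer shift - m = 4 - 2 = 2;
b>. The rightmost occurrence of the character in pattern is to the right of the current mismatch position (see the figure). With the same pattern, Δ('T') = 6 and m = 2:
Text pointer moves right by: (patLen - 1) - Δ('T') = (7 - 1) - 6 = 0
Pattern moves right by: text pointer shift - m = 0 - 2 = -2
The pattern would actually move backwards and repeat comparisons, which must never happen. In this situation we simply move pattern forward by 1 position.
Summing up the three cases, we define the bad character function delta1() as the shift of the text pointer:
delta1(*) = patLen; (the mismatched character does not occur in pattern)
          = patLen - 1 - Δ(*); (the mismatched character occurs in pattern, and its rightmost occurrence is to the left of the current mismatch position)
          = 1; (the mismatched character occurs in pattern, and its rightmost occurrence is to the right of the current mismatch position)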
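II. Good suffix rule: given the part that has already matched (subpat), look in pattern for another substring that matches subpat in whole or in part and align it directly, avoiding useless shifts;
First, some conventions:
1. Let $ be a character that does not occur in pattern, and let pat[i] = $ for i < 0;
2. Two sequences [c0 … c(n-1)] and [d0 … d(n-1)] are consistent if and only if, for every j (0 ≤ j < n), cj = dj or cj = $ or dj = $;
3. The rightmost plausible reoccurrence of subpat (pat[j+1 ~ patLen-1]) is at position rpr(j): the largest k such that [pat[j+1] ... pat[patLen-1]] and [pat[k] ... pat[k + patLen - j - 2]] are consistent, and such that k ≤ 0 or pat[k-1] != pat[j].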
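The figure above lists the rpr() values computed for the pattern "ABXYCDEXY". Let us walk through them:
a>. j = 8: the matched substring p[j+1 … patLen-1] is empty. The empty string is consistent with anything, so only the condition on pat[k-1] matters; since pat[7] = 'X' differs from pat[8] = 'Y', the largest admissible k is 8, and rpr(8) = 8.
b>. j = 7: subpat is "Y". We have p[3 ~ 3] = subpat with k = 3 > 0, but pat[k-1] == pat[j] == 'X', so the condition fails. Searching further to the left finds no other occurrence, so subpat can only sit at position -1, just before the head of pattern: rpr(7) = -1.
c>. j = 6: subpat is "XY". We have p[2 ~ 3] = subpat, and pat[k-1] != pat[j] also holds, so rpr(6) = 2.
d>. j = 5: subpat is "EXY". No substring of pattern is consistent with subpat, so it can only lie before the head of pattern: rpr(5) = -3;
The remaining values follow in the same way; the cases above should cover every way of determining rpr(). One rule falls out of the analysis:
rpr[patLen-1] = patLen-1, provided the last two characters of pattern differ.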
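From this we get the shift for the good suffix rule, which moves pat[k] to the position where pat[j+1] was aligned:
Pattern moves right by: j + 1 - rpr(j)
Text pointer moves right by: m + j + 1 - rpr(j) = (patLen - 1 - j) + j + 1 - rpr(j) = patLen - rpr(j)
We can now define the good suffix shift function:
delta2(j) = patLen - rpr(j); (0 ≤ j < patLen)
* Readers who have seen other material on the BM algorithm may have met delta2(j) = patLen + 1 - rpr(j) instead. As noted at the start, array indices here are 0-based, so each rpr(j) value is 1 less than its 1-based counterpart.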
The complete implementation follows:
#include <string.h> // strlen()
#include <stdlib.h>
#include <stdio.h>  // printf(): main() uses it even when tracing is off
#define ALPHABET_SIZE (1 << (sizeof(char)*8))
// Enable any/all to trace intermediate results
//#define TRACE_DELTA1
//#define TRACE_DELTA2
//#define TRACE_BM
#if defined TRACE_DELTA1 || defined TRACE_DELTA2 || defined TRACE_BM
#include <ctype.h>  // toupper()
#endif
void calc_delta1(const char *pat, int patlen, int delta1[])
{
    int j = 0;
    // Characters that do not occur in pat shift the text pointer by patlen
    for (j = 0; j < ALPHABET_SIZE; j++)
        delta1[j] = patlen;
    for (j = 0; j < patlen; j++)
    {
        // By scanning pat from left to right, the final
        // value in delta1[char] corresponds to the *rightmost*
        // occurrence of char in pat.
        // Index through unsigned char: plain char may be signed,
        // which would give a negative index for bytes >= 0x80.
        delta1[(unsigned char)pat[j]] = patlen - 1 - j;
    }
#ifdef TRACE_DELTA1
    printf("Starting dump delta1[]>>>>>>>>>>>>>>>>>>>>>>>>>\n");
    for (j = 0; j < ALPHABET_SIZE; j++)
    {
        if (delta1[j] != patlen)
        {
            printf(" %c:%d\n", (char)j, delta1[j]);
        }
    }
    printf(" others:%d\n", patlen);
#endif
}
void calc_delta2(const char *pat, int patlen, int * delta2)
{
    int i = 0, j = 0, s = 0, m = 0, n = 0;
    // rpr[j] : where we can find rightmost plausible recurrence of pat[j+1 .. patlen-1]
    int *rpr = new int[patlen];
    // Mark each uninitialized rpr value with a large negative index
    const int def = -2*patlen;
    for (i = 0; i != patlen; i++)
    {
        rpr[i] = def;
    }
    // r: number of uninitialized entries in rpr[]
    int r = patlen;
    // Scan pattern from right-to-left until all rpr[] are initialized.
    // s: scan position.
    // Examine all candidate recurrences that end just before pat[s],
    // including the empty one
    for (s = patlen - 1; r > 0; s--)
    {
        // m: length of the candidate recurrence pat[s-m .. s-1]
        for (m = 0; m <= patlen - 1 && r > 0; m++)
        {
            // Introduce j and k (as used in the BM paper)
            // j: index of the character just left of the suffix pat[j+1 .. patlen-1]
            int j = patlen - m - 1;
            // k: index of leftmost character of (possible) recurrence.
            int k = s - m;
#ifdef TRACE_DELTA2
            const int indent = patlen;
            printf("\ns:%d m:%d j:%d k:%d\n", s, m, j, k);
            printf("p :%*s%s\n", indent, "", pat);
            printf("j :%*s%*.*s\n", indent+j, "", m+1, m+1, &pat[j] );
            printf("k-1:%*s", indent+k-1, "");
            for (n = 0; n <= m; n++)
            {
                printf("%c", (k-1+n < 0 ? pat[j+n] : pat[k-1+n]) );
            }
            printf("\n");
#endif
            // We have a match of pat[j+1 .. j+m] with pat[k .. k+m-1]
            // Compare pat[j] to pat[k-1].
            // Match: extend the substring to the left by increasing m
            // Mismatch: terminate the substring and check if plausible RPR
            bool mismatch = false;
            if (k > 0)
            {
                if (pat[j] == pat[k-1]) // extend substring
                    continue;
                mismatch = true;
            }
            // else the preceding char, pat[k-1], lies to the left of pat[0],
            // which terminates the substring
            // We have a match of m (possibly zero) characters.
            // pat[j+1 .. j+m] matches pat[k .. k+m-1] and
            // either pat[j] != pat[k-1] or k <= 0.
            // So rpr[j] = k (unless rpr[j] is already > k)
            if (rpr[j] < k)
            {
#ifdef TRACE_DELTA2
                printf("2 :%*s %c %*.*s %*s s:%d m:%d j:%d k:%d r:%d\n",
                       indent+j, "",
                       toupper(pat[j]),
                       m, m, &pat[j+1],
                       (patlen-j-1-m), "",
                       s, m, j, k, r);
#endif
                rpr[j] = k;
                r--;
            }
#ifdef TRACE_DELTA2
            else
            {
                printf("rpr[%d]=%d already inited\n", j, rpr[j]);
            }
#endif
            // Once we have a mismatch (pat[j] != pat[k-1]) it is fruitless
            // to examine longer candidates at this scan position: none of them
            // can be the rightmost plausible recurrence of a terminal
            // substring pat[j+1 .. patlen-1]
            if (mismatch)
            {
                break;
            }
        }
    }
    // Text-pointer shift for a mismatch at pat[j]
    for (j = 0; j != patlen; j++)
    {
        delta2[j] = patlen - rpr[j];
    }
#ifdef TRACE_DELTA2
    printf("R:"); // trace rpr[] values
    for (j = 0; j != patlen; j++)
    {
        printf(" %3d", rpr[j] );
    }
    printf("\n");
    printf("D:"); // trace delta2[] values
    for (j = 0; j != patlen; j++)
    {
        printf(" %3d", delta2[j] );
    }
    printf("\n");
#endif
    delete [] rpr;
}
/*
 * Boyer-Moore search algorithm.
 * Returns a pointer to the first occurrence of pat in string, or NULL.
 */
const char *boyermoore_search(const char * string, const char *pat)
{
    int i = 0, j = 0, stringlen = 0;
    const char *result = NULL;
    int patlen = (int)strlen(pat);
    int *delta1 = NULL;
    int *delta2 = NULL;
    if (patlen == 0)
        goto out;
    stringlen = (int)strlen(string);
    if (patlen > stringlen)
        goto out;
    delta1 = new int[ALPHABET_SIZE];
    delta2 = new int[patlen];
#ifdef TRACE_BM
    printf("pattern: %s\n", pat);
#endif
    calc_delta1(pat, patlen, delta1);
    calc_delta2(pat, patlen, delta2);
#ifdef TRACE_BM
    printf("\nCalculating boyermoore_search>>>>>>>>>>>>>>>>>>>>>>>>>\n");
#endif
    // i: index of current string character
    for (i = patlen-1;;)
    {
        // Valid indices are 0 .. stringlen-1
        if (i >= stringlen)
        {
            result = NULL;
            goto out;
        }
        // j: index of current pattern character
        j = patlen-1;
        for (;;)
        {
            // With 0-based indices, the whole pattern has matched only
            // after pat[0] itself has been compared, i.e. when j < 0.
            if (j < 0)
            {
                result = &string[i+1];
                goto out;
            }
            if (string[i] == pat[j])
            {
#ifdef TRACE_BM
                printf("p:%*s%*.*s%c%*.*s\n",
                       (i-j), "",
                       j, j, pat,
                       toupper(pat[j]), // mark matched char with upcase
                       patlen-j-1, patlen-j-1, &pat[j+1]);
#endif
                j--;
                i--;
                continue;
            }
            break;
        }
#ifdef TRACE_BM
        printf("p:%*s%*.*s%c%*.*s\n",
               (i-j), "",
               j, j, pat,
               '?', // mark mismatch char
               patlen-j-1, patlen-j-1, &pat[j+1]);
        printf("c:%s\n", string);
#endif
        // bc: "bad character" shift amount
        // (unsigned char cast: plain char may be signed)
        int bc = delta1[(unsigned char)string[i]];
        // gs: "good suffix" shift amount
        int gs = delta2[j];
#ifdef TRACE_BM
        printf("j:%d bc:%d gs:%d\n\n", j, bc, gs);
#endif
        // Advance by the larger shift; gs is always at least m+1,
        // so the pattern never moves backwards.
        i += (bc > gs) ? bc : gs;
    }
    /* not found */
out:
    delete [] delta1;
    delete [] delta2;
    return result;
}
int main(void)
{
    char src_str[80] = "WHICH-FINALLY-HALTS.--AT-THAT-POINT";
    char pat_str[80] = "AT-THAT";
    const char* find_str = NULL;
    find_str = boyermoore_search((const char *)src_str, (const char *)pat_str);
    if(NULL != find_str)
    {
        printf("\n Found pattern string : %s\n", find_str);
    }
    else
    {
        printf("Pattern string not found!\n");
    }
    return 0;
}
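The Boyer-Moore algorithm is sublinear on typical inputs, because it can skip over characters of Text without ever inspecting them; the bound usually quoted for it is O(patLen + n). The longer pattern is, the more efficient the BM algorithm becomes;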
1. A Fast String Searching Algorithm
2. http://en.wikipedia.org/wiki/User:RMcPhillip/sandbox/boyer-moore
Source: oschina
Link: https://my.oschina.net/u/227203/blog/180255