问题
I need three fast-on-large-strings functions: fast search, fast search and replace, and fast count of substrings in a string.
I have run into Boyer-Moore string searches in C++ and Python, but the only Delphi Boyer-Moore algorithm used to implement fast search and replace that I have found is part of the FastStrings by Peter Morris, formerly of DroopyEyes software, and his website and email are no longer working.
I have already ported FastStrings forward to work great for AnsiStrings in Delphi 2009/2010, where a byte is equal to one AnsiChar, but making them also work with the String (UnicodeString) in Delphi 2010 appears non-trivial.
Using this Boyer-Moore algorithm, it should be possible to easily do case insensitive searches, as well as case-insensitive search and replace, without any temporary string (using StrUpper etc), and without calling Pos() which is slower than Boyer-Moore searching when repeated searches over the same text are required.
(Edit: I have a partial solution, written as an answer to this question, it is almost 100% complete, it even has a fast string replace function. I believe it MUST have bugs, and especially think that since it pretends to be Unicode capable that it must be that there are glitches due to unfulfilled Unicode promises. )
(Edit2: Interesting and unexpected result; The large stack size of a unicode code-point table on the stack - SkipTable in the code below puts a serious damper on the amount of win-win-optimization you can do here in a unicode string boyer-moore string search. Thanks to Florent Ouchet for pointing out what I should have noticed immediately.)
回答1:
This answer is now complete and works for case sensitive mode, but does not work for case insensitive mode, and probably has other bugs too, since it's not well unit tested, and could probably be optimized further, for example I repeated the local function __SameChar instead of using a comparison function callback which would have been faster, and actually, allowing the user to pass in a comparison function for all these would be great for Unicode users who want to provide some extra logic (equivalent sets of Unicode glyphs for some languages).
Based on Dorin Dominica's code, I built the following.
{ _FindStringBoyer:
Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM.
Credited to Dorin Duminica.
}
function _FindStringBoyer(const sString, sPattern: string;
const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer;
function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
if bCaseSensitive then
Result := (sString[StringIndex] = sPattern[PatternIndex])
else
Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0);
end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
var
SkipTable: array [Char] of Integer;
LengthPattern: Integer;
LengthString: Integer;
Index: Integer;
kIndex: Integer;
LastMarker: Integer;
Large: Integer;
chPattern: Char;
begin
if fromPos < 1 then
raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]);
LengthPattern := Length(sPattern);
LengthString := Length(sString);
for chPattern := Low(Char) to High(Char) do
SkipTable[chPattern] := LengthPattern;
for Index := 1 to LengthPattern -1 do
SkipTable[sPattern[Index]] := LengthPattern - Index;
Large := LengthPattern + LengthString + 1;
LastMarker := SkipTable[sPattern[LengthPattern]];
SkipTable[sPattern[LengthPattern]] := Large;
Index := fromPos + LengthPattern -1;
Result := 0;
while Index <= LengthString do begin
repeat
Index := Index + SkipTable[sString[Index]];
until Index > LengthString;
if Index <= Large then
Break
else
Index := Index - Large;
kIndex := 1;
while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
Inc(kIndex);
if kIndex = LengthPattern then begin
// Found, return.
Result := Index - kIndex + 1;
Index := Index + LengthPattern;
exit;
end else begin
if __SameChar(Index, LengthPattern) then
Index := Index + LastMarker
else
Index := Index + SkipTable[sString[Index]];
end; // if kIndex = LengthPattern then begin
end; // while Index <= LengthString do begin
end;
{ Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of
a substring inside the main string, at a much faster rate than we
could have done otherwise. Another thing that would be great is
to have a function that returns an array of find-locations,
which would be way faster to do than repeatedly calling Pos.
}
function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer;
var
foundPos:Integer;
fromPos:Integer;
Limit:Integer;
guard:Integer;
SkipTable: array [Char] of Integer;
LengthPattern: Integer;
LengthString: Integer;
Index: Integer;
kIndex: Integer;
LastMarker: Integer;
Large: Integer;
chPattern: Char;
function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
if CaseSensitive then
Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
else
Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
result := 0;
foundPos := 1;
fromPos := 1;
Limit := Length(aSourceString);
guard := Length(aFindString);
Index := 0;
LengthPattern := Length(aFindString);
LengthString := Length(aSourceString);
for chPattern := Low(Char) to High(Char) do
SkipTable[chPattern] := LengthPattern;
for Index := 1 to LengthPattern -1 do
SkipTable[aFindString[Index]] := LengthPattern - Index;
Large := LengthPattern + LengthString + 1;
LastMarker := SkipTable[aFindString[LengthPattern]];
SkipTable[aFindString[LengthPattern]] := Large;
while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin
Index := fromPos + LengthPattern -1;
if Index>Limit then
break;
kIndex := 0;
while Index <= LengthString do begin
repeat
Index := Index + SkipTable[aSourceString[Index]];
until Index > LengthString;
if Index <= Large then
Break
else
Index := Index - Large;
kIndex := 1;
while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
Inc(kIndex);
if kIndex = LengthPattern then begin
// Found, return.
//Result := Index - kIndex + 1;
Index := Index + LengthPattern;
fromPos := Index;
Inc(Result);
break;
end else begin
if __SameChar(Index, LengthPattern) then
Index := Index + LastMarker
else
Index := Index + SkipTable[aSourceString[Index]];
end; // if kIndex = LengthPattern then begin
end; // while Index <= LengthString do begin
end;
end;
This is really a nice Algorithm, because:
- it's way faster to count instances of substring X in string Y this way, magnificently so.
- For merely replacing Pos() the _FindStringBoyer() is faster than the pure asm version of Pos() contributed to Delphi by FastCode project people, that is currently used for Pos, and if you need the case-insensitivity, you can imagine the performance boost when we don't have to call UpperCase on a 100 megabyte string. (Okay, your strings aren't going to be THAT big. But still, Efficient Algorithms are a Thing of Beauty.)
Okay I wrote a String Replace in Boyer-Moore style:
function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String;
var
errors:Integer;
fromPos:Integer;
Limit:Integer;
guard:Integer;
SkipTable: array [Char] of Integer;
LengthPattern: Integer;
LengthString: Integer;
Index: Integer;
kIndex: Integer;
LastMarker: Integer;
Large: Integer;
chPattern: Char;
CaseSensitive:Boolean;
foundAt:Integer;
lastFoundAt:Integer;
copyStartsAt:Integer;
copyLen:Integer;
function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
if CaseSensitive then
Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
else
Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
begin
result := '';
lastFoundAt := 0;
fromPos := 1;
errors := 0;
CaseSensitive := rfIgnoreCase in Flags;
Limit := Length(aSourceString);
guard := Length(aFindString);
Index := 0;
LengthPattern := Length(aFindString);
LengthString := Length(aSourceString);
for chPattern := Low(Char) to High(Char) do
SkipTable[chPattern] := LengthPattern;
for Index := 1 to LengthPattern -1 do
SkipTable[aFindString[Index]] := LengthPattern - Index;
Large := LengthPattern + LengthString + 1;
LastMarker := SkipTable[aFindString[LengthPattern]];
SkipTable[aFindString[LengthPattern]] := Large;
while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin
Index := fromPos + LengthPattern -1;
if Index>Limit then
break;
kIndex := 0;
foundAt := 0;
while Index <= LengthString do begin
repeat
Index := Index + SkipTable[aSourceString[Index]];
until Index > LengthString;
if Index <= Large then
Break
else
Index := Index - Large;
kIndex := 1;
while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
Inc(kIndex);
if kIndex = LengthPattern then begin
foundAt := Index - kIndex + 1;
Index := Index + LengthPattern;
//fromPos := Index;
fromPos := (foundAt+LengthPattern);
if lastFoundAt=0 then begin
copyStartsAt := 1;
copyLen := foundAt-copyStartsAt;
end else begin
copyStartsAt := lastFoundAt+LengthPattern;
copyLen := foundAt-copyStartsAt;
end;
if (copyLen<=0)or(copyStartsAt<=0) then begin
Inc(errors);
end;
Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString;
lastFoundAt := foundAt;
if not (rfReplaceAll in Flags) then
fromPos := 0; // break out of outer while loop too!
break;
end else begin
if __SameChar(Index, LengthPattern) then
Index := Index + LastMarker
else
Index := Index + SkipTable[aSourceString[Index]];
end; // if kIndex = LengthPattern then begin
end; // while Index <= LengthString do begin
end;
if (lastFoundAt=0) then
begin
// nothing was found, just return whole original string
Result := aSourceString;
end
else
if (lastFoundAt+LengthPattern < Limit) then begin
// the part that didn't require any replacing, because nothing more was found,
// or rfReplaceAll flag was not specified, is copied at the
// end as the final step.
copyStartsAt := lastFoundAt+LengthPattern;
copyLen := Limit; { this number can be larger than needed to be, and it is harmless }
Result := Result + Copy(aSourceString, copyStartsAt, copyLen );
end;
end;
Okay, problem: Stack footprint of this:
var
skiptable : array [Char] of Integer; // 65536*4 bytes stack usage on Unicode delphi
Goodbye CPU hell, Hello stack hell. If I go for a dynamic array, then I have to resize it at runtime. So this thing is basically fast, because the Virtual Memory system on your computer doesn't blink at 256K going on the stack, but this is not always an optimal piece of code. Nevertheless my PC doesn't blink at big stack stuff like this. It's not going to become a Delphi standard library default or win any fastcode challenge in the future, with that kinda footprint. I think that repeated searches are a case where the above code should be written as a class, and the skiptable should be a data field in that class. Then you can build the boyer-moore table once, and over time, if the string is invariant, repeatedly use that object to do fast lookups.
回答2:
Since I was just looking for the same: Jedi JCL has got a unicode aware search engine using Boyer-Moore in jclUnicode.pas. I have no idea how good or how fast it is yet.
来源:https://stackoverflow.com/questions/3310865/is-there-a-boyer-moore-string-search-and-fast-search-and-replace-function-and-fa