Question
I want to efficiently match thousands of regexps against GBs of text, knowing that most of these regexps will be fairly simple, like:
\bBarack\s(Hussein\s)?Obama\b
\b(John|J\.)\sBoehner\b
etc.
My current idea is to extract from each regexp some kind of longest required substring, use Aho-Corasick to match these substrings and thereby eliminate most of the regexps, and then run the few remaining regexps combined. Can anyone think of something better?
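The prefilter idea above can be sketched in Python. The patterns, the hand-extracted literals, and the window size are illustrative assumptions, and the Aho-Corasick implementation is a minimal textbook version, not a tuned one:

```python
import re
from collections import deque

def build_aho(literals):
    """Build goto/fail/output tables for a minimal Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [set()]
    for idx, lit in enumerate(literals):
        s = 0
        for ch in lit:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(idx)                      # literal idx ends in state s
    q = deque(goto[0].values())              # BFS from the root's children
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]           # inherit matches from the fail state
    return goto, fail, out

def scan(text, goto, fail, out):
    """Single pass over the text; yields (end_index, literal_index) hits."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for idx in out[s]:
            hits.append((i, idx))
    return hits

# Illustrative patterns with one hand-extracted required literal each:
patterns = [r"\bBarack\s(?:Hussein\s)?Obama\b", r"\b(?:John|J\.)\sBoehner\b"]
literals = ["Obama", "Boehner"]
compiled = [re.compile(p) for p in patterns]

g, f, o = build_aho(literals)
text = "Senator Barack Hussein Obama met with John Boehner today."
WINDOW = 40                                  # must cover the longest possible match
matches = set()
for end, idx in scan(text, g, f, o):
    # Run the full regex only in a window around the literal hit.
    m = compiled[idx].search(text, max(0, end - WINDOW), min(len(text), end + WINDOW))
    if m:
        matches.add(m.group(0))
```

The point of the window is that the expensive regex engine only ever runs near a literal hit, so the single Aho-Corasick pass does almost all the work on pattern-free text.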
Answer 1:
You can use (f)lex to generate a DFA, which recognises all the literals in parallel. This might get tricky if there are too many wildcards present, but it works for up to about 100 literals (for a 4-letter alphabet; probably more for natural text). You may want to suppress the default action (ECHO), and only print the line and column numbers of the matches.
[ I assume grep -F does about the same ]
%{
/* C code to be copied verbatim */
#include <stdio.h>
/* flex does not track columns by itself; update this in the rules
   (e.g. via YY_USER_ACTION) if you need real column numbers. */
int yycolumn = 1;
%}
%option yylineno noyywrap
%%
"TTGATTCACCAGCGCGTATTGTC"  { printf("@%d: %d:%s\n", yylineno, yycolumn, "OMG! the TTGA pattern again"); }
"AGGTATCTGCTTCAATCAGCG"    { printf("@%d: %d:%s\n", yylineno, yycolumn, "WTF?!"); }
...
more lines
...
[bd-fh-su-z]+  {;}
[ \t\r\n]+     {;}
.              {;}
%%
int main(void)
{
    /* Call the lexer, then quit. */
    yylex();
    return 0;
}
A script like the one above can be generated from text input with awk or any other scripting language.
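As a sketch of such a generator, here is a hypothetical Python version that takes (literal, message) pairs and emits a flex source file like the one above; the input format and the function name are assumptions:

```python
# Hypothetical generator: turns (literal, message) pairs into a flex source
# file in the shape of the example above.

HEADER = """%{
#include <stdio.h>
%}
%option yylineno noyywrap
%%
"""

FOOTER = """[ \\t\\r\\n]+  {;}
.            {;}
%%
int main(void) { yylex(); return 0; }
"""

def emit_rules(pairs):
    """Emit one flex rule per (literal, message) pair, then the catch-alls."""
    rules = [
        '"%s"  { printf("@%%d: %%s\\n", yylineno, "%s"); }' % (lit, msg)
        for lit, msg in pairs
    ]
    return HEADER + "\n".join(rules) + "\n" + FOOTER

src = emit_rules([("TTGATTCACCAGCGCGTATTGTC", "OMG! the TTGA pattern again"),
                  ("AGGTATCTGCTTCAATCAGCG", "WTF?!")])
```

This version prints only the line number, since (as noted in the comment above) flex does not maintain a column counter for you.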
Answer 2:
A slightly smarter implementation than running every regex on every file:
For each regex:
    load regex into a regex engine
    add the engine to a list of regex engines
For each byte in the file:
    feed the byte to every regex engine
    print results if there are matches
But I don't know of any programs that do this already, so you'd have to code it yourself. It also implies that you have the RAM to keep the regex state around, and that none of your regexes are pathological.
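As a sketch of this loop, each "engine" below is a streaming matcher for a single literal (a KMP automaton standing in for a real incremental regex engine, which Python's re module does not provide); the patterns and text are illustrative:

```python
class StreamMatcher:
    """Feed one character at a time; reports when its literal completes.
    A KMP automaton standing in for a streaming regex engine."""

    def __init__(self, pat):
        self.pat = pat
        self.fail = [0] * len(pat)           # classic KMP failure function
        k = 0
        for i in range(1, len(pat)):
            while k and pat[i] != pat[k]:
                k = self.fail[k - 1]
            if pat[i] == pat[k]:
                k += 1
            self.fail[i] = k
        self.state = 0                       # number of chars matched so far

    def feed(self, ch):
        while self.state and ch != self.pat[self.state]:
            self.state = self.fail[self.state - 1]
        if ch == self.pat[self.state]:
            self.state += 1
        if self.state == len(self.pat):      # full literal matched
            self.state = self.fail[self.state - 1]
        else:
            return False
        return True

# One engine per pattern; every byte is fed to every engine.
engines = [StreamMatcher("Obama"), StreamMatcher("Boehner")]
hits = []
for i, ch in enumerate("Obama met Boehner"):
    for e in engines:
        if e.feed(ch):
            hits.append((e.pat, i))          # i is the match's end index
```

Note that each engine keeps only O(1) state between bytes, which is what makes the single pass over the file possible.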
Answer 3:
I'm not sure if you'd blow some regex size limit, but you could just OR them all together into one giant regex:
((\bBarack\s(Hussein\s)?Obama\b)|(\b(John|J\.)\sBoehner\b)|(etc)|(etc))
If you hit some limit, you could do this with chunks of 100 at a time, or however many you can manage.
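A sketch of the chunked union in Python (the patterns and chunk size are illustrative; the alternatives are wrapped in non-capturing groups so that combining them doesn't shift group numbers):

```python
import re

# Illustrative patterns; a real list would hold thousands of entries.
patterns = [r"\bBarack\s(?:Hussein\s)?Obama\b", r"\b(?:John|J\.)\sBoehner\b"]
CHUNK = 100   # per the answer: combine ~100 patterns per giant regex

# One compiled alternation per chunk of patterns.
combined = [
    re.compile("|".join("(?:%s)" % p for p in patterns[i:i + CHUNK]))
    for i in range(0, len(patterns), CHUNK)
]

text = "Barack Obama spoke; J. Boehner replied."
found = [m.group(0) for big in combined for m in big.finditer(text)]
```

Each chunk is still a single pass over the text, so fewer, larger chunks are generally better as long as the engine accepts them.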
Answer 4:
If you need a really fast implementation for some specific case, you can implement the Aho-Corasick algorithm yourself. But in most cases the union of all your regexes into a single regex, as recommended earlier, will not be bad either.
Source: https://stackoverflow.com/questions/8697456/fast-algorithm-to-extract-thousands-of-simple-patterns-out-of-large-amounts-of-t