Question
I want to efficiently match thousands of regexps against GBs of text, knowing that most of these regexps will be fairly simple, like:
\bBarack\s(Hussein\s)?Obama\b
\b(John|J\.)\sBoehner\b
etc.
My current idea is to extract from each regexp some kind of longest required substring, use Aho-Corasick to match these substrings and thereby eliminate most of the regexps, and then run the few remaining regexps combined. Can anyone think of something better?
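The prefilter idea above can be sketched in Python. The patterns, the hand-extracted literals, and the window size are illustrative assumptions, and the Aho-Corasick implementation is a minimal textbook version, not a tuned one:

```python
import re
from collections import deque

def build_aho(literals):
    """Build goto/fail/output tables for a minimal Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [set()]
    for idx, lit in enumerate(literals):
        s = 0
        for ch in lit:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(idx)                      # literal idx ends in state s
    q = deque(goto[0].values())              # BFS from the root's children
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]           # inherit matches from the fail state
    return goto, fail, out

def scan(text, goto, fail, out):
    """Single pass over the text; yields (end_index, literal_index) hits."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for idx in out[s]:
            hits.append((i, idx))
    return hits

# Illustrative patterns with one hand-extracted required literal each:
patterns = [r"\bBarack\s(?:Hussein\s)?Obama\b", r"\b(?:John|J\.)\sBoehner\b"]
literals = ["Obama", "Boehner"]
compiled = [re.compile(p) for p in patterns]

g, f, o = build_aho(literals)
text = "Senator Barack Hussein Obama met with John Boehner today."
WINDOW = 40                                  # must cover the longest possible match
matches = set()
for end, idx in scan(text, g, f, o):
    # Run the full regex only in a window around the literal hit.
    m = compiled[idx].search(text, max(0, end - WINDOW), min(len(text), end + WINDOW))
    if m:
        matches.add(m.group(0))
```

The point of the window is that the expensive regex engine only ever runs near a literal hit, so the single Aho-Corasick pass does almost all the work on pattern-free text.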
Answer 1:
You can use (f)lex to generate a DFA, which recognises all the literals in parallel. This might get tricky if there are too many wildcards present, but it works for up to about 100 literals (for a 4-letter alphabet; probably more for natural text). You may want to suppress the default action (ECHO), and only print the line and column numbers of the matches.
[ I assume grep -F does about the same ]
%{
/* C code to be copied verbatim */
#include <stdio.h>
/* flex does not track columns by itself; update this in the rules
   (e.g. via YY_USER_ACTION) if you need real column numbers. */
int yycolumn = 1;
%}
%option yylineno noyywrap
%%
"TTGATTCACCAGCGCGTATTGTC"  { printf("@%d: %d:%s\n", yylineno, yycolumn, "OMG! the TTGA pattern again"); }
"AGGTATCTGCTTCAATCAGCG"    { printf("@%d: %d:%s\n", yylineno, yycolumn, "WTF?!"); }
...
more lines
...
[bd-fh-su-z]+  {;}
[ \t\r\n]+     {;}
.              {;}
%%
int main(void)
{
    /* Call the lexer, then quit. */
    yylex();
    return 0;
}
A script like the one above can be generated from text input with awk or any other scripting language.
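As a sketch of such a generator, here is a hypothetical Python version that takes (literal, message) pairs and emits a flex source file like the one above; the input format and the function name are assumptions:

```python
# Hypothetical generator: turns (literal, message) pairs into a flex source
# file in the shape of the example above.

HEADER = """%{
#include <stdio.h>
%}
%option yylineno noyywrap
%%
"""

FOOTER = """[ \\t\\r\\n]+  {;}
.            {;}
%%
int main(void) { yylex(); return 0; }
"""

def emit_rules(pairs):
    """Emit one flex rule per (literal, message) pair, then the catch-alls."""
    rules = [
        '"%s"  { printf("@%%d: %%s\\n", yylineno, "%s"); }' % (lit, msg)
        for lit, msg in pairs
    ]
    return HEADER + "\n".join(rules) + "\n" + FOOTER

src = emit_rules([("TTGATTCACCAGCGCGTATTGTC", "OMG! the TTGA pattern again"),
                  ("AGGTATCTGCTTCAATCAGCG", "WTF?!")])
```

This version prints only the line number, since (as noted in the comment above) flex does not maintain a column counter for you.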
Answer 2:
A slightly smarter implementation than running every regex on every file:
For each regex:
    load regex into a regex engine
    add the engine to a list of regex engines
For each byte in the file:
    feed the byte to every regex engine
    print results if there are matches
But I don't know of any programs that do this already, so you'd have to code it yourself. It also implies that you have the RAM to keep the regex state around, and that none of your regexes are pathological.
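As a sketch of this loop, each "engine" below is a streaming matcher for a single literal (a KMP automaton standing in for a real incremental regex engine, which Python's re module does not provide); the patterns and text are illustrative:

```python
class StreamMatcher:
    """Feed one character at a time; reports when its literal completes.
    A KMP automaton standing in for a streaming regex engine."""

    def __init__(self, pat):
        self.pat = pat
        self.fail = [0] * len(pat)           # classic KMP failure function
        k = 0
        for i in range(1, len(pat)):
            while k and pat[i] != pat[k]:
                k = self.fail[k - 1]
            if pat[i] == pat[k]:
                k += 1
            self.fail[i] = k
        self.state = 0                       # number of chars matched so far

    def feed(self, ch):
        while self.state and ch != self.pat[self.state]:
            self.state = self.fail[self.state - 1]
        if ch == self.pat[self.state]:
            self.state += 1
        if self.state == len(self.pat):      # full literal matched
            self.state = self.fail[self.state - 1]
        else:
            return False
        return True

# One engine per pattern; every byte is fed to every engine.
engines = [StreamMatcher("Obama"), StreamMatcher("Boehner")]
hits = []
for i, ch in enumerate("Obama met Boehner"):
    for e in engines:
        if e.feed(ch):
            hits.append((e.pat, i))          # i is the match's end index
```

Note that each engine keeps only O(1) state between bytes, which is what makes the single pass over the file possible.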
Answer 3:
I'm not sure if you'd blow some regex size limit, but you could just OR them all together into one giant regex:
((\bBarack\s(Hussein\s)?Obama\b)|(\b(John|J\.)\sBoehner\b)|(etc)|(etc))
If you hit some limit, you could do this with chunks of 100 at a time, or however many you can manage.
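A sketch of the chunked union in Python (the patterns and chunk size are illustrative; the alternatives are wrapped in non-capturing groups so that combining them doesn't shift group numbers):

```python
import re

# Illustrative patterns; a real list would hold thousands of entries.
patterns = [r"\bBarack\s(?:Hussein\s)?Obama\b", r"\b(?:John|J\.)\sBoehner\b"]
CHUNK = 100   # per the answer: combine ~100 patterns per giant regex

# One compiled alternation per chunk of patterns.
combined = [
    re.compile("|".join("(?:%s)" % p for p in patterns[i:i + CHUNK]))
    for i in range(0, len(patterns), CHUNK)
]

text = "Barack Obama spoke; J. Boehner replied."
found = [m.group(0) for big in combined for m in big.finditer(text)]
```

Each chunk is still a single pass over the text, so fewer, larger chunks are generally better as long as the engine accepts them.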
Answer 4:
If you need a really fast implementation for some specific case, you can implement the Aho-Corasick algorithm yourself. But in most cases the union of all your regexes into a single regex, as recommended earlier, will not be bad either.
Source: https://stackoverflow.com/questions/8697456/fast-algorithm-to-extract-thousands-of-simple-patterns-out-of-large-amounts-of-t