问题
I am not able to understand the below algorithm which is used for string pattern matching using Aho-Corasick alg.
Procedure AC(y,n,q0)
INPUT: y<-array of m bytes representing the text input
(SQL Query Statement)
n<-integer representing the text length
(SQL Query Length)
q0<-initial state (first character in pattern)
2: State <-q0
3: For i = 1 to n do
4: While g ( State, y[i] = = fail) do
5: State ← f (State)
6: End While
7: State ← g(State,.y[i])
8: If o(State) then
9: Output i
10: Else
11: Output
12: End If
13: End for
14: End Procedure
回答1:
You probably won't gain a good understanding of the Aho-Corasick algorithm from reading a little bit of pseudocode. Unless you understand the state transition table, the algorithm will make no sense at all.
There's a decent explanation along with an animation at Aho-Corasick implementation and animation.
The original paper, Efficient String Matching: An Aid to Bibliographic Search(PDF), is well written and understandable, and the pseudocode examples are pretty easy to convert to working code. It'll take a little study, but you should have a good understanding after you read the paper, think about it a bit, and then read it again.
回答2:
Aho Corasick algorithm is used to solve set matching problem. It means we have a set of strings S, and here comes a long string L to check whether L contains any one in the previous set S.
An basic solution is using a trie tree, i.e. a prefix tree, please see Wikipedia. There are typically two steps to deal with the problem.
- Precompute the trie tree based on the string set S;
- Scan the string L to check.
The trie tree is easy to understand. It store prefix substrings with nodes starts from the root.
The Aho-Corasick algorithm is an extension of a trie tree, not far from the basic idea. Aho-Corasick algorithm adds a failed pointer to each node on a trie tree.
When fails, a trie tree will restart from the root (add the start index on L by 1), but Aho-Corasick algorithm will restart from the node D pointed by the failed pointer (add the start index on L by the depth of the node D).
Below is a C++ implementation of the Aho-Corasick algorithm. It contains some bugs. http://www.komodia.com/aho-corasick
I fixed the bugs I found. And you can access my version here: https://github.com/elfinhe/KomodiaAhoCorasick
来源:https://stackoverflow.com/questions/20328989/aho-corasick-algorithm