Efficient algorithm for converting a character set into an NFA/DFA


I'm currently working on a scanner generator. The generator already works fine, but when character classes are used the algorithm gets very slow.

The scanner generator pr

5 answers

野的像风

2021-02-14 06:50

In this library (http://mtimmerm.github.io/dfalex/) I do it by putting a range of consecutive characters on each transition, instead of single characters. This is carried through all the steps: NFA construction, NFA->DFA conversion, DFA minimization, and optimization.

It's quite compact, but it adds code complexity to every step.
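To illustrate where the extra complexity comes from, here is a minimal Python sketch of one step that range-labelled transitions force on the NFA->DFA conversion: overlapping ranges leaving a set of NFA states have to be cut into disjoint pieces, each carrying the set of targets reachable on it. This is not DFALex's code; the function name `split_ranges` and the `(lo, hi, target)` tuple layout are assumptions made for this example.

```python
from collections import defaultdict

def split_ranges(transitions):
    """Cut overlapping (lo, hi, target) transitions into disjoint pieces.

    Bounds are inclusive code points. Returns a list of
    (lo, hi, frozenset_of_targets) sorted by lo.
    """
    # Sweep events: a target becomes active at lo and inactive at hi + 1.
    events = defaultdict(lambda: ([], []))
    for lo, hi, target in transitions:
        events[lo][0].append(target)
        events[hi + 1][1].append(target)

    result = []
    active = defaultdict(int)  # target -> number of ranges currently covering it
    prev = None
    for point in sorted(events):
        # Everything between the previous event and this one shares one target set.
        if prev is not None and active:
            result.append((prev, point - 1, frozenset(active)))
        starts, ends = events[point]
        for t in starts:
            active[t] += 1
        for t in ends:
            active[t] -= 1
            if active[t] == 0:
                del active[t]
        prev = point
    return result
```

Each resulting piece becomes one DFA transition whose target is the DFA state for that set of NFA targets. The cost depends on the number of range boundaries, not on the size of the alphabet. Adjacent pieces that end up with the same target set are not merged here; a real implementation would probably do that.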

上瘾入骨i

2021-02-14 06:59

There are a number of ways to handle it. They all boil down to handling whole sets of characters at a time in the data structures, instead of ever enumerating the entire alphabet. This is also how you make scanners for Unicode in a reasonable amount of memory.

You have many choices about how to represent and process sets of characters. I'm presently working with a solution that keeps an ordered list of boundary conditions and corresponding target states. Operations on these lists are much faster than scanning the entire alphabet at each juncture. In fact, it's fast enough that it runs in Python with acceptable speed.
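A minimal sketch of what such an ordered boundary list might look like, assuming each entry marks the first code point of a segment and carries the target state for that segment. The class name `RangeMap` and the helper `merge` are hypothetical, invented for this example rather than taken from the answer's actual code.

```python
import bisect

class RangeMap:
    """Maps code points to a target state via sorted segment boundaries.

    boundaries[i] is the first code point of segment i and targets[i] is the
    state for that segment (None means "no transition"). Segment i runs up to
    boundaries[i + 1] - 1; the last segment runs to the end of the alphabet.
    """

    def __init__(self, boundaries, targets):
        assert boundaries and boundaries[0] == 0
        assert len(boundaries) == len(targets)
        assert all(a < b for a, b in zip(boundaries, boundaries[1:]))
        self.boundaries = boundaries
        self.targets = targets

    def lookup(self, codepoint):
        # Binary search for the last boundary that is <= codepoint.
        i = bisect.bisect_right(self.boundaries, codepoint) - 1
        return self.targets[i]

def merge(a, b, combine):
    """Combine two maps by visiting only their boundaries, never single characters."""
    points = sorted(set(a.boundaries) | set(b.boundaries))
    targets = [combine(a.lookup(p), b.lookup(p)) for p in points]
    return RangeMap(points, targets)
```

For example, merging a map for `0-9` with a map for `a-z` produces a map with five segments, however large the underlying alphabet is.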

夕颜

2021-02-14 07:03
I'll clarify what I think you're asking for: to union a set of Unicode codepoints such that you produce a state-minimal DFA where transitions represent UTF8-encoded sequences for those codepoints.

When you say "more efficiently", that could apply to runtime, memory usage, or to compactness of the end result. The usual meaning for "minimal" in finite automata refers to using the fewest states to describe any given language, which is what you're getting at by "create only the necessary states".

Every finite automaton has exactly one equivalent state-minimal DFA, up to renaming of states (see the Myhill-Nerode theorem [1], or Hopcroft & Ullman [2]). For your purposes, we can construct this minimal DFA directly using the Aho-Corasick algorithm [3].

To do this, we need a mapping from Unicode codepoints to their corresponding UTF8 encodings. There's no need to store all of these UTF8 byte sequences in advance; they can be encoded on the fly. The UTF8 encoding algorithm is well documented and I won't repeat it here.
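For readers who want the encoding step spelled out anyway, here is a small Python sketch of it. The function name `utf8_encode` matches the pseudocode in this answer, but the body is written for this example and is not taken from the libraries mentioned.

```python
def utf8_encode(codepoint):
    """Encode one Unicode scalar value as a bytes object of 1 to 4 UTF-8 bytes."""
    if codepoint < 0 or codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {codepoint:#x}")
    if codepoint < 0x80:            # 1 byte:  0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:         # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])
```

In Python itself, `chr(codepoint).encode("utf-8")` gives the same bytes; the explicit version is only useful as a model for languages without a built-in encoder.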

Aho-Corasick works by first constructing a trie. In your case this would be a trie of each UTF8 sequence added in turn. Then that trie is annotated with transitions turning it into a DAG per the rest of the algorithm. There's a nice overview of the algorithm here, but I suggest reading the paper itself.

Pseudocode for this approach:
```
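trie = empty_trie()
foreach codepoint in input_set:
    bytes = utf8_encode(codepoint)   # encoded on the fly, not stored in advance
    trie_add_key(trie, bytes)        # one trie path per UTF8 byte sequence
dfa = add_failure_edges(trie)        # per the rest of Aho-Corasick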
```
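The trie-building half of that pseudocode can be sketched in Python as follows. This is only an illustration under stated assumptions: the node layout (a list of dicts keyed by byte value) and the names `build_utf8_trie` and `matches` are made up for this example, Python's built-in encoder stands in for `utf8_encode`, and the failure-edge step from Aho-Corasick is deliberately left out, so the matcher below only recognises one complete encoded codepoint.

```python
def build_utf8_trie(codepoints):
    """Build a byte-level trie for the UTF-8 encodings of the given codepoints.

    Returns (nodes, accepting): nodes[i] maps a byte value to a child node
    index, and accepting is the set of nodes that end a complete sequence.
    """
    nodes = [{}]        # node 0 is the root, i.e. the start state
    accepting = set()
    for cp in codepoints:
        state = 0
        for byte in chr(cp).encode("utf-8"):
            if byte not in nodes[state]:
                nodes.append({})
                nodes[state][byte] = len(nodes) - 1
            state = nodes[state][byte]
        accepting.add(state)
    return nodes, accepting

def matches(nodes, accepting, data):
    """Return True if data is exactly the encoding of one codepoint in the set."""
    state = 0
    for byte in data:
        state = nodes[state].get(byte)
        if state is None:
            return False
    return state in accepting
```

Codepoints that share a lead byte share a path prefix, which is why the trie stays far smaller than one transition per codepoint.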
This pipeline (form a trie of UTF8-encoded sequences, run Aho-Corasick, then render out the DFA) is what my regexp and finite state machine libraries do when constructing Unicode character classes. Here you can see code for:
- UTF8-encoding a Unicode codepoint: examples/utf8dfa/main.c
- Construction of the trie: libre/ac.c
- Rendering out of minimal DFA for each character class: libre/class/
Other approaches (as mentioned in other answers to this question) include working on codepoints and expressing ranges of codepoints, rather than spelling out every byte sequence.

[1] Nerode, Anil (1958). Linear Automaton Transformations. Proceedings of the AMS. 9. JSTOR 2033204.
[2] Hopcroft & Ullman (1979). Section 3.4, Theorem 3.10, p. 67.
[3] Aho, Alfred V.; Corasick, Margaret J. (June 1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM. 18 (6): 333–340.
遥遥无期

2021-02-14 07:05
I had the same problem with my scanner generator, so I came up with the idea of replacing intervals with ids that are determined using an interval tree. For instance, the range a..z in a DFA could be spelled out as the symbols 97, 98, 99, ..., 122. Instead I represent it as the interval [97, 122], build an interval tree out of all such intervals, and in the end label each transition with an id that refers to an entry in the interval tree. Given the RE a..z+, we start with this DFA:
```
0 -> a -> 1
0 -> b -> 1
0 -> c -> 1
0 -> ... -> 1
0 -> z -> 1

1 -> a -> 1
1 -> b -> 1
1 -> c -> 1
1 -> ... -> 1
1 -> z -> 1
1 -> E -> ACCEPT
```
Now compress the intervals:
```
0 -> a..z -> 1

1 -> a..z -> 1
1 -> E -> ACCEPT
```
Extract all intervals from your DFA and build an interval tree out of them:
```
{
    "left": null,
    "middle": {
        "id": 0,
        "interval": ["a", "z"]
    },
    "right": null
}
```
Replace the actual intervals with their ids:
```
0 -> 0 -> 1
1 -> 0 -> 1
1 -> E -> ACCEPT
```
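The relabelling steps above can be sketched in Python. To keep the example short it swaps the interval tree for a sorted list searched with `bisect`, which behaves the same way as long as the intervals are disjoint. The names `assign_interval_ids` and `symbol_id` and the dict-based DFA layout are assumptions for illustration, not the answer's actual code.

```python
import bisect

def assign_interval_ids(intervals):
    """Give each distinct (lo, hi) interval a small integer id, ordered by lo."""
    ordered = sorted(set(intervals))
    ids = {iv: i for i, iv in enumerate(ordered)}
    return ids, ordered

def symbol_id(ordered, ch):
    """Return the id of the interval containing ch, or None if there is none.

    Assumes the intervals in ordered are disjoint and sorted.
    """
    cp = ord(ch)
    # Find the last interval whose lower bound is <= cp.
    i = bisect.bisect_right(ordered, (cp, float("inf"))) - 1
    if i >= 0 and ordered[i][0] <= cp <= ordered[i][1]:
        return i
    return None

# The compressed DFA for a..z+, with transitions keyed by (lo, hi) intervals.
dfa = {0: {(97, 122): 1}, 1: {(97, 122): 1}}

ids, ordered = assign_interval_ids(iv for edges in dfa.values() for iv in edges)

# Relabel every transition with its interval id.
compact = {state: {ids[iv]: target for iv, target in edges.items()}
           for state, edges in dfa.items()}
```

At scan time each input character is first mapped to an interval id with `symbol_id`, and the DFA then steps on that id, so the transition table needs one column per interval rather than one per character.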
广开言路

2021-02-14 07:12

Look at what regular expression libraries like Google RE2 and TRE are doing.
