I'm currently working on a scanner generator. The generator already works fine. But when using character classes the algorithm gets very slow.
The scanner generator pr
I'll clarify what I think you're asking for: to union a set of Unicode codepoints such that you produce a state-minimal DFA where transitions represent UTF8-encoded sequences for those codepoints.
When you say "more efficiently", that could apply to runtime, memory usage, or to compactness of the end result. The usual meaning for "minimal" in finite automata refers to using the fewest states to describe any given language, which is what you're getting at by "create only the necessary states".
Every finite automaton has exactly one equivalent state-minimal DFA, unique up to renaming of states (see the Myhill-Nerode theorem [1], or Hopcroft & Ullman [2]). For your purposes, we can construct this minimal DFA directly using the Aho-Corasick algorithm [3].
To do this, we need a mapping from Unicode codepoints to their corresponding UTF8 encodings. There's no need to store all of these UTF8 byte sequences in advance; they can be encoded on the fly. The UTF8 encoding algorithm is well documented and I won't repeat it here.
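For reference, on-the-fly encoding can be sketched as below. This is a minimal illustration in Python, not the code from my libraries; the function name `utf8_encode` simply mirrors the pseudocode in this answer, and rejecting surrogates is an assumption about what you want for a scanner.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one Unicode scalar value as a UTF-8 byte sequence (1 to 4 bytes)."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("codepoint outside the Unicode range")
    if 0xD800 <= codepoint <= 0xDFFF:
        # Assumption: surrogates are rejected, as in well-formed UTF-8.
        raise ValueError("surrogate codepoints have no UTF-8 encoding")
    if codepoint < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])
```

Each returned byte sequence is one key to add to the trie, so nothing needs to be precomputed or stored.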
Aho-Corasick works by first constructing a trie. In your case this would be a trie of each UTF8 sequence added in turn. Then that trie is annotated with transitions turning it into a DAG per the rest of the algorithm. There's a nice overview of the algorithm here, but I suggest reading the paper itself.
Pseudocode for this approach:
trie = empty
foreach codepoint in input_set:
    bytes[] = utf8_encode(codepoint)
    trie_add_key(trie, bytes)
dfa = add_failure_edges(trie) # per the rest of AC
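The trie-building half of that pseudocode might look like the following in Python. This is only a sketch under stated assumptions: the dict-based node layout and the names `new_node`, `trie_add_key`, `build_class_trie` and `trie_matches` are hypothetical and are not the API of my libraries. It uses Python's built-in `str.encode` in place of a hand-written encoder, and it deliberately stops before the failure-edge step of Aho-Corasick.

```python
def new_node():
    # One trie state: outgoing byte transitions plus an accepting flag.
    return {"edges": {}, "accept": False}

def trie_add_key(root, key: bytes) -> None:
    """Add one byte sequence, sharing states with any existing common prefix."""
    node = root
    for byte in key:
        if byte not in node["edges"]:
            node["edges"][byte] = new_node()
        node = node["edges"][byte]
    node["accept"] = True

def build_class_trie(codepoints):
    """Build a trie over the UTF-8 encodings of a set of codepoints."""
    root = new_node()
    for cp in codepoints:
        # chr(cp).encode raises for surrogates, which have no UTF-8 form.
        trie_add_key(root, chr(cp).encode("utf-8"))
    return root

def trie_matches(root, data: bytes) -> bool:
    """Anchored match: walk the trie byte by byte, accept only at a key end."""
    node = root
    for byte in data:
        node = node["edges"].get(byte)
        if node is None:
            return False
    return node["accept"]

def count_states(root) -> int:
    """Count trie states, to see how much prefix sharing was achieved."""
    return 1 + sum(count_states(child) for child in root["edges"].values())
```

For example, `build_class_trie([0x61, 0xE9, 0xEA, 0x20AC])` shares the lead byte `0xC3` between U+00E9 and U+00EA, so the two codepoints cost one state for the lead byte plus one state per continuation byte.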
This approach (forming a trie of UTF8-encoded sequences, then applying Aho-Corasick, then rendering out the DFA) is the one taken in my regexp and finite state machine libraries, where I do exactly this to construct Unicode character classes. Here you can see code for:
UTF8-encoding a Unicode codepoint: examples/utf8dfa/main.c
Construction of the trie: libre/ac.c
Rendering out of minimal DFA for each character class: libre/class/
Other approaches (as mentioned in other answers to this question) include working on codepoints and expressing ranges of codepoints, rather than spelling out every byte sequence.
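As a rough illustration of the first step of a range-based approach, a set of codepoints can be coalesced into inclusive ranges before any automaton is built. This is a minimal Python sketch; the function name `codepoints_to_ranges` is hypothetical, and translating the resulting ranges into byte-level transitions is not shown.

```python
def codepoints_to_ranges(codepoints):
    """Merge codepoints into sorted, inclusive (low, high) ranges."""
    ranges = []
    for cp in sorted(set(codepoints)):
        if ranges and cp == ranges[-1][1] + 1:
            # Adjacent to the previous range: extend it by one.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            # Gap before this codepoint: start a new range.
            ranges.append((cp, cp))
    return ranges
```

A class such as `[a-z0-9]` then becomes two ranges rather than 36 separate keys, which is where the savings of a range-based representation come from.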
[1] Myhill-Nerode: Nerode, Anil (1958), Linear Automaton Transformations, Proceedings of the AMS, 9, JSTOR 2033204
[2] Hopcroft & Ullman (1979), Section 3.4, Theorem 3.10, p.67
[3] Aho, Alfred V.; Corasick, Margaret J. (June 1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM. 18 (6): 333–340.