Flex rule with a period “.” is not compiling

Question

I am facing a problem compiling this regular expression with flex:

"on"[ \t\r]*[.\n]{0,300}"."[ \t\r]*[.\n]{0,300}"from"    {counter++;}

I had 100 rules in the rules section of my flex specification file. I tried to compile it with flex -Ce -Ca rule.flex and waited 10 hours; it still had not finished, so I killed it. I then narrowed the problem down to this rule: if I remove it from the 100 rules, flex takes 21 seconds to generate the C code.

If I replace the period with some other character it compiles successfully. e.g.

"on"[ \t\r]*[.\n]{0,300}"A"[ \t\r]*[.\n]{0,300}"from"    {counter++;}

That version compiles in no time. Even a period preceded or followed by a space character compiles quickly:

"on"[ \t\r]*[.\n]{0,300}" ."[ \t\r]*[.\n]{0,300}"from"    {counter++;}

I can see from flex manual that "." matches literal "."

What is wrong with my rule?

Answer 1:

The simple answer is that [.\n] probably doesn't do what you think it does. Inside a character class, most metacharacters lose their special meaning, so that character class contains only two characters: a literal . and a newline. You should use (.|\n).
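The character-class behaviour can be checked quickly outside flex. The sketch below uses Python's re module, which is a different engine from flex but follows the same convention here: a dot inside brackets is a literal dot, and an unbracketed dot does not match a newline by default. The names char_class and any_char are made up for the illustration.

```python
import re

# Inside brackets the dot is literal: this class holds only "." and "\n".
char_class = re.compile(r"[.\n]")
# Alternation of "any character except newline" and newline: any character.
any_char = re.compile(r"(?:.|\n)")

for ch in [".", "\n", "A", " "]:
    print(repr(ch),
          "class:", bool(char_class.fullmatch(ch)),
          "any:", bool(any_char.fullmatch(ch)))
```

Only the dot and the newline match the bracketed form, while all four sample characters match the alternation.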

But that won't solve the problem.

The underlying cause is the use of a fixed repetition count. Large (or even not so large) repetition counts can result in exponential blow-up of the state machine, if the end of the matched region is ambiguous.

With the repetition of [.\n], the repeated match has an unambiguous termination unless the rest of the regex can start with a dot or a newline. So "." triggers the problem, but "A" doesn't. If you correct the repetition to match any character, then any following character will trigger exponential blow-up. So if you make the change suggested above, the regular expression will continue to be uncompilable.
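The effect of the ambiguity can be reproduced with a small subset construction. This is only a sketch of a simplified model, not flex's actual algorithm: it drops the "on" prefix and the optional whitespace, and it shortens "from" to the single character "f". The function count_dfa_states and its parameters are hypothetical names for this illustration.

```python
from collections import deque


def count_dfa_states(separator, n, char_class=frozenset(".\n"), end="f"):
    """Count reachable DFA states for the model  C{0,n} separator C{0,n} end."""
    alphabet = set(char_class) | {separator, end}

    def step(state, ch):
        # NFA states: ("rep1", i), ("rep2", j) or ("accept",), where i and j
        # count the characters consumed by each bounded repetition.
        targets = set()
        if state[0] == "rep1":
            i = state[1]
            if ch in char_class and i < n:
                targets.add(("rep1", i + 1))
            if ch == separator:
                # Ambiguous when the separator is also in the class.
                targets.add(("rep2", 0))
        elif state[0] == "rep2":
            j = state[1]
            if ch in char_class and j < n:
                targets.add(("rep2", j + 1))
            if ch == end:
                targets.add(("accept",))
        return targets

    # Standard subset construction: each DFA state is a set of NFA states.
    start = frozenset({("rep1", 0)})
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for ch in alphabet:
            nxt = frozenset(t for s in current for t in step(s, ch))
            if nxt and nxt not in seen:  # ignore the empty (dead) state
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)


for n in range(1, 11):
    print(n, count_dfa_states(".", n), count_dfa_states("A", n))
```

In this model the count for the "A" separator grows linearly with the repetition limit, while the count for the "." separator is at least 2^n, because the DFA has to remember which of the last n characters were dots.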

Changing the repetition count to an indefinite repetition (the star operator, e.g. [.\n]* instead of [.\n]{0,300}) would avoid the problem, at the cost of no longer enforcing the 300-character limit.

To illustrate the problem, I used the -v option to check the number of DFA states with different repetition counts. The output clearly shows the exponential increase in state count, and it's obvious that going much further than 14 repetitions would be impractical. I didn't show the time consumption; suffice it to say that flex's algorithms are not linear in the size of the DFA, so while each additional repetition doubles the number of states, it roughly quadruples the time. At 16 repetitions, flex took 45 seconds, so it's reasonable to assume that 23 repetitions would take about a week, provided that the 6GB of RAM it would need was available without too much swapping. I didn't try the experiment.

$ cat badre.l
%%
"on"[ \t\r]*[.\n]{0,XXX}"."[ \t\r]*[.\n]{0,XXX}"from"
$ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14; do
>   printf '{0,%d}:\t%24s\n' $i \
>      "$(flex -v -o /dev/null <( sed "s/XXX/$i/g" badre.l) |&
>         grep -o '.*DFA states')"
> done
{0,1}:        17/1000 DFA states
{0,2}:        25/1000 DFA states
{0,3}:        41/1000 DFA states
{0,4}:        73/1000 DFA states
{0,5}:       137/1000 DFA states
{0,6}:       265/1000 DFA states
{0,7}:       521/1000 DFA states
{0,8}:      1033/2000 DFA states
{0,9}:      2057/3000 DFA states
{0,10}:     4105/6000 DFA states
{0,11}:    8201/11000 DFA states
{0,12}:   16393/21000 DFA states
{0,13}:   32777/41000 DFA states
{0,14}:   65545/82000 DFA states
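The "about a week" estimate follows from the figures quoted in the answer: 45 seconds at a count of 16, and roughly four times longer for each additional repetition. The check below assumes the quadrupling holds all the way to 23, which is an extrapolation rather than a measurement; the variable names are made up for the illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

measured_count = 16      # count at which flex was timed (from the answer)
measured_seconds = 45    # reported time at that count
growth_factor = 4        # assumed slowdown per extra repetition
target_count = 23

estimated_seconds = measured_seconds * growth_factor ** (target_count - measured_count)
estimated_days = estimated_seconds / SECONDS_PER_DAY
print(f"{estimated_seconds} s is about {estimated_days:.1f} days")
```

At the same rate, the 300 repetitions in the original rule are far out of reach, which is consistent with the compile never finishing.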

Changing the regex to use (.|\n) for both repetitions roughly triples the number of states, because with that change both repetitions become ambiguous (and there is an interaction between the two of them).

Source: https://stackoverflow.com/questions/35862023/flex-rule-with-a-period-is-not-compiling

Tags

regex

flex-lexer

lex