Flex rule with a period “.” is not compiling

Posted by 人走茶凉 on 2019-12-05 17:56:57

The simple answer is that [.\n] probably doesn't do what you think it does. Inside a character class, most metacharacters lose their special meaning, so that character class contains only two characters: a literal . and a newline. You should use (.|\n).
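
As a quick way to see the difference, here is a small sketch using Python's re module rather than flex. It is only an analogy: the assumption is that the two agree on the points that matter here, namely that "." is literal inside brackets and that a bare "." does not match a newline by default.

```python
import re

narrow = re.compile(r"[.\n]")   # character class: a literal dot or a newline, nothing else
wide = re.compile(r"(.|\n)")    # alternation: any character, including newline

for ch in ["a", ".", "\n"]:
    # Columns: character, matched by [.\n], matched by (.|\n)
    print(repr(ch), bool(narrow.fullmatch(ch)), bool(wide.fullmatch(ch)))
```

Only the second pattern accepts "a"; both accept the dot and the newline.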

But that won't solve the problem.

The underlying cause is the use of a fixed repetition count. Large (or even not so large) repetition counts can result in exponential blow-up of the state machine, if the end of the matched region is ambiguous.

With the repetition of [.\n], the repeated match has an unambiguous termination unless the rest of the regex can start with a dot or a newline. So "." triggers the problem, but "A" doesn't. If you correct the repetition to match any character, then any following character will trigger exponential blow-up. So if you make the change suggested above, the regular expression will continue to be uncompilable.
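
The effect of the following character can be reproduced with a toy subset construction. The sketch below is not flex's algorithm and will not reproduce flex's state counts. It models a simplified pattern, D{0,n} followed by a terminator, then D{0,n}, then "f", where D stands for [.\n]. The function name count_dfa_states and the NFA encoding are made up for this illustration.

```python
from collections import deque

def count_dfa_states(n, terminator):
    # Simplified model of the rule: D{0,n} terminator D{0,n} "f", with D = [.\n].
    # NFA states: ("a", k) / ("b", k) = k items consumed in the first / second
    # repetition; "F" = accepting state reached after the final "f".
    def step(state, ch):
        targets = set()
        if state == "F":
            return targets
        kind, k = state
        if ch in (".", "\n") and k < n:
            targets.add((kind, k + 1))   # consume one more repetition item
        if kind == "a" and ch == terminator:
            targets.add(("b", 0))        # end the first repetition
        if kind == "b" and ch == "f":
            targets.add("F")
        return targets

    alphabet = sorted({".", "\n", "f", terminator})
    start = frozenset({("a", 0)})
    seen = {start}
    queue = deque([start])
    while queue:                          # standard subset construction
        current = queue.popleft()
        for ch in alphabet:
            nxt = frozenset(t for s in current for t in step(s, ch))
            if nxt and nxt not in seen:  # the empty (dead) set is not counted
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)

for n in range(1, 9):
    # Columns: n, states with terminator "A", states with terminator "."
    print(n, count_dfa_states(n, "A"), count_dfa_states(n, "."))
```

In this model the "A" column grows linearly (2n + 3 states), because the end of the first repetition is never in doubt. With "." as the terminator, every dot in the input may either continue the first repetition or end it, so the DFA has to remember which of the last n characters were dots, and the count is at least 2^(n+1) - 1.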

Changing the repetition count to an indefinite repetition (the star operator) would avoid the problem.


To illustrate the problem, I used flex's -v option to check the number of DFA states for different repetition counts. The output clearly shows the exponential increase in state count, and it's obvious that going much further than 14 repetitions would be impossible. (I didn't show the time consumption. Flex's algorithms are not linear in the size of the DFA, so while each additional repetition roughly doubles the number of states, it roughly quadruples the running time. At 16 repetitions, flex took 45 seconds, so it's reasonable to assume that 23 repetitions would take about a week, provided that the 6GB of RAM it would need was available without too much swapping. I didn't try the experiment.)

$ cat badre.l
%%
"on"[ \t\r]*[.\n]{0,XXX}"."[ \t\r]*[.\n]{0,XXX}"from"
$ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14; do
>   printf '{0,%d}:\t%24s\n' $i \
>      "$(flex -v -o /dev/null <( sed "s/XXX/$i/g" badre.l) |&
>         grep -o '.*DFA states')"
> done
{0,1}:        17/1000 DFA states
{0,2}:        25/1000 DFA states
{0,3}:        41/1000 DFA states
{0,4}:        73/1000 DFA states
{0,5}:       137/1000 DFA states
{0,6}:       265/1000 DFA states
{0,7}:       521/1000 DFA states
{0,8}:      1033/2000 DFA states
{0,9}:      2057/3000 DFA states
{0,10}:     4105/6000 DFA states
{0,11}:    8201/11000 DFA states
{0,12}:   16393/21000 DFA states
{0,13}:   32777/41000 DFA states
{0,14}:   65545/82000 DFA states
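
The doubling in the table and the "about a week" estimate can be checked with a few lines of arithmetic. In the sketch below, the closed form 2^(n+2) + 9 is simply fitted to the reported counts and is not derived from flex's internals. The timing function assumes that the running time keeps quadrupling with each repetition, starting from the 45 seconds measured at 16 repetitions.

```python
# State counts reported by flex -v for {0,n}, n = 1..14 (copied from the table).
reported = [17, 25, 41, 73, 137, 265, 521, 1033, 2057,
            4105, 8201, 16393, 32777, 65545]

def fitted_states(n):
    # Closed form fitted to the table above (an observation, not flex's formula).
    return 2 ** (n + 2) + 9

# Every reported count matches the fitted formula.
assert all(fitted_states(n) == count for n, count in enumerate(reported, start=1))

def estimated_seconds(reps, base_reps=16, base_seconds=45, factor=4):
    # Assumes the time quadruples with each additional repetition.
    return base_seconds * factor ** (reps - base_reps)

days = estimated_seconds(23) / 86400
print(f"{days:.1f} days")   # prints "8.5 days"
```

Under those assumptions, 23 repetitions come out at roughly eight and a half days, which is consistent with the rough one-week figure.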

Changing the regex to use (.|\n) for both repetitions roughly triples the number of states, because with that change both repetitions become ambiguous (and there is an interaction between the two of them).
