Character-by-character description of flex scanner

Submitted by 泪湿孤枕 on 2020-01-07 09:20:52

Question


I am having a really hard time tracking down a bug in a rather large flex/bison parser (1000 grammar rules, 1500 states, 400 terminals). The scanner matches a terminal that should not arise at this particular point and is not present in the data file.

The input I am trying to parse is

<el Re="1.0" Im="-1.0"/>

and the last few lines of the output are

Reading a token: Next token is token ELEMENTTEXT (1.1-1.1: )
matched 4 characters:  Re=
matched 1 characters: "
matched 6 characters: -1 Im=

This looks like memory corruption, since '-1 Im' does not appear in the input. I expected the next token to be '1.0', which should match the token aNumber.

I have checked everything I can think of. I turned on bison debugging, which only confused me more, and I am now trying to step through the innards of the scanner one character at a time. Is there any tool that could give me output along these lines:

next character matched "x" - possible terminals
    ONE
    TWO
    SEVEN
...

Answer 1:


I gather that the debugging output being shown is generated in the parser rather than in the scanner. The best way to see debugging output from the scanner is to generate the scanner using the -d or --debug command-line option, or to put %option debug in your flex scanner definition. The scanner will then print a line to stderr for every matched rule.
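As a rough illustration of the option above, here is a minimal flex definition with debugging enabled. The two rules and the token code 1 are placeholders invented for this sketch, not taken from the asker's scanner. With the option on, flex writes a line of the form --accepting rule at line N ("matched text") to stderr for each match, as long as yy_flex_debug is non-zero (which is its default in a debug scanner).

```lex
%option debug noyywrap
%%
[0-9]+"."[0-9]+   { return 1; /* placeholder token code for a number */ }
.|\n              { /* ignore everything else in this sketch */ }
%%
int main(void)
{
    yy_flex_debug = 1;   /* already the default with %option debug; set to 0 to silence the trace */
    while (yylex() != 0)
        ;                /* yylex returns 0 at end of input */
    return 0;
}
```

Setting yy_flex_debug to 0 at run time turns the trace off without regenerating the scanner.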

DFA-based regex recognition does not provide meaningful character-by-character debugging output; in theory, the progress of the state machine could be traced but it would be very difficult to interpret and probably not all that useful.

The apparently corrupted information in your debugging output in the parser is most likely the result of a scanner action like this:

{some_pattern}       { /* DO NOT DO THIS */ yylval.str = yytext; 
                       return SOME_TOKEN;
                     }

The value of yytext and the memory it points into are private to the scanner yylex, and the values can change without notice. In particular, once yylex is called again to scan the lookahead token, the buffer may well be moved around in unpredictable ways.

Instead, you must make a copy of the token string (and remember to free the copy when you no longer need it):

{some_pattern}       { yylval.str = strdup(yytext); 
                       return SOME_TOKEN;
                     }
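The difference between the two actions can be reproduced outside flex with a small simulation. This is only a sketch: scan_buf and fake_scan are hypothetical stand-ins for the scanner's private buffer and for yylex, and the real flex buffer may also be moved or refilled rather than simply overwritten.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup on strict C compilers */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the scanner's private buffer. */
static char scan_buf[16];

/* Pretend to scan one token: overwrite the shared buffer and return a
   pointer into it, much as yytext points into flex's own buffer. */
static const char *fake_scan(const char *token)
{
    assert(strlen(token) < sizeof scan_buf);
    strcpy(scan_buf, token);
    return scan_buf;
}

/* Saves the pointer itself, like yylval.str = yytext.
   Returns 1 if the saved value still reads "1.0" after the next scan. */
static int aliased_value_survives(void)
{
    const char *saved = fake_scan("1.0");
    fake_scan(" Im=");                 /* the lookahead token gets scanned */
    return strcmp(saved, "1.0") == 0;
}

/* Saves a private copy, like yylval.str = strdup(yytext). */
static int copied_value_survives(void)
{
    char *saved = strdup(fake_scan("1.0"));
    if (saved == NULL)
        return 0;                      /* allocation failed */
    fake_scan(" Im=");
    int ok = strcmp(saved, "1.0") == 0;
    free(saved);                       /* the copy is ours to release */
    return ok;
}
```

In this simulation the aliased pointer ends up reading the text of the following token, which resembles the stray text in the asker's trace, while the copy keeps its original value.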

Note: If you don't want to use strdup (perhaps because your token might include NUL characters), a good alternative is:

char* buf = malloc(yyleng + 1); /* No need to call strlen */
memcpy(buf, yytext, yyleng);    /* Works even if there is a NUL in the token */
buf[yyleng] = 0;                /* Remember to NUL-terminate the copy */
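The three lines above could be wrapped in a small helper that also checks the result of malloc. The name copy_token and its signature are assumptions made for this sketch; they are not part of flex or bison.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicate `len` bytes of `text` and append a NUL.
   Returns NULL if allocation fails. The caller owns the result and
   must free it once the value is no longer needed. */
static char *copy_token(const char *text, size_t len)
{
    assert(text != NULL);
    char *buf = malloc(len + 1);
    if (buf == NULL)
        return NULL;
    memcpy(buf, text, len);   /* copies embedded NUL bytes too */
    buf[len] = '\0';          /* terminate the copy */
    return buf;
}
```

In a flex action this might be called as yylval.str = copy_token(yytext, (size_t)yyleng); where the str member follows the example above.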

References: flex manual note on yytext / bison FAQ on destroyed strings



Source: https://stackoverflow.com/questions/33131787/character-by-character-description-of-flex-scanner
