Character-by-character description of flex scanner

Question

I am having a really hard time tracking down a bug in a rather large flex/bison parser (1000 grammar rules, 1500 states, 400 terminals). The scanner matches a terminal that should not arise at this particular point and is not present in the data file.

The input I am trying to parse is

<el Re="1.0" Im="-1.0"/>

and the last few lines of the output are

Reading a token: Next token is token ELEMENTTEXT (1.1-1.1: )
matched 4 characters:  Re=
matched 1 characters: "
matched 6 characters: -1 Im=

This looks like memory corruption, since '-1 Im' is not present in the source. I expected the next token to be '1.0', which matches the token aNumber.

I have checked everything I can think of. I turned on bison debugging, which only confused me more, and am now trying to step through the innards of the scanner one character at a time. Is there any tool that could give me output along the lines of:

next character matched "x" - possible terminals
    ONE
    TWO
    SEVEN
...

Answer 1:

I gather that the debugging output being shown is generated in the parser, rather than in the scanner. The best way to see debugging output from the scanner is to generate the scanner using the -d or --debug command-line option, or to put %option debug in your flex scanner definition. That will print a line to stderr for every matched rule.

DFA-based regex recognition does not provide meaningful character-by-character debugging output; in theory, the progress of the state machine could be traced but it would be very difficult to interpret and probably not all that useful.

The apparently corrupted information in your debugging output in the parser is most likely the result of a scanner action like this:

{some_pattern}       { /* DO NOT DO THIS */ yylval.str = yytext; 
                       return SOME_TOKEN;
                     }

The value of yytext and the memory it points into are private to the scanner yylex, and the values can change without notice. In particular, once yylex is called again to scan the lookahead token, the buffer may well be moved around in unpredictable ways.
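The effect can be reproduced without flex at all. The following is a minimal sketch, assuming a scanner that reuses one buffer for every token: scan_buf, fake_scan and aliasing_demo are hypothetical names invented for this illustration and are not part of flex, whose real buffer management is more involved.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the scanner's private buffer
   (an assumption for illustration, not flex's real internals). */
static char scan_buf[16];

/* Hypothetical stand-in for yylex(): overwrites the shared buffer with the
   next token and returns a pointer into it, much as yytext points into
   the scanner's own buffer. */
static const char *fake_scan(const char *token)
{
    strncpy(scan_buf, token, sizeof scan_buf - 1);
    scan_buf[sizeof scan_buf - 1] = '\0';
    return scan_buf;
}

/* Returns 1 if the aliased pointer now shows the lookahead's text while
   the copy still holds the original token, 0 otherwise, -1 on allocation
   failure. */
static int aliasing_demo(void)
{
    const char *aliased = fake_scan("1.0");  /* like yylval.str = yytext */
    char *copied = strdup(aliased);          /* like yylval.str = strdup(yytext) */
    if (copied == NULL)
        return -1;

    fake_scan("Im=");                        /* the parser requests a lookahead token */

    printf("aliased: %s\n", aliased);        /* the lookahead's text, not "1.0" */
    printf("copied:  %s\n", copied);         /* still "1.0" */

    int result = strcmp(aliased, "Im=") == 0 && strcmp(copied, "1.0") == 0;
    free(copied);
    return result;
}
```

Calling aliasing_demo() from main shows the aliased pointer reporting text that belongs to a later token, which is the same kind of symptom as the '-1 Im=' in the question.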

Instead, you must make a copy of the token string (and remember to free the copy when you no longer need it):

{some_pattern}       { yylval.str = strdup(yytext); 
                       return SOME_TOKEN;
                     }

Note: If you don't want to use strdup (perhaps because your token might include NUL characters), a good alternative is:

char* buf = malloc(yyleng + 1); /* No need to call strlen */
memcpy(buf, yytext, yyleng);    /* Works even if there is a NUL in the token */
buf[yyleng] = 0;                /* Remember to NUL-terminate the copy */
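The three lines above could be wrapped in a small helper so that every action copies tokens the same way and allocation failure is checked. This is only a sketch: copy_token is a hypothetical name, not something flex or bison provides.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a NUL-terminated heap copy of the first
   len bytes of text, or NULL if allocation fails. The caller owns the
   result and must free() it. */
static char *copy_token(const char *text, size_t len)
{
    char *buf = malloc(len + 1);
    if (buf == NULL)
        return NULL;
    memcpy(buf, text, len);   /* copies embedded NUL bytes too */
    buf[len] = '\0';          /* terminate the copy */
    return buf;
}
```

In a scanner action this would be used as yylval.str = copy_token(yytext, yyleng); assuming, as in the examples above, that the semantic value has a char* member named str.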

References: flex manual note on yytext / bison FAQ on destroyed strings

Source: https://stackoverflow.com/questions/33131787/character-by-character-description-of-flex-scanner

Tags

debugging

flex-lexer