Copying entire input line in (f)lex (for better error messages)?

Question

As part of a typical parser using yacc (or bison) and lex (or flex), I'd like to copy entire input lines in the lexer so that, if there's an error later, the program can print out the offending line in its entirety and put a caret ^ under the offending token.

To copy the line, I'm currently doing:

char *line;        // holds copy of entire line
bool copied_line;

%%

^.+  {
       if ( !copied_line ) {
          free( line );
          line = strdup( yytext );
          copied_line = true;
       }
       REJECT;
     }

/* ... other tokens ... */

\n   { copied_line = false; return END; }

This works, but stepping through it in a debugger shows that it's really inefficient. The REJECT seems to make the lexer back off one character at a time rather than jump straight to the next possible match.

Is there a better, more efficient way to get what I want?

Answer 1:

Here's a possible definition of YY_INPUT using getline() (a POSIX function). It should work as long as no token includes both a newline character and the character that follows it. (A token may end with a newline character.) Specifically, current_line will contain the last line of the current token.

On successful completion of the lexical scan, current_line will be freed and the remaining global variables reset so that another input can be lexically analysed. If the lexical scan is discontinued before end of input is reached (for example, because the parse was unsuccessful), an explicit call should be made to reset_current_line() in order to perform these tasks.

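/* In a flex file, this belongs in the definitions section inside %{ ... %}. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>            /* ssize_t */

char* current_line = NULL;        /* most recently read line, owned by getline() */
size_t current_line_alloc = 0;    /* allocated size of current_line */
ssize_t current_line_sent = 0;    /* bytes of current_line already handed to flex */
ssize_t current_line_len = 0;     /* length of current_line */

void reset_current_line(void) {
  free(current_line);
  current_line = NULL;
  current_line_alloc = current_line_sent = current_line_len = 0;
}

/* Hand flex the unsent part of the current line, reading a new line first
 * if the previous one has been fully consumed. Returns 0 at end of input.
 * Reads from stdin, as in the original; substitute the scanner's input
 * stream if it differs.
 */
ssize_t refill_flex_buffer(char* buf, size_t max_size) {
  ssize_t avail = current_line_len - current_line_sent;
  if (!avail) {
    current_line_sent = 0;
    avail = getline(&current_line, &current_line_alloc, stdin);
    if (avail < 0) {
      /* perror() appends ": " and the error text itself. */
      if (ferror(stdin)) { perror("Could not read input"); }
      avail = 0;
    }
    current_line_len = avail;
  }
  if ((size_t)avail > max_size) avail = (ssize_t)max_size;
  memcpy(buf, current_line + current_line_sent, avail);
  current_line_sent += avail;
  if (!avail) reset_current_line();
  return avail;
}

#define YY_INPUT(buf, result, max_size) \
  result = refill_flex_buffer(buf, max_size);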

The code above does not depend on tracking the current column position, but tracking it is important if you want to identify where the current token lies within the current line. The following will help, provided you don't use yyless or yymore:

size_t current_col = 0, current_col_end = 0;
/* Call this in any token whose last character is \n,
 * but only after making use of column information.
 */
void reset_current_col() {
  current_col = current_col_end = 0;
}
#define YY_USER_ACTION \
  { current_col = current_col_end; current_col_end += yyleng; }
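
With current_line and current_col available, an error reporter could print the line followed by a marker line. The following is only a sketch: make_caret_line() and print_error_line() are hypothetical helpers, not part of flex or of the code above. Tabs are copied into the padding so that the caret still lines up when the line contains tabs.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: fill `out` with padding followed by a single '^'
 * so that, printed under `line`, the caret sits below 0-based column `col`.
 * Tabs in `line` are copied as tabs so the padding expands the same way
 * the line itself does.
 * Returns 0 on success, -1 if `line` is shorter than `col` or `out` is
 * too small.
 */
int make_caret_line(const char *line, size_t col, char *out, size_t out_size) {
  if (out_size < 2 || col > out_size - 2) return -1;  /* padding + '^' + '\0' */
  for (size_t i = 0; i < col; ++i) {
    if (line[i] == '\0' || line[i] == '\n') return -1;
    out[i] = (line[i] == '\t') ? '\t' : ' ';
  }
  out[col] = '^';
  out[col + 1] = '\0';
  return 0;
}

/* Hypothetical helper: print the offending line and the caret under it.
 * Assumes `line` still ends with '\n', as getline() leaves it.
 */
void print_error_line(FILE *fp, const char *line, size_t col) {
  char marker[256];
  fputs(line, fp);
  if (make_caret_line(line, col, marker, sizeof marker) == 0)
    fprintf(fp, "%s\n", marker);
}
```

An error routine might then call print_error_line(stderr, current_line, current_col), provided current_line has not yet been replaced by a later line.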

If you are using this scanner with a parser with lookahead, it may not be sufficient to keep only one line of the input stream, since the lookahead token may be on a subsequent line to the error token. Keeping several retained lines in a circular buffer would be a simple enhancement, but it is not at all obvious how many lines are necessary.
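
One way to sketch the circular buffer mentioned above, under the assumption that each token records the line number it started on. The names remember_line(), recall_line() and KEPT_LINES are made up for illustration, and the value 4 is arbitrary.

```c
#include <stdlib.h>
#include <string.h>

#define KEPT_LINES 4               /* arbitrary; the right number depends on the grammar */

static char  *kept[KEPT_LINES];    /* line n (1-based) lives in kept[(n - 1) % KEPT_LINES] */
static size_t lines_seen = 0;      /* number of lines remembered so far */

/* Store a copy of `line` (`len` bytes) as the next line.
 * Returns its 1-based line number, or 0 if memory could not be allocated.
 */
size_t remember_line(const char *line, size_t len) {
  char *copy = malloc(len + 1);
  if (copy == NULL) return 0;
  memcpy(copy, line, len);
  copy[len] = '\0';
  size_t slot = lines_seen % KEPT_LINES;
  free(kept[slot]);                /* discard the oldest retained line */
  kept[slot] = copy;
  return ++lines_seen;
}

/* Return the text of 1-based line `lineno`, or NULL if that line has not
 * been seen yet or has already been overwritten.
 */
const char *recall_line(size_t lineno) {
  if (lineno == 0 || lineno > lines_seen) return NULL;
  if (lines_seen - lineno >= KEPT_LINES) return NULL;
  return kept[(lineno - 1) % KEPT_LINES];
}
```

In refill_flex_buffer(), remember_line(current_line, current_line_len) could be called after each successful getline(), and the error reporter would then look up the line number stored with the offending token.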

Answer 2:

Based on the hint from @Serge Ballesta of using YY_INPUT:

#define YY_INPUT( BUF, RESULT, MAX_SIZE ) \
  (RESULT) = lexer_get_input( (BUF), (MAX_SIZE) )

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static size_t column;          // current 0-based column
static char  *input_line;      // copy of the current input line
static size_t input_line_len;  // length of input_line, excluding the null

static size_t lexer_get_input( char *buf, size_t buf_size ) {
  size_t bytes_read = 0;

  for ( ; bytes_read < buf_size; ++bytes_read ) {
    int const c = getc( yyin );
    if ( c == EOF ) {
      if ( ferror( yyin ) )
        /* complain and exit */;
      break;
    }
    buf[ bytes_read ] = (char)c;
    if ( c == '\n' ) {
      ++bytes_read;            // count the newline so the lexer sees it
      break;
    }
  } // for

  // Copy only at the start of a line, and only when the loop above
  // stopped before filling the buffer.
  if ( column == 0 && bytes_read < buf_size ) {
    static size_t input_line_capacity;
    if ( input_line_capacity < bytes_read + 1/*null*/ ) {
      input_line_capacity = bytes_read + 1/*null*/;
      input_line = (char*)realloc( input_line, input_line_capacity );
    }
    strncpy( input_line, buf, bytes_read );
    input_line_len = bytes_read;
    input_line[ input_line_len ] = '\0';
  }

  return bytes_read;
}

The first time this is called, column will be 0, so it will copy the entire line into input_line. On subsequent calls, nothing special needs to be done. Eventually, column will be reset to 0 upon encountering a newline (column is maintained elsewhere in the lexer, not by this function); then the next time the function is called, it will again copy the line.

This seems to work and is a lot more efficient. Anybody see any problems with it?

Answer 3:

Under the assumption that your input stems from a seekable stream:

Count the number N of newlines encountered
In case of error, seek and output line N + 1

Even if input is from a non-seekable stream, you could save all characters in a temporary store.

Variations on this theme are possible, such as storing the offset of the last newline seen, so you can directly seek to it.
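
A rough sketch of the offset variation, assuming the scanner itself keeps track of the byte offset at which the current line starts (for example by adding up token lengths). Because flex normally reads its input in blocks, the stream's own position during scanning will usually be ahead of the current token, so ftell() on the input stream alone would not give that offset. read_line_at() is a hypothetical helper.

```c
#include <stdio.h>

/* Hypothetical helper: read the line that starts at byte `offset` of the
 * seekable stream `fp` into `buf`, then restore the stream position.
 * Returns `buf` on success, NULL if the stream cannot be repositioned or
 * nothing could be read.
 */
char *read_line_at(FILE *fp, long offset, char *buf, int buf_size) {
  long saved = ftell(fp);
  if (saved < 0) return NULL;                        /* not seekable */
  if (fseek(fp, offset, SEEK_SET) != 0) return NULL;
  char *result = fgets(buf, buf_size, fp);
  if (fseek(fp, saved, SEEK_SET) != 0) return NULL;  /* put the position back */
  return result;
}
```

A line longer than buf_size - 1 bytes is returned truncated, which is usually acceptable for an error message.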

Answer 4:

In flex, you can use YY_USER_ACTION, which, if defined as a macro, will run for every token, just before running the token action. So something like:

#define YY_USER_ACTION  append_to_buffer(yytext);

will append yytext to a buffer where you can later use it.
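
The answer leaves append_to_buffer() undefined. One possible shape for it, purely as an illustration: the buffer names are invented, and the choice to start over after a newline is an assumption about what "the current line" should mean.

```c
#include <stdlib.h>
#include <string.h>

static char  *line_buf = NULL;   /* text of the current line so far */
static size_t line_len = 0;      /* bytes used, excluding the '\0' */
static size_t line_cap = 0;      /* bytes allocated */

/* Append one token's text to line_buf. If the text appended previously
 * ended with a newline, that line is complete and the buffer starts over.
 */
void append_to_buffer(const char *text) {
  size_t n = strlen(text);
  if (line_len > 0 && line_buf[line_len - 1] == '\n')
    line_len = 0;
  if (line_len + n + 1 > line_cap) {
    size_t new_cap = (line_len + n + 1) * 2;
    char *p = realloc(line_buf, new_cap);
    if (p == NULL) return;       /* out of memory: keep what we have */
    line_buf = p;
    line_cap = new_cap;
  }
  memcpy(line_buf + line_len, text, n + 1);   /* copies the '\0' too */
  line_len += n;
}
```

Note that a buffer built this way holds the line only up to and including the most recent token, so the text after the offending token is not available unless the lexer keeps scanning to the end of the line. Tokens that contain an embedded newline would also need extra handling.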

Source: https://stackoverflow.com/questions/43246147/copying-entire-input-line-in-flex-for-better-error-messages

Tags

flex-lexer

lex