I am working on writing a flex scanner for a language supporting nested comment like this:
/*
/**/
*/
I use to work on ocaml/ocamllex that supp
Assuming that only comments can be nested in comments, a stack is a very expensive solution for what could be achieved with a simple counter. For example:
%x SC_COMMENT
%%
int comment_nesting = 0; /* Line 4 */
"/*" { BEGIN(SC_COMMENT); }
{
"/*" { ++comment_nesting; }
"*"+"/" { if (comment_nesting) --comment_nesting;
else BEGIN(INITIAL); }
"*"+ ; /* Line 11 */
[^/*\n]+ ; /* Line 12 */
[/] ; /* Line 13 */
\n ; /* Line 14 */
}
Some explanations:
Line 4: Indented lines before the first rule are inserted at the top of the yylex
function where they can be used to declare and initialize local variables. We use this to initialize the comment nesting depth to 0 on every call to yylex
. The invariant which must be maintained is that comment_nesting
is always 0 in the INITIAL
state.
Lines 11-13: A simpler solution would have been the single pattern .|\n
. , but that would result in every comment character being treated as a separate subtoken. Even though the corresponding action does nothing, this would have caused the scan loop to be broken and the action switch statement to be executed for every character. So it is usually better to try to match several characters at once.
We need to be careful about / and * characters, though; we can only ignore those asterisks which we are certain are not part of the */ which terminates the (possibly nested) comment. Hence lines 11 and 12. (Line 12 won't match a sequence of asterisks which is followed by a / because those will already have been matched by the pattern above, at line 9.) And we need to ignore / if it is not followed by a *. Hence line 13.
Line 14: However, it can also be sub-optimal to match too large a token.
First, flex is not optimized for large tokens, and comments can be very large. If flex needs to refill its buffer in the middle of a token, it will retain the open token in the new buffer, and then rescan from the beginning of the token.
Second, flex scanners can automatically track the current line number, and they do so relatively efficiently. The scanner checks for newlines only in tokens matched by patterns which could possibly match a newline. But the entire match needs to be scanned.
We reduce the impact of both of these issues by matching newline characters inside comments as individual tokens. (Line 14, also see line 12) This limits the yylineno
scan to a single character, and it also limits the expected length of internal comment tokens. The comment itself might be very large, but each line is likely to be limited to a reasonable length, thus avoiding the potentially quadratic rescan on buffer refill.