Lex: identifier vs integer

Question

I'm trying to create my own simple programming language. For this I need to insert some regex into Lex. I'm using the following regex to match identifiers and integers.

[a-zA-Z][a-zA-Z0-9]* /* identifier */ return IDENTIFIER;
("+"|"-")?[0-9]+ /* integer */ return INTEGER;

Now when I check for example an illegal identifier like:

0a = 1;

The leading zero is recognized as an integer followed by the 'a' recognized as an identifier. Instead of this I want this token '0a' to be recognized as an illegal character. How do I include this functionality? What regex do I have to adjust?

Answer 1:

The easiest way to do this in (F)lex is to create a second pattern for the error:

[[:alpha:]][[:alnum:]]*  return IDENTIFIER;
[+-]?[[:digit:]]+        return INTEGER;
[+-]?[[:digit:]]+[[:alpha:]]   {
                           fprintf(stderr,
                                   "Incorrect integer '%s' in line %d\n",
                                   yytext, yylineno);
                           return ERROR;
                         }

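The third rule will match any integer with a letter immediately following, and will signal a lexical error. It wins over the plain integer rule because (f)lex always prefers the longest match. (I'm assuming you've enabled %option yylineno. If not, yylineno is never updated, so every error will be reported with the same line number.)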
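To make the longest-match behaviour concrete, here is a minimal hand-written C sketch of how the three patterns compete on the same input. It only imitates the selection rule (longest match, earlier rule wins a tie); it is not what flex generates, and the names next_token, match_identifier, match_integer, match_bad_integer and the TOK_* constants are invented for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds mirroring the three rules above. */
enum token { TOK_NONE, TOK_IDENTIFIER, TOK_INTEGER, TOK_ERROR };

/* Length of the longest prefix matching [[:alpha:]][[:alnum:]]*, or 0. */
static size_t match_identifier(const char *s)
{
    size_t i = 0;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[i]))
        i++;
    return i;
}

/* Length of the longest prefix matching [+-]?[[:digit:]]+, or 0. */
static size_t match_integer(const char *s)
{
    size_t i = 0;
    if (s[i] == '+' || s[i] == '-')
        i++;
    size_t first_digit = i;
    while (isdigit((unsigned char)s[i]))
        i++;
    return i > first_digit ? i : 0;
}

/* Length of the prefix matching [+-]?[[:digit:]]+[[:alpha:]], or 0. */
static size_t match_bad_integer(const char *s)
{
    size_t n = match_integer(s);
    return (n > 0 && isalpha((unsigned char)s[n])) ? n + 1 : 0;
}

/* Choose the longest match; on a tie the earlier rule wins. */
enum token next_token(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    size_t lens[3] = { match_identifier(s), match_integer(s),
                       match_bad_integer(s) };
    enum token kinds[3] = { TOK_IDENTIFIER, TOK_INTEGER, TOK_ERROR };
    enum token best = TOK_NONE;
    *len = 0;
    for (int i = 0; i < 3; i++) {
        if (lens[i] > *len) {   /* strictly longer, so ties keep the earlier rule */
            *len = lens[i];
            best = kinds[i];
        }
    }
    return best;
}
```

On the input "0a = 1;" the integer pattern matches one character and the error pattern matches two, so next_token reports TOK_ERROR with a length of 2. On "42;" only the integer pattern matches, and on "a0 = 1;" only the identifier pattern does.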

An alternative might be to continue the lexical scan. In this case, you might want to rescan the offending alphabetic character. The easiest way to do this in Flex is to use its (idiosyncratic) trailing context operator /:

[[:alpha:]][[:alnum:]]*  return IDENTIFIER;
[+-]?[[:digit:]]+        return INTEGER;
[+-]?[[:digit:]]+/[[:alpha:]]   {
                           fprintf(stderr, 
                                   "Warning: Incorrect integer '%s' in line %d\n",
                                   yytext, yylineno);
                           return INTEGER;
                         }

Now the third rule will match exactly the same thing, but after it matches it will back off to the end of the number so that the next lexeme will start with the alphabetic character.

You can also do this with the yyless() macro:

yyless(n) returns all but the first n characters of the current token back to the input stream…

So you could use:

[[:alpha:]][[:alnum:]]*  return IDENTIFIER;
[+-]?[[:digit:]]+        return INTEGER;
[+-]?[[:digit:]]+[[:alpha:]]   {
                           fprintf(stderr, 
                                   "Warning: Incorrect integer '%s' in line %d\n",
                                   yytext, yylineno);
                           yyless(yyleng - 1);
                           return INTEGER;
                         }
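The effect of pushing the letter back can be sketched in plain C. This is only a toy model of the bookkeeping, assuming a scanner that tracks a read position and the current token's start and length; struct scanner, sketch_yyless and scan_bad_integer are hypothetical names, not part of flex.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical scanner state: pos is the next unread character,
   tok and leng describe the current token inside text. */
struct scanner {
    const char *text;
    size_t pos;
    size_t tok;
    size_t leng;
};

/* Imitates yyless(n): keep the first n characters of the current
   token and return the rest to the input. */
static void sketch_yyless(struct scanner *sc, size_t n)
{
    assert(n <= sc->leng);
    sc->pos = sc->tok + n;
    sc->leng = n;
}

/* Tries to match [+-]?[[:digit:]]+[[:alpha:]] at the current position.
   On success it pushes the trailing letter back, as the rule's action
   does, and returns 1. On failure it returns 0 and changes nothing. */
int scan_bad_integer(struct scanner *sc)
{
    size_t i = sc->pos;
    if (sc->text[i] == '+' || sc->text[i] == '-')
        i++;
    size_t first_digit = i;
    while (isdigit((unsigned char)sc->text[i]))
        i++;
    if (i == first_digit || !isalpha((unsigned char)sc->text[i]))
        return 0;
    sc->tok = sc->pos;
    sc->leng = i + 1 - sc->tok;      /* digits plus the one letter */
    sc->pos = i + 1;
    sketch_yyless(sc, sc->leng - 1); /* give the letter back */
    return 1;
}
```

After scanning "0a = 1;" from position 0, the token has length 1 (just the digit) and the read position points at the 'a', so the next token would start with the letter. The trailing-context version of the rule leaves the scanner in the same state.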

Finally, as @CharlieBurns points out in a comment, you can just let the lexer return two tokens (one number and one identifier) to the parser, which will recognize a syntax error if that sequence is illegal in the language. In many programming languages, no grammatical program can contain an integer immediately followed by an identifier without some punctuation in between.

However, in many other languages, the combination is perfectly reasonable, particularly in languages like Lua where there is no explicit end-of-statement indicator, so

 b = 3 a = 4

is a valid program consisting of two assignment statements. As another example, in Awk string concatenation is represented with no operator and numbers are automatically coerced to strings if necessary, so

print 3 a

will print the concatenation of "3" and the value of a. Lua insists on whitespace in the above example; Awk does not.

As the ultimate example, C and C++ consider 3a to be a single token, a "preprocessing number". If such a token survives preprocessing and reaches the compiler proper, an error will be flagged, but the following program has no syntax errors:

#define NOTHING(x)
NOTHING(3a)

As a possibly more interesting example:

#define CONCAT2(a,b) a##b
#define CONCAT(a,b) CONCAT2(a,b)
static const int the_answer = CONCAT(0x, 2a);
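Both snippets can be combined into one translation unit to check the claim. The sketch below assumes a C11 compiler (for static_assert from <assert.h>); the accessor get_answer is a hypothetical helper added only so the value can be inspected from other code.

```c
#include <assert.h>

#define NOTHING(x)
#define CONCAT2(a,b) a##b
#define CONCAT(a,b) CONCAT2(a,b)

/* 3a is a single preprocessing number; the macro discards it,
   so it never reaches the compiler proper. */
NOTHING(3a)

/* 0x and 2a are each a preprocessing number; pasting them
   produces the valid hexadecimal constant 0x2a. */
static const int the_answer = CONCAT(0x, 2a);

/* Compile-time check that the pasted token has the value 42. */
static_assert(CONCAT(0x, 2a) == 42, "0x ## 2a should paste to 0x2a");

/* Hypothetical accessor, for illustration only. */
int get_answer(void)
{
    return the_answer;
}
```

Neither 0x nor 2a is a valid integer constant on its own, yet the file compiles because each is only required to be a valid token after preprocessing is finished.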

So there is no "one answer fits all".

Source: https://stackoverflow.com/questions/19278859/lex-identifier-vs-integer

Tags

regex

lex