Question
I'm teaching myself a little flex/bison for fun. I'm writing an interpreter for the 1975 version of MS Extended BASIC (Extended as in "has strings"). I'm slightly stumped by one issue though.
Floats can be identified by looking for a . or an E (etc.), and then fail over to an int otherwise. So I did this...
[0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
    yylval.d = atof(yytext);
    return FLOAT;
}
[0-9]+ {
    yylval.i = atoi(yytext);
    return INT;
}
The sub-fields in the yylval union are .d for double, .i for int, and .s for string.
But it is also possible that you need to use a float because the number is too large to store in an int - which in this case is a 16-bit signed.
Is there a way to do this in the regex? Or do I have to do this in the associated C-side code with an if?
Answer 1:
If you want integer to take priority over float (so that a literal which looks like an integer is an integer), then you need to put the integer pattern first. (The pattern with the longest match always wins, but if two patterns both match the same longest prefix, the first one wins.) So your basic outline is:
integer-pattern { /* integer rule */ }
float-pattern { /* float rule */ }
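For example, a minimal sketch that just swaps the question's two rules (reusing its patterns and yylval fields unchanged):
[0-9]+ {
    yylval.i = atoi(yytext);
    return INT;
}
[0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
    yylval.d = atof(yytext);
    return FLOAT;
}
With this ordering, 123 becomes an INT (both patterns match the same three characters, so the earlier rule wins), while 123.5 is still a FLOAT because the float pattern's match is longer.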
Your float rule looks reasonable, but note that it will match a single ., possibly followed by an exponent. Very few languages consider a lone . as a floating point constant (that literal is conventionally written as 0 :-) ). So you might want to change it to something like
[0-9]*([0-9]\.?|\.[0-9])[0-9]*([Ee][-+]?[0-9]+)?
To use a regex to match a non-negative integer which fits into a 16-bit signed int, you can use the following ugly pattern:
0*([12]?[0-9]{1,4}|3(2(7(6[0-7]|[0-5][0-9])|[0-6][0-9]{2})|[0-1][0-9]{3}))
(F)lex will produce efficient code to implement this regex, but that doesn't necessarily make it a good idea.
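As a rough sketch of how it could fit together, assuming the adjusted float pattern from above and the question's yylval fields:
0*([12]?[0-9]{1,4}|3(2(7(6[0-7]|[0-5][0-9])|[0-6][0-9]{2})|[0-1][0-9]{3})) {
    yylval.i = atoi(yytext);    /* the pattern guarantees 0..32767 */
    return INT;
}
[0-9]*([0-9]\.?|\.[0-9])[0-9]*([Ee][-+]?[0-9]+)? {
    yylval.d = atof(yytext);    /* everything else, e.g. 40000 or 1.5E3 */
    return FLOAT;
}
An out-of-range literal like 40000 then falls through to the float rule automatically: the float pattern matches all five digits, while the integer pattern can only match a four-digit prefix, and the longest match wins.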
Notes:
- The pattern recognises integers with redundant leading zeros, like 09. Some languages (like C) consider that to be an invalid octal literal, but I don't think Basic has that restriction.
- The pattern does not recognise 32768, since that's too big to be a positive integer. However, it is not too big to be a negative integer; -32768 would be perfectly fine. This is always a corner case in parsing integer literals. If you were just lexing integer literals, you could easily handle the difference between positive and negative limits by having a separate pattern for literals starting with a -, but including the sign in the integer literal is not appropriate for expression parsers, since it produces an incorrect lexical analysis of a-1. (It would also be a bit weird for -32768 to be a valid integer literal, while - 32768 is analysed as a floating point expression which evaluates to -32768.0.) There's really no good solution here, unless your language includes unsigned integer literals (like C), in which case you could analyse literals from 0 to 32767 as signed integers; from 32768 to 65535 as unsigned integers; and from 65536 and above as floating point.
Answer 2:
The literals for integer and floating point numbers are the same for many programming languages. For example, the Java Language Specification (and several others) contains the grammar rules for integer and floating-point literals. In these rules, 0 does not validate as a floating point literal. That's the main problem I see with your current approach.
When parsing, you should not use atoi or atof since they don't check for errors. Use strtoul and strtod instead.
The action for the integer rule could then look like this (using the yylval fields from the question; <stdlib.h> and <errno.h> need to be included in the flex prologue):
[0-9]+ {
    errno = 0;
    unsigned long n = strtoul(yytext, NULL, 10);
    if (errno == 0 && n < 0x8000) {
        yylval.i = (int)n;              /* fits in a 16-bit signed int */
        return INT;
    }
    /* too large for INT; strtod always succeeds on a plain digit string */
    yylval.d = strtod(yytext, NULL);
    return FLOAT;
}
Source: https://stackoverflow.com/questions/57500790/tokenizing-ints-vs-floats-in-lex-flex