Parsing tokens with PLY

Question

I've been trying to parse some given text with PLY for a while and I haven't been able to figure it out. I have these tokens defined:

tokens = ['ID', 'INT', 'ASSIGNMENT']

And I want to classify the words I find into these tokens. For example, if the scanner is given:

var = 5

It should print this:

ID : 'var'
ASSIGNMENT : '='
INT : 5

This works just fine. The problem is when the program is given the following text:

9var = 5

The output for this would be:

INT : 9
ID : 'var'
ASSIGNMENT : '='
INT : 5

This is where it goes wrong. It should take 9var as an ID, and according to the ID regex, that is not a valid name for an ID. These are my regular expressions:

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*' 
    return t

def t_INT(t):
    r'\d+'
    t.value = int(t.value)
    return t

t_ASSIGNMENT = r'\='

How can I fix this?

Your help would be appreciated!

Answer 1:

You say: "It should take 9var as an ID". But then you point out that 9var doesn't match the ID regex pattern. So why should 9var be scanned as an ID?

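If you want 9var to be an ID, it would be easy enough to change the regex from [a-zA-Z_][a-zA-Z_0-9]* to [a-zA-Z_0-9]+. That pattern also matches pure integers, so in a longest-match scanner you would need to list the INT pattern first. PLY does not use longest match, though: it tries function rules in definition order and takes the first one that matches, so with t_INT first, 9var would still be split into INT and ID. In PLY it is simpler to use [a-zA-Z_0-9]*[a-zA-Z_][a-zA-Z_0-9]*, which requires at least one non-digit, and to define t_ID before t_INT.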
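To see how the two candidate ID patterns behave, here is a small sketch that uses only the standard re module rather than PLY itself. The helper first_token and the names LOOSE_ID and MIXED_ID are made up for illustration; the helper joins the rules into one ordered alternation, which is roughly how PLY combines its rules, so the first rule that matches wins.

```python
import re

# Hypothetical names for the two ID patterns discussed above.
LOOSE_ID = r'[a-zA-Z_0-9]+'                        # also matches pure integers
MIXED_ID = r'[a-zA-Z_0-9]*[a-zA-Z_][a-zA-Z_0-9]*'  # needs at least one non-digit
INT = r'\d+'


def first_token(text, rules):
    """Return (rule_name, lexeme) for the first rule that matches at position 0.

    `rules` is an ordered list of (name, pattern) pairs. They are joined into
    a single alternation, so earlier rules take priority over later ones.
    """
    master = re.compile('|'.join('(?P<%s>%s)' % rule for rule in rules))
    m = master.match(text)
    return (m.lastgroup, m.group()) if m else None


# INT listed first: the leading digit is consumed on its own.
print(first_token('9var', [('INT', INT), ('ID', LOOSE_ID)]))

# MIXED_ID listed first: 9var is one ID, and 123 still falls through to INT.
print(first_token('9var', [('ID', MIXED_ID), ('INT', INT)]))
print(first_token('123', [('ID', MIXED_ID), ('INT', INT)]))
```

In a real PLY lexer the same ordering would come from the order in which t_ID and t_INT are defined as functions.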

I suspect that what you really want is for 9var to be recognized as a lexical error rather than a parsing error. But if it is going to be recognized as an error in any case, does it really matter whether it is a lexical error or a syntax error?

It's worth mentioning that the Python lexer works exactly the way your lexer does: it will scan 9var as two tokens, and that will later create a syntax error.

Of course, it is possible that in your language, there is some syntactically correct construction in which an ID can directly follow an INT. Or, if not, where a keyword can directly follow an INT, such as the Python expression 3 if x else 2. (Again, Python doesn't complain if you write that as 3if x else 2.)

So if you really, really insist on flagging a scanner error for tokens which start with a digit and continue with non-digits, you can insert another pattern, such as [0-9]+[a-zA-Z_][a-zA-Z_0-9]*, and have it raise an error in its action. In PLY, define that rule as a function above t_INT so that it is tried before the INT rule.
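A rough sketch of such an error rule follows. The rule functions are written in PLY's style, with the regex in the docstring, but the driver is a stand-in built on the standard re module so that the example runs without PLY installed. The names t_BADID, LexError, and scan are assumptions made for this illustration and are not part of PLY.

```python
import re
from types import SimpleNamespace


class LexError(Exception):
    """Hypothetical error type raised for malformed input."""


def t_BADID(t):
    r'[0-9]+[a-zA-Z_][a-zA-Z_0-9]*'
    # Digits followed by identifier characters: reject instead of splitting.
    raise LexError("invalid identifier %r" % t.value)


def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    return t


def t_INT(t):
    r'\d+'
    t.value = int(t.value)
    return t


def t_ASSIGNMENT(t):
    r'='
    return t


# Order matters: t_BADID must come before t_INT so it gets the first chance.
RULES = [t_BADID, t_ID, t_INT, t_ASSIGNMENT]
MASTER = re.compile('|'.join('(?P<%s>%s)' % (f.__name__, f.__doc__) for f in RULES))
ACTIONS = {f.__name__: f for f in RULES}


def scan(text):
    """Stand-in for a PLY lexer loop: return a list of (type, value) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():          # skip whitespace between tokens
            pos += 1
            continue
        m = MASTER.match(text, pos)
        if m is None:
            raise LexError("illegal character %r" % text[pos])
        # Strip the "t_" prefix to get the token type, then run the action.
        tok = SimpleNamespace(type=m.lastgroup[2:], value=m.group())
        tok = ACTIONS[m.lastgroup](tok)
        tokens.append((tok.type, tok.value))
        pos = m.end()
    return tokens


print(scan('var = 5'))
try:
    scan('9var = 5')
except LexError as exc:
    print('lexical error:', exc)
```

With PLY itself, the same four rule functions could be handed to lex.lex() together with a tokens list; only the scan loop here is a substitute for PLY's own machinery.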

Source: https://stackoverflow.com/questions/30118046/parsing-tokens-with-ply

Tags

python

regex

parsing

token

ply