Question
I have no idea how or where to start. I'm supposed to be using Python, and more specifically, the ply library. So far, all I've done is create a list of tokens that will be part of the language. That list is given below:
tokens = (
    # OPERATORS #
    'PLUS',         # +
    'MINUS',        # -
    'MULTIPLY',     # *
    'DIVIDE',       # /
    'MODULO',       # %
    'NOT',          # ~
    'EQUALS',       # =
    # COMPARATORS #
    'LT',           # <
    'GT',           # >
    'LTE',          # <=
    'GTE',          # >=
    'DOUBLEEQUAL',  # ==
    'NE',           # #
    'AND',          # &
    'OR',           # |
    # CONDITIONS AND LOOPS #
    'IF',           # if
    'ELSE',         # else
    'ELSEIF',       # elseif
    'WHILE',        # while
    'FOR',          # for
    # 'DOWHILE',    # haven't thought about this yet
    # BRACKETS #
    'LPAREN',       # (
    'RPAREN',       # )
    'LBRACE',       # [
    'RBRACE',       # ]
    'BLOCKSTART',   # {
    'BLOCKEND',     # }
    # IDENTIFIERS #
    'INTEGER',      # int
    'DOUBLE',       # dbl
    'STRING',       # str
    'CHAR',         # char
    'SEMICOLON',    # ;
    'DOT',          # .
    'COMMA',        # ,
    'QUOTES',       # '
    'DOUBLEQUOTES', # "
    'COMMENTLINE',  # --
    'RETURN',       # return
)
I've obviously got a long way to go, seeing as I also need to write a parser and an interpreter.
I've got a few questions:
- How do I use the ply library?
- Is this a good start, and if so, where do I go from here?
- Are there any resources I can use to help me with this?
I've tried googling material on writing new programming languages, but I haven't yet found anything satisfactory.
Answer 1:
How do I use the ply library?
Assuming that you already have Ply installed, you should start by exploring the tutorials on the official Ply website. They are well written and easy to follow.
Is this a good start, and if so, where do I go from here?
Ply requires token definitions to begin with, and you have already done that. However, the complexities increase when your lexer has to differentiate between, say, a name like forget and a reserved keyword like for. The library also provides good support for precedence rules to resolve grammar ambiguity (note that precedence is applied at the parsing stage, by yacc, not by the lexer). Defining it can be as easy as a tuple of tuples:
precedence = (
    ('left', 'STRING', 'KEYWORD'),
    ('left', 'MULTIPLY', 'DIVIDE')
)
However, I recommend you read more about lexers and yacc before diving into the more advanced Ply features like expressions and precedence. To start, build a simple numerical lexer that tokenizes integers, operators and bracket symbols. I've reduced the token definition to suit this purpose. The following example is adapted from the official tutorials.
Library import & Token definition:
import ply.lex as lex  # library import

# List of token names. This is always required
tokens = [
    # OPERATORS #
    'PLUS',         # +
    'MINUS',        # -
    'MULTIPLY',     # *
    'DIVIDE',       # /
    'MODULO',       # %
    'NOT',          # ~
    'EQUALS',       # =
    # COMPARATORS #
    'LT',           # <
    'GT',           # >
    'LTE',          # <=
    'GTE',          # >=
    'DOUBLEEQUAL',  # ==
    'NE',           # !=
    'AND',          # &
    'OR',           # |
    # BRACKETS #
    'LPAREN',       # (
    'RPAREN',       # )
    'LBRACE',       # [
    'RBRACE',       # ]
    'BLOCKSTART',   # {
    'BLOCKEND',     # }
    # DATA TYPES #
    'INTEGER',      # int
    'FLOAT',        # dbl
    'COMMENT',      # "#"-style comment line
]
Define regular expression rules for simple tokens: Ply uses Python's re library to find regex matches for tokenization. Each token requires a regex definition, and each rule name begins with the special prefix t_ to indicate that it defines a token.

# Regular expression rules for simple tokens
t_PLUS        = r'\+'
t_MINUS       = r'-'
t_MULTIPLY    = r'\*'
t_DIVIDE      = r'/'
t_MODULO      = r'%'
t_LPAREN      = r'\('
t_RPAREN      = r'\)'
t_LBRACE      = r'\['
t_RBRACE      = r'\]'
t_BLOCKSTART  = r'\{'
t_BLOCKEND    = r'\}'
t_NOT         = r'~'
t_EQUALS      = r'='
t_GT          = r'>'
t_LT          = r'<'
t_LTE         = r'<='
t_GTE         = r'>='
t_DOUBLEEQUAL = r'=='
t_NE          = r'!='
t_AND         = r'&'
t_OR          = r'\|'
t_COMMENT     = r'\#.*'
t_ignore      = ' \t'  # ignore spaces and tabs
Define rules for more complex tokens as functions: data types such as INTEGER and FLOAT, plus a newline rule to track line numbers. You will notice that these definitions are quite similar to the above. Note that function rules are matched in the order they are defined, so the FLOAT rule must come before the INTEGER rule; otherwise an input like 16.5 would be tokenized as the INTEGER 16 followed by the FLOAT 0.5.

# Rules for FLOAT and INTEGER tokens
def t_FLOAT(t):
    r'(\d*\.\d+)|(\d+\.\d*)'
    t.value = float(t.value)
    return t

def t_INTEGER(t):
    r'\d+'
    t.value = int(t.value)
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)
Add some error handling for invalid characters:
# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)
Build the lexer:
lexer = lex.lex()
Test the lexer with some input data, tokenize and print tokens:
data = '''
[25/(3*40) + {300-20} -16.5]
{(300-250)<(400-500)}
20 & 30 | 50
 # This is a comment
'''

# Give the lexer some input
lexer.input(data)

# Tokenize
for tok in lexer:
    print(tok)
You can put this example code in a Python script file such as new_lexer.py and run it with python new_lexer.py. You should get the following output. Note that the newline ('\n') characters in the input produce no tokens; the t_newline rule only updates the line counter.
# Output
LexToken(LBRACE,'[',2,1)
LexToken(INTEGER,25,2,2)
LexToken(DIVIDE,'/',2,4)
LexToken(LPAREN,'(',2,5)
LexToken(INTEGER,3,2,6)
LexToken(MULTIPLY,'*',2,7)
LexToken(INTEGER,40,2,8)
LexToken(RPAREN,')',2,10)
LexToken(PLUS,'+',2,12)
LexToken(BLOCKSTART,'{',2,14)
LexToken(INTEGER,300,2,15)
LexToken(MINUS,'-',2,18)
LexToken(INTEGER,20,2,19)
LexToken(BLOCKEND,'}',2,21)
LexToken(MINUS,'-',2,23)
LexToken(FLOAT,16.5,2,24)
LexToken(RBRACE,']',2,28)
LexToken(BLOCKSTART,'{',3,30)
LexToken(LPAREN,'(',3,31)
LexToken(INTEGER,300,3,32)
LexToken(MINUS,'-',3,35)
LexToken(INTEGER,250,3,36)
LexToken(RPAREN,')',3,39)
LexToken(LT,'<',3,40)
LexToken(LPAREN,'(',3,41)
LexToken(INTEGER,400,3,42)
LexToken(MINUS,'-',3,45)
LexToken(INTEGER,500,3,46)
LexToken(RPAREN,')',3,49)
LexToken(BLOCKEND,'}',3,50)
LexToken(INTEGER,20,4,52)
LexToken(AND,'&',4,55)
LexToken(INTEGER,30,4,57)
LexToken(OR,'|',4,60)
LexToken(INTEGER,50,4,62)
LexToken(COMMENT,'# This is a comment',5,65)
There are many other features you can make use of. For instance, debugging can be enabled with lex.lex(debug=True). The official tutorials provide more detailed information about these features.
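To illustrate, here is a minimal sketch of driving the lexer by hand with its token() method while debug logging is enabled (the input string is an arbitrary example):

# Rebuild the lexer with debug logging; Ply prints the master
# regex it builds and each token as it is matched
lexer = lex.lex(debug=True)
lexer.input("20 & 30 | 50")

# lexer.token() returns the next LexToken, or None at end of input
while True:
    tok = lexer.token()
    if tok is None:
        break
    print(tok.type, tok.value, tok.lineno, tok.lexpos)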
I hope this helps to get you started. You can extend the code further to include reserved keywords like if and while, string identification with STRING, and character identification with CHAR. The tutorials cover the implementation of reserved words by defining a key-value dictionary mapping like this:
reserved = {
    'if' : 'IF',
    'then' : 'THEN',
    'else' : 'ELSE',
    'while' : 'WHILE',
    ...
}
Extend the tokens list by defining a generic identifier token type 'ID' and including the reserved dict values: tokens.append('ID') and tokens = tokens + list(reserved.values()). Then add a definition for t_ID that checks each matched name against the dictionary, as sketched below.
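Following the approach in the official Ply documentation, the t_ID rule looks up every matched name in the reserved dict, so keywords get their own token type instead of ID:

# Identifier rule: match a name, then check the reserved dict so
# that keywords like 'if' and 'while' are not tokenized as plain IDs
def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value, 'ID')  # default to 'ID'
    return t

STRING and CHAR can be handled with similar function rules. As a rough sketch (these regexes are my own assumption, not from the tutorials, and they do not handle escape sequences):

def t_STRING(t):
    r'"[^"\n]*"'
    t.value = t.value[1:-1]  # strip the surrounding double quotes
    return t

def t_CHAR(t):
    r"'[^'\n]'"
    t.value = t.value[1:-1]  # strip the surrounding single quotes
    return t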
Are there any resources I can use to help me with this?
There are many resources available for learning about lexers, parsers and compilers. Start with a good book that covers both the theory and the implementation; I liked this one. Here's another resource that may help. If you'd like to explore similar Python libraries or resources, this SO answer may help.
Source: https://stackoverflow.com/questions/55571086/writing-a-lexer-for-a-new-programming-language-in-python