How do I implement a lexer given that I have already implemented a basic regular expression matcher?

太阳男子 2021-02-10 05:00

I'm trying to implement a lexer for fun. I have already implemented a basic regular expression matcher (by first converting a pattern to an NFA and then to a DFA). Now I'm cluel

2 answers
  •  灰色年华
    2021-02-10 05:53

    Assume you have a working regex matcher, regex_match, which returns a boolean (True if a string satisfies the regex). First, you need an ordered list of tokens, tokens_regex, with a regex for each. The order matters because it prescribes precedence.
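
As a minimal sketch of those two ingredients, the block below uses Python's built-in re module as a stand-in for a hand-written matcher. The token names and patterns are hypothetical, and matching_names is a helper invented here only to show how list order resolves overlaps.

```python
import re

def regex_match(string, regex):
    """Return True if the whole of `string` satisfies `regex`."""
    # re.fullmatch stands in for your own NFA/DFA-based matcher (assumption).
    return re.fullmatch(regex, string) is not None

# Hypothetical token set; earlier entries take precedence over later ones.
tokens_regex = [
    ("KEYWORD", r"if|else|while"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER", r"[0-9]+"),
    ("OP", r"[-+*/=()]"),
]

def matching_names(string):
    """List every token name whose regex matches `string`, in precedence order."""
    return [name for name, regex in tokens_regex if regex_match(string, regex)]
```

Here matching_names("if") yields both KEYWORD and IDENT. Because KEYWORD comes first in the list, a lexer that honours the order would report it as a keyword.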

    One algorithm could be (this is not necessarily the only one):

    1. Write a procedure next_token which takes a string and returns the first token, its value and the remaining string (or, for an illegal or ignored character, None, the offending character and the remaining string). Note: this has to respect precedence, and should find the longest token.
    2. Write a procedure lex which repeatedly calls next_token until the whole string has been consumed.

    Something like this (written in Python):

    # Placeholder: fill in your own (name, regex) pairs; order describes precedence.
    tokens_regex = [ (TOKEN_NAME, TOKEN_REGEX), ... ]
    
    def next_token( remaining_string ):
        # The first token type (in precedence order) with any matching prefix wins,
        # and the longest matching prefix for that token type is returned.
        for t_name, t_regex in tokens_regex:
            # Check the longest possibilities first (there may be a more efficient method).
            for i in range( len(remaining_string), 0, -1 ):
                if regex_match( remaining_string[:i], t_regex ):
                    return t_name, remaining_string[:i], remaining_string[i:]
        # No token matched: either an ignored or an illegal character.
        return None, remaining_string[0], remaining_string[1:]
    
    def lex( string ):
        tokens_so_far = []
        remaining_string = string
        while len(remaining_string) > 0:
            # Rebinding remaining_string is what makes the loop advance and terminate.
            t_name, t_value, remaining_string = next_token(remaining_string)
            if t_name is not None:
                tokens_so_far.append((t_name, t_value))
            #elif not regex_match(t_value,ignore_regex):
                #check against ignore regex, if not in it add to an error list/illegal characters
        return tokens_so_far
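
On the "more efficient method" mentioned in the comment: since you already build a DFA, one common approach is to run it over the input once and remember the last position at which it was in an accepting state, rather than re-matching every prefix. The sketch below assumes a hypothetical DFA representation (a dict mapping (state, character) to the next state, plus a start state and a set of accepting states); your own matcher's data structures will differ. Note that next_token_longest picks the longest match across all token types and uses list order only to break ties, which is a different policy from the loop order in the code above.

```python
def longest_match(dfa, start, accepting, text):
    """Return the length of the longest non-empty prefix of `text` the DFA accepts (0 if none)."""
    state = start
    last_accept = 0
    for i, ch in enumerate(text):
        state = dfa.get((state, ch))
        if state is None:          # no transition: no longer prefix can match
            break
        if state in accepting:
            last_accept = i + 1    # remember the most recent accepting position
    return last_accept

def next_token_longest(token_dfas, text):
    """Pick the longest match over all token types; earlier entries win ties."""
    best_name, best_len = None, 0
    for name, (dfa, start, accepting) in token_dfas:
        n = longest_match(dfa, start, accepting, text)
        if n > best_len:           # strict > keeps the higher-precedence token on ties
            best_name, best_len = name, n
    if best_name is None:
        return None, text[0], text[1:]
    return best_name, text[:best_len], text[best_len:]

# Hypothetical hand-built DFA for [0-9]+ : state 0 is the start, state 1 is accepting.
number_dfa = {}
for d in "0123456789":
    number_dfa[(0, d)] = 1
    number_dfa[(1, d)] = 1

token_dfas = [("NUMBER", (number_dfa, 0, {1}))]
```

With these definitions, next_token_longest(token_dfas, "123+4") returns ("NUMBER", "123", "+4"), and each character of the token is examined only once.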
    

    Some things to add to improve your lexer: an ignore regex, error lists, and locations/line numbers (for these errors or for tokens).
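
A rough sketch of how those three improvements might fit together is given below. It is self-contained, so it repeats a simple next_token, and it again uses Python's re module as a stand-in for a hand-written matcher. The token set, the ignore_regex pattern and the shape of the returned tuples are all assumptions made for illustration.

```python
import re

# Hypothetical token set and ignore pattern; order describes precedence.
tokens_regex = [
    ("NUMBER", r"[0-9]+"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP", r"[-+*/=()]"),
]
ignore_regex = r"[ \t\n]"

def regex_match(string, regex):
    # Stand-in for your own matcher: True if the whole string matches.
    return re.fullmatch(regex, string) is not None

def next_token(remaining_string):
    for t_name, t_regex in tokens_regex:
        for i in range(len(remaining_string), 0, -1):
            if regex_match(remaining_string[:i], t_regex):
                return t_name, remaining_string[:i], remaining_string[i:]
    return None, remaining_string[0], remaining_string[1:]

def lex(string):
    tokens, errors = [], []
    line = 1
    remaining_string = string
    while remaining_string:
        t_name, t_value, remaining_string = next_token(remaining_string)
        if t_name is not None:
            tokens.append((t_name, t_value, line))
        elif not regex_match(t_value, ignore_regex):
            errors.append((t_value, line))  # illegal character and its line
        line += t_value.count("\n")         # advance only after recording the position
    return tokens, errors
```

Calling lex("x = 1\ny $ 2") skips the whitespace, tags each token with its line number, and reports "$" on line 2 in the error list instead of silently dropping it.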

    Have fun! And good luck making a parser :).
