How do I implement a lexer given that I have already implemented a basic regular expression matcher?

太阳男子 2021-02-10 05:00

I'm trying to implement a lexer for fun. I have already implemented a basic regular expression matcher (by first converting a pattern to an NFA and then to a DFA). Now I'm clueless about how to use it to build a lexer.

2 Answers
  • 2021-02-10 05:53

    Assume you have a working regex matcher, regex_match, which returns a boolean (True if a string satisfies the regex). You also need an ordered list of tokens (with a regex for each), tokens_regex; the order is important because it prescribes precedence.

    One algorithm could be (this is not necessarily the only one):

    1. Write a procedure next_token which takes a string and returns the first token's name, its value, and the remaining string (or, if it hits an illegal/ignored character, None, the offending character, and the remaining string). Note: this has to respect precedence, and should find the longest token.
    2. Write a procedure lex which repeatedly calls next_token.


    Something like this (written in Python):

    tokens_regex = [(TOKEN_NAME, TOKEN_REGEX), ...] # order describes precedence

    def next_token(remaining_string):
        for t_name, t_regex in tokens_regex: # check in order of precedence
            # check the longest possibilities first (there may be a more efficient method)
            for i in range(len(remaining_string), 0, -1):
                if regex_match(remaining_string[:i], t_regex):
                    return t_name, remaining_string[:i], remaining_string[i:]
        return None, remaining_string[0], remaining_string[1:] # either an ignored or illegal character

    def lex(string):
        tokens_so_far = []
        remaining_string = string
        while len(remaining_string) > 0:
            t_name, t_value, remaining_string = next_token(remaining_string)
            if t_name is not None:
                tokens_so_far.append((t_name, t_value))
            #elif not regex_match(t_value, ignore_regex):
                #check against an ignore regex; if it doesn't match, add it to an error list of illegal characters
        return tokens_so_far
    
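    For example, with a toy token list (the token names and regexes here are hypothetical, and the exact pattern syntax depends on what your matcher supports):

    tokens_regex = [
        ("NUMBER", "[0-9]+"),
        ("IDENT",  "[a-z]+"),
        ("PLUS",   "\\+"),
    ]

    print(lex("foo+42"))
    # [('IDENT', 'foo'), ('PLUS', '+'), ('NUMBER', '42')]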

    Some things to add to improve your lexer: an ignore regex, error lists, and locations/line numbers (for those errors or for tokens).
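
    As a rough sketch of those improvements, assuming next_token and regex_match from above (ignore_regex is a hypothetical pattern for whitespace and the like):

    def lex_with_positions(string, ignore_regex):
        tokens, errors = [], []
        line, col = 1, 1
        remaining_string = string
        while len(remaining_string) > 0:
            t_name, t_value, remaining_string = next_token(remaining_string)
            if t_name is not None:
                tokens.append((t_name, t_value, line, col))
            elif not regex_match(t_value, ignore_regex):
                errors.append((t_value, line, col)) # illegal character and its location
            # advance the line/column counters over the consumed text
            for ch in t_value:
                if ch == "\n":
                    line, col = line + 1, 1
                else:
                    col += 1
        return tokens, errors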

    Have fun! And good luck making a parser :).

  • 2021-02-10 05:55

    I've done pretty much the same thing. The way I did it was to combine all the expressions into one big NFA and then convert that into a single DFA. While doing that, keep track of which states were accepting states in each of the original NFAs, and of their precedence.
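
    A minimal sketch of that union step, assuming an NFA is represented as (start, transitions, accepting), where transitions maps (state, symbol) to a set of states and accepting maps accepting states to token names (the names and representation here are illustrative, not prescribed):

    EPSILON = None # symbol standing in for epsilon transitions in this sketch

    def combine_nfas(nfas):
        # Union the per-token NFAs under a fresh start state with epsilon
        # edges to each original start state. State names are assumed to be
        # disjoint across the input NFAs; the list order gives precedence.
        new_start = "combined_start"
        transitions = {}
        accepting = {} # state -> (precedence, token_name); lower number wins
        for precedence, (start, trans, accept) in enumerate(nfas):
            transitions.setdefault((new_start, EPSILON), set()).add(start)
            for key, targets in trans.items():
                transitions.setdefault(key, set()).update(targets)
            for state, token_name in accept.items():
                accepting[state] = (precedence, token_name)
        return new_start, transitions, accepting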

    The generated DFA will have many accepting states. You run this DFA until it receives a character for which it has no corresponding transition. If the DFA is then in an accepting state, you look at which of your original NFAs had an accepting state merged into it. The one with the highest precedence is the token you return.
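
    In code, that main loop might look something like this (dfa, accepting and start_state are hypothetical structures produced by your subset construction; accepting maps a DFA state to the highest-precedence token name among the NFA accepting states merged into it). The sketch also remembers the last accepting state it passed through, so it returns the longest match:

    def next_dfa_token(dfa, accepting, start_state, text, pos):
        state = start_state
        last_accept = None # (token_name, end_index) of the longest match so far
        i = pos
        while i < len(text) and (state, text[i]) in dfa:
            state = dfa[(state, text[i])]
            i += 1
            if state in accepting:
                last_accept = (accepting[state], i)
        if last_accept is None:
            raise ValueError("no token matches at position %d" % pos)
        token_name, end = last_accept
        return token_name, text[pos:end], end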

    This does not handle regular expression lookaheads. These are typically not really needed for lexer work anyway. That would be the job of the parser.

    Such a lexer runs at much the same speed as a single regular expression, since there is essentially only one DFA for it to run. You can also omit the NFA-to-DFA conversion altogether, for an algorithm that is faster to construct but slower to run; the algorithm is otherwise the same.
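
    For that variant you simulate the combined NFA directly, carrying a set of current states per input character. A sketch, reusing the assumed representation from combine_nfas above:

    def run_nfa(transitions, accepting, start, text, pos):
        def epsilon_closure(states):
            # follow EPSILON edges until no new states are reachable
            stack, seen = list(states), set(states)
            while stack:
                s = stack.pop()
                for t in transitions.get((s, EPSILON), ()):
                    if t not in seen:
                        seen.add(t)
                        stack.append(t)
            return seen

        current = epsilon_closure({start})
        last_accept = None # (token_name, end_index) of the longest match so far
        i = pos
        while i < len(text) and current:
            step = set()
            for s in current:
                step |= transitions.get((s, text[i]), set())
            current = epsilon_closure(step)
            i += 1
            hits = [accepting[s] for s in current if s in accepting]
            if hits:
                last_accept = (min(hits)[1], i) # lowest precedence number wins
        return last_accept # (token_name, end_index), or None if nothing matched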

    The source code for the lexer I wrote is freely available on github, should you want to see how I did it.
