Pythonic way to implement a tokenizer


青春惊慌失措

I'm going to implement a tokenizer in Python and I was wondering if you could offer some style advice?

I've implemented a tokenizer before in C and in Java so I'm


12 answers

终归单人心

2020-12-30 07:24

There's an undocumented class in the re module called re.Scanner. It's very straightforward to use for a tokenizer:

import re

scanner = re.Scanner([
    (r"[0-9]+", lambda scanner, token: ("INTEGER", token)),
    (r"[a-z_]+", lambda scanner, token: ("IDENTIFIER", token)),
    (r"[,.]+", lambda scanner, token: ("PUNCTUATION", token)),
    (r"\s+", None),  # None means: skip this token.
])

# scan() returns the tokens plus any trailing text that no pattern matched.
results, remainder = scanner.scan("45 pigeons, 23 cows, 11 spiders.")
print(results)

will result in

[('INTEGER', '45'),
 ('IDENTIFIER', 'pigeons'),
 ('PUNCTUATION', ','),
 ('INTEGER', '23'),
 ('IDENTIFIER', 'cows'),
 ('PUNCTUATION', ','),
 ('INTEGER', '11'),
 ('IDENTIFIER', 'spiders'),
 ('PUNCTUATION', '.')]

I used re.Scanner to write a pretty nifty configuration/structured data format parser in only a couple hundred lines.
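
One detail worth handling is the second value returned by scan(): when no pattern matches at some point, scanning stops and the unmatched rest of the input comes back as the remainder. The sketch below turns a non-empty remainder into an error. The function name tokenize_strict is a hypothetical helper, not part of the re module, and since re.Scanner is undocumented its behaviour could in principle change between Python versions.

```python
import re

scanner = re.Scanner([
    (r"[0-9]+", lambda s, tok: ("INTEGER", tok)),
    (r"[a-z_]+", lambda s, tok: ("IDENTIFIER", tok)),
    (r"\s+", None),  # skip whitespace
])

def tokenize_strict(text):
    """Return all tokens, or raise if part of the input matched no pattern."""
    results, remainder = scanner.scan(text)
    if remainder:
        # The remainder is always a suffix of the input, so its length
        # tells us where scanning stopped.
        position = len(text) - len(remainder)
        raise ValueError(f"unexpected input at position {position}: {remainder!r}")
    return results
```

Calling tokenize_strict("45 pigeons") returns the two tokens, while tokenize_strict("45 pigeons!") raises ValueError because "!" matches none of the patterns.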


不知归路

2020-12-30 07:27

"Is there a better alternative to just simply returning a list of tuples?"

That's the approach used by the "tokenize" module for parsing Python source code. Returning a simple list of tuples can work very well.
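
To see what that looks like in practice, here is a small sketch that runs the standard library's tokenize module over a one-line source string. In Python 3, generate_tokens yields named tuples lazily rather than building a list, so the list comprehension below is only there to make the result easy to inspect. The source string is an arbitrary example.

```python
import io
import tokenize

source = "total = price * 3\n"

# generate_tokens() takes a readline callable and yields TokenInfo named
# tuples; we keep just the token type name and the matched text.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]

for kind, text in tokens:
    print(kind, repr(text))
```

Each item is still an ordinary tuple, so it can be unpacked or indexed exactly like the plain tuples discussed in the question.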

北荒

2020-12-30 07:29

Python takes a "we're all consenting adults" approach to information hiding. It's OK to use variables as though they were constants, and trust that users of your code won't do something stupid.
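
For a tokenizer, this usually means token kinds are just upper-case module-level names. A minimal sketch, where the names INTEGER, IDENTIFIER and classify are made up for illustration:

```python
# Token kinds as plain module-level names. Nothing stops a caller from
# rebinding them; the upper-case spelling is the only "keep out" sign.
INTEGER = "INTEGER"
IDENTIFIER = "IDENTIFIER"

def classify(text):
    """Return a (kind, text) tuple for a single lexeme (hypothetical helper)."""
    if text.isdigit():
        return (INTEGER, text)
    return (IDENTIFIER, text)
```

Callers compare against the names, for example `kind == INTEGER`, rather than against the literal strings.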

栀梦

2020-12-30 07:32
I've implemented a tokenizer for a C-like programming language. I split token creation into two layers:
- a surface scanner: this reads the text and uses regular expressions to split it into only the most primitive tokens (operators, identifiers, numbers, ...); it yields tuples (tokenname, scannedstring, startpos, endpos).
- a tokenizer: this consumes the tuples from the first layer and turns them into token objects (named tuples would do as well, I think). Its purpose is to detect some long-range dependencies in the token stream, particularly strings (with their opening and closing quotes) and comments (with their opening and closing lexemes; yes, I wanted to retain comments!), and coerce them into single tokens. The resulting stream of token objects is then returned to a consuming parser.
Both are generators. The benefits of this approach were:
- The raw text is read only in the most primitive way, with simple regexps: fast and clean.
- The second layer is already implemented as a primitive parser, to detect string literals and comments, which re-uses parser technology.
- You don't have to strain the surface scanner with complex detections.
- The real parser still gets tokens on the semantic level of the language to be parsed (again strings, comments).
I feel quite happy with this layered approach.
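
A rough sketch of the two layers might look like the following. This is not the original code from the answer: the token names, the regular expression and the function names surface_scan and tokenize_layered are all assumptions, and only string literals are merged (comments would follow the same pattern).

```python
import re
from collections import namedtuple

Token = namedtuple("Token", "kind text start end")

# Layer 1 patterns: primitive pieces only, no knowledge of string literals.
PRIMITIVES = re.compile(r'''
    (?P<NUMBER>\d+)
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<QUOTE>")
  | (?P<SPACE>\s+)
  | (?P<OP>.)
''', re.VERBOSE)

def surface_scan(text):
    """Layer 1: yield (tokenname, scannedstring, startpos, endpos) tuples."""
    for match in PRIMITIVES.finditer(text):
        yield (match.lastgroup, match.group(), match.start(), match.end())

def tokenize_layered(text):
    """Layer 2: merge everything between two quotes into one STRING token."""
    primitives = surface_scan(text)
    for kind, piece, start, end in primitives:
        if kind == "QUOTE":
            # Keep pulling from the same generator until the closing quote.
            for inner_kind, _, _, inner_end in primitives:
                if inner_kind == "QUOTE":
                    yield Token("STRING", text[start:inner_end], start, inner_end)
                    break
            else:
                raise ValueError(f"unterminated string starting at {start}")
        elif kind != "SPACE":
            yield Token(kind, piece, start, end)
```

Because both layers are generators, `list(tokenize_layered('x = "hi there" 42'))` pulls primitives through on demand and produces four tokens: an identifier, an operator, one STRING token covering the whole quoted text, and a number.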

不知归路

2020-12-30 07:41

Thanks for your help. I've started to bring these ideas together and have come up with the following. Is there anything terribly wrong with this implementation (I'm particularly concerned about passing a file object to the tokenizer)?

class Tokenizer:
    """Reads tokens from a seekable file object, one character at a time."""

    def __init__(self, file):
        self.file = file

    def _get_next_character(self):
        return self.file.read(1)

    def _peek_next_character(self):
        # Remember the position before reading, so that peeking at end of
        # file (where read() returns "") does not step backwards.
        position = self.file.tell()
        character = self.file.read(1)
        self.file.seek(position)
        return character

    def _read_number(self):
        digits = []
        while self._peek_next_character().isdigit():
            digits.append(self._get_next_character())
        return "".join(digits)

    def next_token(self):
        character = self._peek_next_character()

        if character.isdigit():
            return self._read_number()
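
On the concern about passing a file object: peeking with tell() and seek() only works on seekable streams, so it rules out pipes and sockets. One possible alternative, sketched below, is to buffer a single character of lookahead yourself. The class name CharStream and the function read_number are hypothetical names for this illustration; the only assumption about the input is that it has a read() method that returns "" at end of input.

```python
import io

class CharStream:
    """One-character lookahead over any object with read(); no seek() needed."""

    def __init__(self, file):
        self._file = file
        self._buffer = None  # holds a peeked character, if there is one

    def peek(self):
        """Return the next character without consuming it ("" at end of input)."""
        if self._buffer is None:
            self._buffer = self._file.read(1)
        return self._buffer

    def next(self):
        """Consume and return the next character ("" at end of input)."""
        character = self.peek()
        self._buffer = None
        return character

def read_number(stream):
    """Consume a run of digits from the stream and return it as a string."""
    digits = []
    while stream.peek().isdigit():
        digits.append(stream.next())
    return "".join(digits)
```

With `stream = CharStream(io.StringIO("123abc"))`, read_number(stream) consumes the digits and leaves the stream positioned at the first letter. The same wrapper works unchanged for an open text file or sys.stdin.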


借酒劲吻你

2020-12-30 07:41

"Is there a better alternative to just simply returning a list of tuples?"

Nope. It works really well.
