I\'m going to implement a tokenizer in Python and I was wondering if you could offer some style advice?
I\'ve implemented a tokenizer before in C and in Java so I\'m
There's an undocumented class in the re
module called re.Scanner
. It's very straightforward to use for a tokenizer:
import re
scanner=re.Scanner([
(r"[0-9]+", lambda scanner,token:("INTEGER", token)),
(r"[a-z_]+", lambda scanner,token:("IDENTIFIER", token)),
(r"[,.]+", lambda scanner,token:("PUNCTUATION", token)),
(r"\s+", None), # None == skip token.
])
results, remainder=scanner.scan("45 pigeons, 23 cows, 11 spiders.")
print results
will result in
[('INTEGER', '45'),
('IDENTIFIER', 'pigeons'),
('PUNCTUATION', ','),
('INTEGER', '23'),
('IDENTIFIER', 'cows'),
('PUNCTUATION', ','),
('INTEGER', '11'),
('IDENTIFIER', 'spiders'),
('PUNCTUATION', '.')]
I used re.Scanner to write a pretty nifty configuration/structured data format parser in only a couple hundred lines.
"Is there a better alternative to just simply returning a list of tuples?"
That's the approach used by the "tokenize" module for parsing Python source code. Returning a simple list of tuples can work very well.
Python takes a "we're all consenting adults" approach to information hiding. It's OK to use variables as though they were constants, and trust that users of your code won't do something stupid.
I've implemented a tokenizer for a C-like programming language. What I did was to split up the creation of tokens into two layers:
Both are generators. The benefits of this approach were:
I feel quite happy with this layered approach.
Thanks for your help, I've started to bring these ideas together, and I've come up with the following. Is there anything terribly wrong with this implementation (particularly I'm concerned about passing a file object to the tokenizer):
class Tokenizer(object):
def __init__(self,file):
self.file = file
def __get_next_character(self):
return self.file.read(1)
def __peek_next_character(self):
character = self.file.read(1)
self.file.seek(self.file.tell()-1,0)
return character
def __read_number(self):
value = ""
while self.__peek_next_character().isdigit():
value += self.__get_next_character()
return value
def next_token(self):
character = self.__peek_next_character()
if character.isdigit():
return self.__read_number()
"Is there a better alternative to just simply returning a list of tuples?"
Nope. It works really well.