Pythonic way to implement a tokenizer

前端未结

关注

 12  652

青春惊慌失措

I\'m going to implement a tokenizer in Python and I was wondering if you could offer some style advice?

I\'ve implemented a tokenizer before in C and in Java so I\'m

相关标签:

12条回答

南旧

2020-12-30 07:41

I'd turn to the excellent Text Processing in Python by David Mertz

0 讨论(0)
发布评论:

提交评论
- 加载中...
再見小時候

2020-12-30 07:45

When I start something new in Python I usually look first at some modules or libraries to use. There's 90%+ chance that there already is somthing available.

For tokenizers and parsers this is certainly so. Have you looked at PyParsing ?

0 讨论(0)
发布评论:

提交评论
- 加载中...
伪装坚强ぢ

2020-12-30 07:45

"Is there a better alternative to just simply returning a list of tuples"

I had to implement a tokenizer, but it required a more complex approach than a list of tuples, therefore I implemented a class for each token. You can then return a list of class instances, or if you want to save resources, you can return something implementing the iterator interface and generate the next token while you progress in the parsing.

0 讨论(0)
发布评论:

提交评论
- 加载中...
梦如初夏

2020-12-30 07:46

In many situations, exp. when parsing long input streams, you may find it more useful to implement you tokenizer as a generator function. This way you can easily iterate over all the tokens without the need for lots of memory to build the list of tokens first.

For generator see the original proposal or other online docs

0 讨论(0)
发布评论:

提交评论
- 加载中...
[愿得一人]

2020-12-30 07:47
I have recently built a tokenizer, too, and passed through some of your issues.

Token types are declared as "constants", i.e. variables with ALL_CAPS names, at the module level. For example,
```
_INTEGER = 0x0007
_FLOAT = 0x0008
_VARIABLE = 0x0009
```
and so on. I have used an underscore in front of the name to point out that somehow those fields are "private" for the module, but I really don't know if this is typical or advisable, not even how much Pythonic. (Also, I'll probably ditch numbers in favour of strings, because during debugging they are much more readable.)

Tokens are returned as named tuples.
```
from collections import namedtuple
Token = namedtuple('Token', ['value', 'type'])
# so that e.g. somewhere in a function/method I can write...
t = Token(n, _INTEGER)
# ...and return it properly
```
I have used named tuples because the tokenizer's client code (e.g. the parser) seems a little clearer while using names (e.g. token.value) instead of indexes (e.g. token[0]).

Finally, I've noticed that sometimes, especially writing tests, I prefer to pass a string to the tokenizer instead of a file object. I call it a "reader", and have a specific method to open it and let the tokenizer access it through the same interface.
```
def open_reader(self, source):
    """
    Produces a file object from source.
    The source can be either a file object already, or a string.
    """
    if hasattr(source, 'read'):
        return source
    else:
        from io import StringIO
        return StringIO(source)
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
傲寒

2020-12-30 07:48

This being a late answer, there is now something in the official documentation: Writing a tokenizer with the re standard library. This is content in the Python 3 documentation that isn't in the Py 2.7 docs. But it is still applicable to older Pythons.

This includes both short code, easy setup, and writing a generator as several answers here have proposed.

If the docs are not Pythonic, I don't know what is :-)

0 讨论(0)
发布评论:

提交评论
- 加载中...

上一页 1 2