Question
I want to create a tokenizer for source files (e.g. Java or C++) in Python. I came across Pygments and in particular these lexers, but I could not find examples in the documentation or online showing how to use a lexer.
I am wondering whether it is possible to use Pygments from Python to get the tokens and their positions for a given source file.
I am struggling with the very basics here, so if someone could offer even a small chunk of code detailing the above, it would be much appreciated.
Answer 1:
If you look at the source of Pygments' highlight function, essentially what it does is pass the source text to a lexer instance via its get_tokens method, which yields a stream of tokens. Those tokens are then handed to the formatter. Since you want the tokens themselves rather than formatted output, you only need the first part.
So to use the C++ lexer (where src is a string containing your C++ source code):
from pygments.lexers.c_cpp import CppLexer
lexer = CppLexer()
tokens = lexer.get_tokens(src)
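Each token is a (token_type, value) pair. As a minimal self-contained sketch (the src string here is just an illustrative placeholder):
from pygments.lexers.c_cpp import CppLexer

src = 'int main() { return 0; }'  # placeholder C++ source
lexer = CppLexer()

# get_tokens yields (token_type, value) pairs
for token_type, value in lexer.get_tokens(src):
    print(token_type, repr(value))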
Of course, instead of importing the desired lexer directly, you could look it up or guess it by using one of get_lexer_by_name, get_lexer_for_filename, get_lexer_for_mimetype, guess_lexer, or guess_lexer_for_filename. For example:
from pygments.lexers import get_lexer_by_name
lexer = get_lexer_by_name('c++')  # returns a ready-to-use lexer instance
tokens = lexer.get_tokens(src)
Whether the returned tokens provide you with what you want is another matter. You'll have to try it and see.
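Note that get_tokens only yields (token_type, value) pairs. Since you also want each token's position, Pygments lexers additionally provide get_tokens_unprocessed, which yields (index, token_type, value) tuples, where index is the token's character offset in the source. A minimal sketch (src is again a placeholder):
from pygments.lexers import get_lexer_by_name

src = 'int x = 42;'  # placeholder C++ source
lexer = get_lexer_by_name('c++')

# get_tokens_unprocessed yields (index, token_type, value) tuples;
# index is the starting character offset of the token in src
for index, token_type, value in lexer.get_tokens_unprocessed(src):
    print(index, token_type, repr(value))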
Answer 2:
If you are going to tokenize Python code, you probably want to use the standard-library tokenize module: https://docs.python.org/2/library/tokenize.html. Otherwise, PyParsing lets you build lexers that are easy to understand.
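For instance, a minimal sketch with tokenize (Python 3; the src string is just an illustrative placeholder):
import io
import tokenize

src = 'x = 1 + 2\n'  # placeholder Python source

# generate_tokens yields TokenInfo tuples carrying the token type,
# its text, and its (row, column) start/end positions
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tok.type, tok.string, tok.start, tok.end)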
Source: https://stackoverflow.com/questions/36801263/tokenizer-with-pygments-in-python