Tokenizer with Pygments in Python

大城市里の小女人 submitted on 2019-12-08 07:50:58

Question


I want to create a tokenizer for source files (e.g. Java or C++) in Python. I came across Pygments and in particular these lexers, but I could not find examples in the documentation or online of how to use a lexer.

I am wondering whether it is possible to use Pygments from Python to get the tokens and their positions for a given source file.

I am struggling with the very basics here, so if someone could offer even a small chunk of code demonstrating the above, it would be much appreciated.


Answer 1:


If you look at the source of Pygments' highlight function, essentially what it does is pass the source text to a lexer instance via the get_tokens method, which yields (token type, value) pairs. Those tokens are then passed to the formatter. Since you want the tokens without the formatting, you only need the first part.

So to use the C++ lexer (where src is a string containing your C++ source code):

from pygments.lexers.c_cpp import CppLexer

lexer = CppLexer()
tokens = lexer.get_tokens(src)

Of course, instead of importing the desired lexer directly, you can look it up or guess it using one of get_lexer_by_name, get_lexer_for_filename, get_lexer_for_mimetype, guess_lexer, or guess_lexer_for_filename. For example:

from pygments.lexers import get_lexer_by_name

lexer = get_lexer_by_name('c++')  # returns a ready-to-use lexer instance
tokens = lexer.get_tokens(src)

Whether the returned tokens provide what you want is another matter. You'll have to try it and see.
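One caveat: get_tokens gives you only (token type, value) pairs, not positions. If you also need the position of each token, lexers provide get_tokens_unprocessed, which yields (index, token type, value) tuples where index is the character offset into the source. A minimal sketch (the sample source string is just for illustration):

```python
from pygments.lexers.c_cpp import CppLexer

src = 'int main() { return 0; }'
lexer = CppLexer()

# get_tokens_unprocessed yields (index, token_type, value) tuples,
# where index is the character offset of the token in src.
for index, token_type, value in lexer.get_tokens_unprocessed(src):
    print(index, token_type, repr(value))
```

From the character offset you can recover line and column numbers yourself by counting newlines in src up to that index.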




Answer 2:


If you are tokenizing Python code specifically, you probably want the standard library's tokenize module: https://docs.python.org/2/library/tokenize.html. Otherwise, PyParsing lets you build lexers that are easy to understand.
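For the tokenize route, a minimal sketch (Python 3 API; the sample source string is illustrative): generate_tokens takes a readline callable and yields named tuples carrying the token type, string, and (row, column) start/end positions.

```python
import io
import tokenize

src = "x = 1 + 2\n"

# generate_tokens consumes a readline callable; each token records
# its type, text, and (line, column) start/end positions.
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start)
```

Unlike Pygments, this gives you line/column positions directly, but it only works for Python source.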



Source: https://stackoverflow.com/questions/36801263/tokenizer-with-pygments-in-python
