Simple tokenizer for C++ in Python

Question

I'm struggling to find a Python library or script that tokenizes C++ source, that is, finds specific tokens such as function definition names, variable names, keywords, etc.

I have managed to find keywords, whitespace, etc. using something like this, but function and class definition names have proved quite a challenge. I was hoping to use a pre-existing script; I explored Pygments with no success. Its lexer seems ideal for what I want, but I have no idea how to use it from Python or how to get the position of each token it finds.

For example I am looking at doing something like that:

int fac(int n)
{
    return (n>1) ? n*fac(n-1) : 1;
}

from the source code above I would like to get:

function_name: 'fac' at position (x, y)
variable_name: 'n' at position (x, y+8)

EDITED: Any suggestions will be appreciated, since I am in the dark regarding tokenization and parsing of C++.

Answer 1:

Eli Bendersky is a smart guy, and sometimes active here on SO. He's got a blog post on this issue which I'll refer you directly to: Parsing C++ in Python with Clang.

Because things disappear, here's the takeaway:

Eli Bendersky wrote a C language (not C++) parser in Python, called pycparser. People keep asking him if he's going to add support for C++. He is not. He recommends instead that people use the Python bindings for libclang to get access to "a C API that the Clang team vows to keep relatively stable, allowing the user to examine parsed code at the level of an abstract syntax tree (AST)".

You can find the bindings separately on PyPI here. Note though that you'll need Clang (specifically the libclang shared library) installed, so you may just want to point your PYTHONPATH directly at the install location.
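As a rough illustration of that approach, here is a minimal sketch that asks libclang for the function and variable declarations in a source string, along with their line and column. It assumes the `clang.cindex` bindings are importable and can locate the libclang shared library; the helper name `find_declarations`, the file name `example.cpp`, and the `function_name`/`variable_name` labels are made up for this example and are not part of any library.

```python
def find_declarations(source, filename="example.cpp"):
    """Return (label, name, line, column) tuples for declarations in `source`.

    `find_declarations` and `filename` are hypothetical names for this sketch.
    """
    # Imported inside the function so this module still loads on machines
    # where the libclang bindings are not installed.
    from clang.cindex import CursorKind, Index

    index = Index.create()
    # unsaved_files lets libclang parse an in-memory string under a
    # file name that does not need to exist on disk.
    tu = index.parse(
        filename,
        args=["-std=c++17"],
        unsaved_files=[(filename, source)],
    )

    found = []
    for cursor in tu.cursor.walk_preorder():
        if cursor.kind == CursorKind.FUNCTION_DECL:
            label = "function_name"
        elif cursor.kind in (CursorKind.PARM_DECL, CursorKind.VAR_DECL):
            label = "variable_name"
        else:
            continue
        location = cursor.location
        found.append((label, cursor.spelling, location.line, location.column))
    return found
```

Called on the `fac` snippet from the question, this should report `fac` as a function and `n` as a variable, each with a 1-based line and column, which is the kind of output the question asks for. Other declaration kinds (classes, methods, fields) have their own `CursorKind` values and would need extra branches.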

Answer 2:

You're struggling to find a Python library to do what you want because what you want is fundamentally impossible for a tokenizer to do.

I have managed to find keywords, whitespaces etc. using something like this but I found it quite a challenge for function/class definition names etc

You mean like this:

foo = 3
def foo():pass

What is foo? All a tokenizer can (or should) tell you is that foo is an identifier. Its context tells you whether it's a variable or a function declaration. Recognizing that context takes a parser, which works from a context-free grammar. A standard lexer only handles regular languages, and those are a strictly smaller class than the context-free languages.
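To see this with a real tokenizer, here is a small sketch using Python's standard-library `tokenize` module on the two-line snippet above. The helper `name_tokens` is a made-up name for this example; it just collects the token type and position of every token with a given spelling.

```python
import io
import tokenize

SOURCE = "foo = 3\ndef foo(): pass\n"


def name_tokens(source, name):
    """Return (token type name, (line, column)) for each token spelled `name`."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.start)
        for tok in tokenize.generate_tokens(readline)
        if tok.string == name
    ]


for kind, position in name_tokens(SOURCE, "foo"):
    print(kind, position)
```

Both occurrences come back with the same token type, `NAME`; only their positions differ. Nothing in the token stream says that the first is a variable and the second a function.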

Try a parser: here's one in python

Normally I'd try to provide links here to distinguish between the topics, but this is too broad for a single good link. If you're interested, start with any standard compiler text. Elsewhere on SE, this question pops up as a theoretical question and, in some form, as a famous question about HTML.

Once you realize that tokenizers are (usually) built (largely) on regular expressions, it becomes more obvious why your task is not going to end happily.
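A minimal sketch of such a regex-based tokenizer, applied to the `fac` example from the question, might look like the following. Everything here is an assumption made for illustration: the `Token` tuple, the tiny `KEYWORDS` set, and the token classes cover only a sliver of C++ (no comments, strings, preprocessor directives, or multi-character operators).

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    line: int    # 1-based
    column: int  # 1-based


# Deliberately tiny keyword set; real C++ has many more.
KEYWORDS = {"int", "return", "if", "else", "for", "while", "class", "void"}

TOKEN_RE = re.compile(
    r"""
    (?P<whitespace>\s+)
  | (?P<number>\d+)
  | (?P<identifier>[A-Za-z_]\w*)
  | (?P<punct>[{}()\[\];,?:<>=+\-*/%!&|.~^])
    """,
    re.VERBOSE,
)


def tokenize_cpp(source: str) -> Iterator[Token]:
    """Yield non-whitespace tokens with their line and column."""
    line, line_start, pos = 1, 0, 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, text = match.lastgroup, match.group()
        if kind == "whitespace":
            # Track line breaks so later tokens get the right line and column.
            if "\n" in text:
                line += text.count("\n")
                line_start = match.start() + text.rfind("\n") + 1
        else:
            if kind == "identifier" and text in KEYWORDS:
                kind = "keyword"
            yield Token(kind, text, line, match.start() - line_start + 1)
        pos = match.end()


SOURCE = "int fac(int n)\n{\n    return (n>1) ? n*fac(n-1) : 1;\n}\n"

for token in tokenize_cpp(SOURCE):
    if token.kind == "identifier":
        print(token)
```

This finds `fac` at line 1, column 5 and `n` at line 1, column 13, but it labels both as plain identifiers, and it gives the recursive call to `fac` on line 3 exactly the same label as the definition. Telling a definition from a use, or a function from a variable, is where a parser has to take over.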

Now that you know the terminology, I think you'll find this SO article useful, which recommends gcc-ml. I don't know how up-to-date it is, but it's the type of program you're looking for.

Source: https://stackoverflow.com/questions/36802298/simple-tokenizer-for-c-in-python

Tags

python

tokenize