Extracting comments from Python Source Code

执念已碎 2021-01-04 10:09

I'm trying to write a program that extracts comments from code a user enters. I tried to use regex, but found it difficult to write.

Then I found a post here. The ans

1 answer
  •  天涯浪人
    2021-01-04 10:57

    Answer for more general cases (extracting from modules, functions):

    Modules:

    The documentation specifies that one needs to provide a callable which exposes the same interface as the readline() method of built-in file objects. This suggests creating an object that provides that method.

    In the case of a module, we can just open the module as a normal file and pass in its readline method. This is the key: the argument you pass is the readline method itself, not the file object.

    Given a small scrpt.py file with:

    # My amazing foo function.
    def foo():
        """ docstring """
        # I will print
        print "Hello"
        return 0   # Return the value
    
    # Maaaaaaain
    if __name__ == "__main__":
        # this is main
        print "Main" 
    

    We will open it as we do all files:

    fileObj = open('scrpt.py', 'r')
    

    This file object now has a method called readline (because it is a file object) which we can safely pass to tokenize.generate_tokens and create a generator.

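    tokenize.generate_tokens returns a generator that yields one 5-tuple per token (a named tuple in Python 3) holding the token type, the token string, its start and end positions, and the line it came from. In Python 3, tokenize.generate_tokens still accepts a readline that returns str, while tokenize.tokenize requires readline to return bytes, so for the latter you'll need to open the file in 'rb' mode. Here's a small demo: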

    import tokenize

    for toktype, tok, start, end, line in tokenize.generate_tokens(fileObj.readline):
        # we can also use token.tok_name[toktype] (from the token module)
        # instead of the literal 'COMMENT'
        if toktype == tokenize.COMMENT:
            print 'COMMENT' + " " + tok
    

    Notice how we pass the fileObj.readline method to it. This will now print:

    COMMENT # My amazing foo function.
    COMMENT # I will print
    COMMENT # Return the value
    COMMENT # Maaaaaaain
    COMMENT # this is main
    

    So all comments are detected, regardless of position. Docstrings are excluded, of course, because the tokenizer reports them as STRING tokens rather than COMMENT tokens.
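For readers on Python 3, the same loop can be sketched as below. This is a minimal sketch, not part of the original answer: the helper name comments_in_file is hypothetical, and it assumes the bytes-based tokenize.tokenize, which is why the file is opened in 'rb' mode.

```python
import tokenize


def comments_in_file(path):
    """Return the text of every COMMENT token in the file at `path`."""
    comments = []
    # tokenize.tokenize expects readline to return bytes, hence 'rb'.
    with open(path, "rb") as source_file:
        for token in tokenize.tokenize(source_file.readline):
            # Python 3 yields named tuples: type, string, start, end, line.
            if token.type == tokenize.COMMENT:
                comments.append(token.string)
    return comments
```

Calling comments_in_file('scrpt.py') should return the comment strings in file order. The tokenizer does not execute or compile the file, it only splits it into tokens.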

    Functions:

    You could achieve a similar result without open, though I can't really think of a case that needs it. Nonetheless, for completeness' sake, here is another way of doing it. In this scenario you'll need two additional modules, inspect and StringIO (io.StringIO in Python 3):

    Let's say you have the following function:

    def bar():
        # I am bar
        print "I really am bar"
        # bar bar bar baaaar
        # (bar)
        return "Bar"
    

    You need a file-like object which has a readline method to use it with tokenize. Well, you can create a file-like object from a str using StringIO.StringIO, and you can get a str representing the source of the function with inspect.getsource(func). In code:

    import inspect
    import StringIO

    funcText = inspect.getsource(bar)
    funcFile = StringIO.StringIO(funcText)
    

    Now we have a file-like object representing the function, and it has the readline method we want. We can just reuse the earlier loop, replacing fileObj.readline with funcFile.readline. The output we get now is of a similar nature:

    COMMENT # I am bar
    COMMENT # bar bar bar baaaar
    COMMENT # (bar)
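
A Python 3 version of the function case might look like the following sketch. The helper names comments_in_source and comments_in_function are hypothetical, and the textwrap.dedent step is an addition of this sketch (so that methods and nested functions start at column 0), not something the original answer does.

```python
import inspect
import io
import textwrap
import tokenize


def comments_in_source(source):
    """Return the comment strings found in a str of Python source."""
    # io.StringIO wraps the str in a file-like object; its readline
    # method returns str, which is what generate_tokens expects.
    readline = io.StringIO(source).readline
    return [
        token.string
        for token in tokenize.generate_tokens(readline)
        if token.type == tokenize.COMMENT
    ]


def comments_in_function(func):
    """Return the comments in the source of `func`.

    inspect.getsource has to locate the source text, so this can raise
    OSError when the source is unavailable and TypeError for built-ins.
    """
    # Dedent so that methods and nested functions start at column 0.
    return comments_in_source(textwrap.dedent(inspect.getsource(func)))
```

With the bar function above saved in a module, comments_in_function(bar) should give the same three comments as the Python 2 loop.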
    

    As an aside, if you really want to build a custom solution with re, take a look at the source of the tokenize.py module. It defines patterns for comments (r'#[^\r\n]*'), names, et cetera, loops through the lines with readline, and searches each line for those patterns. Thankfully, it's not too complex after you look at it for a while :-).
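
To illustrate why the comment pattern alone is not enough, here is a small Python 3 comparison. It is only a sketch with hypothetical helper names: the regex is applied naively to raw text, whereas the tokenizer matches it only at positions where a token can start, so a '#' inside a string literal is not mistaken for a comment.

```python
import io
import re
import tokenize

# The comment pattern quoted above; applied on its own it cannot tell
# whether a '#' sits inside a string literal.
COMMENT_RE = re.compile(r"#[^\r\n]*")


def comments_by_regex(source):
    """Naive approach: treat every '#' as the start of a comment."""
    return COMMENT_RE.findall(source)


def comments_by_tokenize(source):
    """Tokenizer approach: return only real COMMENT tokens."""
    return [
        token.string
        for token in tokenize.generate_tokens(io.StringIO(source).readline)
        if token.type == tokenize.COMMENT
    ]


line = 's = "not # a comment"  # real comment\n'
print(comments_by_regex(line))     # ['# a comment"  # real comment']
print(comments_by_tokenize(line))  # ['# real comment']
```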


    Answer for function extract (Update):

    You've created an object with StringIO that provides the interface, but you haven't passed that interface (readline) to tokenize.generate_tokens; instead, you passed the full object (stringio).

    Additionally, in your else clause a TypeError is going to be raised because untokenize expects an iterable as input. Making the following changes, your function works fine:

    import StringIO
    import tokenize

    def extract(code):
        res = []
        comment = None
        stringio = StringIO.StringIO(code)
        # pass in stringio.readline to generate_tokens
        for toktype, tokval, begin, end, line in tokenize.generate_tokens(stringio.readline):
            if toktype != tokenize.COMMENT:
                res.append((toktype, tokval))
            else:
                # wrap the (toktype, tokval) tuple in a list
                print tokenize.untokenize([(toktype, tokval)])
        return tokenize.untokenize(res)
    

    Called as expr = extract('a=1+2#A comment'), the function prints the comment and keeps the remaining expression in expr:

    expr = extract('a=1+2#A comment')
    #A comment
    
    print expr
    'a =1 +2 '
    

    Furthermore, as mentioned earlier, Python 3 houses StringIO in the io module, so there is no separate StringIO module to import; use io.StringIO instead.
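
For Python 3, a variant of extract could be sketched as follows. The name split_comments is hypothetical, and unlike the function above it returns the comments instead of printing them. It also keeps the full token tuples, which lets untokenize use the recorded positions rather than guess the spacing, so whitespace may be left where a comment used to be.

```python
import io
import tokenize


def split_comments(code):
    """Return (code_without_comments, comments) for a str of source."""
    kept = []
    comments = []
    # Pass the readline method, not the StringIO object itself.
    for token in tokenize.generate_tokens(io.StringIO(code).readline):
        if token.type == tokenize.COMMENT:
            comments.append(token.string)
        else:
            # Keep the full 5-tuples so untokenize can restore the
            # original column positions of the remaining tokens.
            kept.append(token)
    # With no ENCODING token in the input, untokenize returns a str.
    return tokenize.untokenize(kept), comments
```

A call such as split_comments('a=1+2#A comment\n') should give back the assignment as code that still runs, plus a one-element list holding the comment.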
