Script to remove Python comments/docstrings

Submitted anonymously (unverified) on 2019-12-03 01:53:01

Question:

Is there a Python script or tool available which can remove comments and docstrings from Python source?

It should take care of cases like:

""" aas """ def f():     m = {         u'x':             u'y'         } # faake docstring ;)     if 1:         'string' >> m     if 2:         'string' , m     if 3:         'string' > m

So far, I have come up with a simple script that uses the tokenize module and removes comment tokens. It seems to work pretty well, except that I am not able to remove docstrings in all cases. See if you can improve it to remove docstrings as well.

import cStringIO
import tokenize

def remove_comments(src):
    """
    This reads tokens using tokenize.generate_tokens and recombines them
    using tokenize.untokenize, skipping comment/docstring tokens in between.
    """
    f = cStringIO.StringIO(src)
    class SkipException(Exception): pass
    processed_tokens = []
    last_token = None
    # go through all the tokens and try to skip comments and docstrings
    for tok in tokenize.generate_tokens(f.readline):
        t_type, t_string, t_srow_scol, t_erow_ecol, t_line = tok

        try:
            if t_type == tokenize.COMMENT:
                raise SkipException()

            elif t_type == tokenize.STRING:

                if last_token is None or last_token[0] in [tokenize.INDENT]:
                    # FIXME: this may remove valid strings too?
                    #raise SkipException()
                    pass

        except SkipException:
            pass
        else:
            processed_tokens.append(tok)

        last_token = tok

    return tokenize.untokenize(processed_tokens)
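A quick way to try it (the file name here is only illustrative): comments come out stripped, but the module docstring and the bare strings survive, which is exactly the remaining problem:

# Illustrative test harness; "sample.py" stands for any file containing the sample above.
with open("sample.py") as f:
    print(remove_comments(f.read()))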

Also, I would like to test it on a very large collection of scripts with good unit test coverage. Can you suggest such an open source project?

Answer 1:

This does the job:

""" Strip comments and docstrings from a file. """  import sys, token, tokenize  def do_file(fname):     """ Run on just one file.      """     source = open(fname)     mod = open(fname + ",strip", "w")      prev_toktype = token.INDENT     first_line = None     last_lineno = -1     last_col = 0      tokgen = tokenize.generate_tokens(source.readline)     for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokgen:         if 0:   # Change to if 1 to see the tokens fly by.             print("%10s %-14s %-20r %r" % (                 tokenize.tok_name.get(toktype, toktype),                 "%d.%d-%d.%d" % (slineno, scol, elineno, ecol),                 ttext, ltext                 ))         if slineno > last_lineno:             last_col = 0         if scol > last_col:             mod.write(" " * (scol - last_col))         if toktype == token.STRING and prev_toktype == token.INDENT:             # Docstring             mod.write("#--")         elif toktype == tokenize.COMMENT:             # Comment             mod.write("##\n")         else:             mod.write(ttext)         prev_toktype = toktype         last_col = ecol         last_lineno = elineno  if __name__ == '__main__':     do_file(sys.argv[1])

I'm leaving stub comments in place of the docstrings and comments, since that simplifies the code. If you remove them completely, you also have to get rid of the indentation before them.
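For reference, a small usage sketch (the file name and sample are illustrative, not from the answer); the stripped copy is written next to the original with a ",strip" suffix:

# Hypothetical usage of do_file(); "example.py" is an illustrative file name.
sample = 'def f():\n    """Docstring."""\n    return 1  # trailing comment\n'
with open("example.py", "w") as out:
    out.write(sample)

do_file("example.py")                       # writes the stripped copy to example.py,strip
print(open("example.py,strip").read())
# Roughly:
#   def f():
#       #--
#       return 1  ##
# plus a stray blank line, because the "##\n" stub already ends the line and
# the newline token that follows the comment is still written afterwards.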



Answer 2:

I'm the author of the "mygod, he has written a python interpreter using regex..." (i.e. pyminifier) mentioned in the link below =).
I just wanted to chime in and say that I've improved the code quite a bit using the tokenize module (which I discovered thanks to this question =) ).

You'll be happy to note that the code no longer relies so much on regular expressions and uses tokenize to great effect. Anyway, here's the remove_comments_and_docstrings() function from pyminifier
(note: it works properly with the edge cases that the previously posted code breaks on):

import cStringIO, tokenize

def remove_comments_and_docstrings(source):
    """
    Returns 'source' minus comments and docstrings.
    """
    io_obj = cStringIO.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        # The following two conditionals preserve indentation.
        # This is necessary because we're not using tokenize.untokenize()
        # (because it spits out code with copious amounts of oddly-placed
        # whitespace).
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        # Remove comments:
        if token_type == tokenize.COMMENT:
            pass
        # This series of conditionals removes docstrings:
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                # This is likely a docstring; double-check we're not inside an operator:
                if prev_toktype != tokenize.NEWLINE:
                    # Note regarding NEWLINE vs NL: The tokenize module
                    # differentiates between newlines that end a logical
                    # statement and newlines inside of operators such as
                    # parens, brackets, and curly braces.  Newlines that end
                    # a statement are NEWLINE and newlines inside of operators
                    # (or blank lines) are NL.
                    # Catch whole-module docstrings:
                    if start_col > 0:
                        # Unlabelled indentation means we're inside an operator
                        out += token_string
                    # Note regarding the INDENT token: The tokenize module does
                    # not label indentation inside of an operator (parens,
                    # brackets, and curly braces) as actual indentation.
                    # For example:
                    # def foo():
                    #     "The spaces before this docstring are tokenize.INDENT"
                    #     test = [
                    #         "The spaces before this string do not get a token"
                    #     ]
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    return out
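As a quick check, here is a hypothetical driver that feeds the function part of the tricky sample from the question (kept in the same Python 2 / cStringIO style as the code above):

# Hypothetical driver; the sample is (part of) the edge-case snippet from the question.
sample = '''""" aas """
def f():
    m = {
        u'x':
            u'y'
        } # faake docstring ;)
    if 1:
        'string' >> m
'''
print(remove_comments_and_docstrings(sample))  # the module docstring and the comment are gone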


Answer 3:

This recipe claims to do what you want, and a few other things too.



Answer 4:

Try testing each chunk of tokens ending with NEWLINE. I believe the correct pattern for a docstring (including cases where it serves as a comment but isn't assigned to __doc__) is, assuming the match is performed from the start of the file or after a NEWLINE:

( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE

This should handle all the tricky cases: string concatenation, line continuation, module/class/function docstrings, and a comment on the same line after the string. Note that there is a difference between NL and NEWLINE tokens, so we don't need to worry about a string that sits on its own line inside an expression.
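A minimal sketch of that chunk test, assuming Python 3's tokenize and io modules (the helper names are illustrative, not from the answer):

import io
import tokenize

def is_docstring_chunk(chunk):
    """True if a token chunk matches ( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE."""
    types = [tok.type for tok in chunk if tok.type != tokenize.NL]
    i = 0
    if i < len(types) and types[i] == tokenize.DEDENT:       # DEDENT+
        while i < len(types) and types[i] == tokenize.DEDENT:
            i += 1
    elif i < len(types) and types[i] == tokenize.INDENT:      # INDENT?
        i += 1
    if i == len(types) or types[i] != tokenize.STRING:        # STRING+
        return False
    while i < len(types) and types[i] == tokenize.STRING:
        i += 1
    if i < len(types) and types[i] == tokenize.COMMENT:       # COMMENT?
        i += 1
    return i == len(types) - 1 and types[i] == tokenize.NEWLINE

def docstring_chunks(source):
    """Yield the chunks (split at NEWLINE) that look like docstrings."""
    chunk = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        chunk.append(tok)
        if tok.type == tokenize.NEWLINE:
            if is_docstring_chunk(chunk):
                yield chunk
            chunk = []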



Answer 5:

I think the best way is to use the ast module.
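For what it's worth, a minimal sketch of that idea, assuming Python 3.9+ for ast.unparse() (the function name is illustrative); comments disappear automatically because they never make it into the AST:

import ast

def strip_docstrings(source):
    """Parse, drop docstring statements, and unparse (loses original formatting)."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            body = node.body
            if (body and isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)):
                # Replace the docstring with `pass` so the body never becomes empty.
                body[0] = ast.Pass()
    return ast.unparse(tree)

The trade-off is that ast.unparse() rewrites the whole file in its own formatting, so this is only appropriate when preserving the original layout does not matter.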



Answer 6:

I've just used the code given by Dan McDougall, and I've found two problems.

  1. There were too many empty new lines, so I decided to remove a line every time there were two consecutive empty lines.
  2. When the Python code was processed, all spaces were missing (except indentation), so things like "import Anything" turned into "importAnything", which caused problems. I added spaces before and after the reserved Python words that needed them. I hope I didn't make any mistakes there.

I think I have fixed both issues by adding a few more lines (before the return):

# Removing unneeded newlines from string
buffered_content = cStringIO.StringIO(content) # Takes the string generated by Dan McDougall's code as input
content_without_newlines = ""
previous_token_type = tokenize.NEWLINE
for tokens in tokenize.generate_tokens(buffered_content.readline):
    token_type = tokens[0]
    token_string = tokens[1]
    if previous_token_type == tokenize.NL and token_type == tokenize.NL:
        pass
    else:
        # add necessary spaces
        prev_space = ''
        next_space = ''
        if token_string in ['and', 'as', 'or', 'in', 'is']:
            prev_space = ' '
        if token_string in ['and', 'del', 'from', 'not', 'while', 'as', 'elif', 'global', 'or', 'with', 'assert', 'if', 'yield', 'except', 'import', 'print', 'class', 'exec', 'in', 'raise', 'is', 'return', 'def', 'for', 'lambda']:
            next_space = ' '
        content_without_newlines += prev_space + token_string + next_space # This will be our new output!
    previous_token_type = token_type

