Howto clean comments from raw sql file

后端 未结 5 1264
时光说笑
时光说笑 2021-02-19 19:50

I have problem with cleaning comments and empty lines from already existing sql file. The file has over 10k lines so cleaning it manually is not an option.

I have a litt

相关标签:
5条回答
  • 2021-02-19 20:24

    This is an extend of samplebias answer that work with your example :

    import sqlparse
    
    sql_example = """--comment
    SELECT * from test;
    INSERT INTO test VALUES ('
    -- test
    a
    ');
    """
    
    new_sql = []
    
    for statement in sqlparse.parse(sql_example):
        new_tockens = [stm for stm in statement.tokens 
                       if not isinstance(stm, sqlparse.sql.Comment)]
    
        new_statement = sqlparse.sql.TokenList(new_tockens)
        new_sql.append(new_statement.to_unicode())
    
    print sqlparse.format("\n".join(new_sql))
    

    Output:

    SELECT * from test;
    
    INSERT INTO test VALUES ('
    -- test
    a
    ');
    
    0 讨论(0)
  • 2021-02-19 20:28

    Try the sqlparse module.

    Updated example: leaving comments inside insert values, and comments within CREATE FUNCTION blocks. You can tweak further to tune the behavior:

    import sqlparse
    from sqlparse import tokens
    
    queries = '''
    CREATE FUNCTION func1(a integer) RETURNS void
        LANGUAGE plpgsql
            AS $$
            BEGIN
                    -- comment
           END;
           $$;
    SELECT -- comment
    * FROM -- comment
    TABLE foo;
    -- comment
    INSERT INTO foo VALUES ('a -- foo bar');
    INSERT INTO foo
    VALUES ('
    a 
    -- foo bar'
    );
    
    '''
    
    IGNORE = set(['CREATE FUNCTION',])  # extend this
    
    def _filter(stmt, allow=0):
        ddl = [t for t in stmt.tokens if t.ttype in (tokens.DDL, tokens.Keyword)]
        start = ' '.join(d.value for d in ddl[:2])
        if ddl and start in IGNORE:
            allow = 1
        for tok in stmt.tokens:
            if allow or not isinstance(tok, sqlparse.sql.Comment):
                yield tok
    
    for stmt in sqlparse.split(queries):
        sql = sqlparse.parse(stmt)[0]
        print sqlparse.sql.TokenList([t for t in _filter(sql)])
    

    Output:

    CREATE FUNCTION func1(a integer) RETURNS void
        LANGUAGE plpgsql
            AS $$
            BEGIN
                    -- comment
           END;
           $$;
    
    SELECT * FROM TABLE foo;
    
    INSERT INTO foo VALUES ('a -- foo bar');
    
    INSERT INTO foo
    VALUES ('
    a
    -- foo bar'
    );
    
    0 讨论(0)
  • 2021-02-19 20:32

    It is possible to do it with regular expressions. First you have to split the file by strings and after this you can split the file by comments. The following Perl program does it:

    #! /usr/bin/perl -w
    
    # Read hole file.
    my $file = join ('', <>);
    
    # Split by strings including the strings.
    my @major_parts = split (/('(?:[^'\\]++|\\.)*+')/, $file);
    
    foreach my $part (@major_parts) {
        if ($part =~ /^'/) {
            # Print the part if it is a string.
            print $part; 
        }
        else {
            # Split by comments removing the comments
            my @minor_parts = split (/^--.*$/m, $part);
            # Print the remaining parts.
            print join ('', @minor_parts);
        }
    }
    
    0 讨论(0)
  • 2021-02-19 20:37

    Adding an updated answer :)

    import sqlparse
    
    sql_example = """--comment
    SELECT * from test;
    INSERT INTO test VALUES ('
    -- test
    a
    ');
     """
    print sqlparse.format(sql_example, strip_comments=True).strip()
    

    Output:

    SELECT * from test;
    INSERT INTO test VALUES ('
    -- test
    a
    ');
    

    It achieves the same result but also covers all other corner cases and more concise

    0 讨论(0)
  • 2021-02-19 20:46
    # Remove comments i.e. lines beginning with whitespace and '--' (using multi-line flag)
    re.sub('^\s*--.*\n?', '', query, flags=re.MULTILINE)
    

    Regex string explained:

    • ^ start of line
    • \s whitespace
    • \s* zero or more whitespace characters
    • -- two hypens (static string pattern)
    • .* zero or more of any characters (i.e. the rest of the line)
    • \n newline character
    • ? end of string
    • flags = re.M is the multiline modifier

    "When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline)"

    See the Python regular expressions documentation for more details:

    https://docs.python.org/3/library/re.html

    0 讨论(0)
提交回复
热议问题