Regex to remove single-line SQL comments (--)

前端 未结 7 987
悲哀的现实
悲哀的现实 2021-01-15 08:52

Question:

Can anybody give me a working regex expression (C#/VB.NET) that can remove single line comments from a SQL statement ?

I mean these comments:

相关标签:
7条回答
  • 2021-01-15 09:15

    I will disappoint all of you. This can't be done with regular expressions. Sure, it's easy to find comments not in a string (that even the OP could do), the real deal is comments in a string. There is a little hope of the look arounds, but that's still not enough. By telling that you have a preceding quote in a line won't guarantee anything. The only thing what guarantees you something is the oddity of quotes. Something you can't find with regular expression. So just simply go with non-regular-expression approach.

    EDIT: Here's the c# code:

            String sql = "--this is a test\r\nselect stuff where substaff like '--this comment should stay' --this should be removed\r\n";
            char[] quotes = { '\'', '"'};
            int newCommentLiteral, lastCommentLiteral = 0;
            while ((newCommentLiteral = sql.IndexOf("--", lastCommentLiteral)) != -1)
            {
                int countQuotes = sql.Substring(lastCommentLiteral, newCommentLiteral - lastCommentLiteral).Split(quotes).Length - 1;
                if (countQuotes % 2 == 0) //this is a comment, since there's an even number of quotes preceding
                {
                    int eol = sql.IndexOf("\r\n") + 2;
                    if (eol == -1)
                        eol = sql.Length; //no more newline, meaning end of the string
                    sql = sql.Remove(newCommentLiteral, eol - newCommentLiteral);
                    lastCommentLiteral = newCommentLiteral;
                }
                else //this is within a string, find string ending and moving to it
                {
                    int singleQuote = sql.IndexOf("'", newCommentLiteral);
                    if (singleQuote == -1)
                        singleQuote = sql.Length;
                    int doubleQuote = sql.IndexOf('"', newCommentLiteral);
                    if (doubleQuote == -1)
                        doubleQuote = sql.Length;
    
                    lastCommentLiteral = Math.Min(singleQuote, doubleQuote) + 1;
    
                    //instead of finding the end of the string you could simply do += 2 but the program will become slightly slower
                }
            }
    
            Console.WriteLine(sql);
    

    What this does: find every comment literal. For each, check if it's within a comment or not, by counting the number of quotes between the current match and the last one. If this number is even, then it's a comment, thus remove it (find first end of line and remove whats between). If it's odd, this is within a string, find the end of the string and move to it. Rgis snippet is based on a wierd SQL trick: 'this" is a valid string. Even tho the 2 quotes differ. If it's not true for your SQL language, you should try a completely different approach. I'll write a program to that too if that's the case, but this one's faster and more straightforward.

    0 讨论(0)
  • 2021-01-15 09:16

    You want something like this for the simple case

    -{2,}.*
    

    The -{2,} looks for a dash that happens 2 or more times

    The .* gets the rest of the lines up to the newline

    *But, for the edge cases, it appears that SinistraD is correct in that you cannot catch everything, however here is an article about how this can be done in C# with a combination of code and regex.

    0 讨论(0)
  • 2021-01-15 09:18

    I don't know if C#/VB.net regex is special in some way but traditionally s/--.*// should work.

    0 讨论(0)
  • 2021-01-15 09:19

    This seems to work well for me so far; it even ignores comments within strings, such as SELECT '--not a comment--' FROM ATable

        private static string removeComments(string sql)
        {
            string pattern = @"(?<=^ ([^'""] |['][^']*['] |[""][^""]*[""])*) (--.*$|/\*(.|\n)*?\*/)";
            return Regex.Replace(sql, pattern, "", RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
        }
    

    Note: it is designed to eliminate both /**/-style comments as well as -- style. Remove |/\*(.|\n)*?\*/ to get rid of the /**/ checking. Also be sure you are using the RegexOptions.IgnorePatternWhitespace Regex option!!

    I wanted to be able to handle double-quotes too, but since T-SQL doesn't support them, you could get rid of |[""][^""]*[""] too.

    Adapted from here.

    Note (Mar 2015): In the end, I wound up using Antlr, a parser generator, for this project. There may have been some edge cases where the regex didn't work. In the end I was much more confident with the results having used Antlr, and it's worked well.

    0 讨论(0)
  • 2021-01-15 09:25
    Using System.Text.RegularExpressions;
    
    public static string RemoveSQLCommentCallback(Match SQLLineMatch)
    {
        System.Text.StringBuilder sb = new System.Text.StringBuilder();
        bool open = false; //opening of SQL String found
        char prev_ch = ' ';
    
        foreach (char ch in SQLLineMatch.ToString())
        {
            if (ch == '\'')
            {
                open = !open;
            }
            else if ((!open && prev_ch == '-' && ch == '-'))
            {
                break;
            }
            sb.Append(ch);
            prev_ch = ch;
        }
    
        return sb.ToString().Trim('-');
    }
    

    The code

    public static void Main()
    {
        string sqlText = "WHERE DEPT_NAME LIKE '--Test--' AND START_DATE < SYSDATE -- Don't go over today";
        //for every matching line call callback func
        string result = Regex.Replace(sqlText, ".*--.*", RemoveSQLCommentCallback);
    }
    

    Let's replace, find all the lines that match dash dash comment and call your parsing function for every match.

    0 讨论(0)
  • 2021-01-15 09:27

    As a late solution, the simplest way is to do it using ScriptDom-TSqlParser:

    // https://michaeljswart.com/2014/04/removing-comments-from-sql/
    // http://web.archive.org/web/*/https://michaeljswart.com/2014/04/removing-comments-from-sql/
    public static string StripCommentsFromSQL(string SQL)
    {
        Microsoft.SqlServer.TransactSql.ScriptDom.TSql150Parser parser = 
            new Microsoft.SqlServer.TransactSql.ScriptDom.TSql150Parser(true);
    
        System.Collections.Generic.IList<Microsoft.SqlServer.TransactSql.ScriptDom.ParseError> errors;
    
    
        Microsoft.SqlServer.TransactSql.ScriptDom.TSqlFragment fragments = 
            parser.Parse(new System.IO.StringReader(SQL), out errors);
    
        // clear comments
        string result = string.Join(
          string.Empty,
          fragments.ScriptTokenStream
              .Where(x => x.TokenType != Microsoft.SqlServer.TransactSql.ScriptDom.TSqlTokenType.MultilineComment)
              .Where(x => x.TokenType != Microsoft.SqlServer.TransactSql.ScriptDom.TSqlTokenType.SingleLineComment)
              .Select(x => x.Text));
    
        return result;
    
    }
    

    or instead of using the Microsoft-Parser, you can use ANTL4 TSqlLexer

    or without any parser at all:

    private static System.Text.RegularExpressions.Regex everythingExceptNewLines = 
        new System.Text.RegularExpressions.Regex("[^\r\n]");
    
    
    // http://drizin.io/Removing-comments-from-SQL-scripts/
    // http://web.archive.org/web/*/http://drizin.io/Removing-comments-from-SQL-scripts/
    public static string RemoveComments(string input, bool preservePositions, bool removeLiterals = false)
    {
        //based on http://stackoverflow.com/questions/3524317/regex-to-strip-line-comments-from-c-sharp/3524689#3524689
        var lineComments = @"--(.*?)\r?\n";
        var lineCommentsOnLastLine = @"--(.*?)$"; // because it's possible that there's no \r\n after the last line comment
                                                  // literals ('literals'), bracketedIdentifiers ([object]) and quotedIdentifiers ("object"), they follow the same structure:
                                                  // there's the start character, any consecutive pairs of closing characters are considered part of the literal/identifier, and then comes the closing character
        var literals = @"('(('')|[^'])*')"; // 'John', 'O''malley''s', etc
        var bracketedIdentifiers = @"\[((\]\])|[^\]])* \]"; // [object], [ % object]] ], etc
        var quotedIdentifiers = @"(\""((\""\"")|[^""])*\"")"; // "object", "object[]", etc - when QUOTED_IDENTIFIER is set to ON, they are identifiers, else they are literals
                                                              //var blockComments = @"/\*(.*?)\*/";  //the original code was for C#, but Microsoft SQL allows a nested block comments // //https://msdn.microsoft.com/en-us/library/ms178623.aspx
    
        //so we should use balancing groups // http://weblogs.asp.net/whaggard/377025
        var nestedBlockComments = @"/\*
                             (?>
                             /\*  (?<LEVEL>)      # On opening push level
                             | 
                             \*/ (?<-LEVEL>)     # On closing pop level
                             |
                             (?! /\* | \*/ ) . # Match any char unless the opening and closing strings   
                             )+                         # /* or */ in the lookahead string
                             (?(LEVEL)(?!))             # If level exists then fail
                             \*/";
    
        string noComments = System.Text.RegularExpressions.Regex.Replace(input,
            nestedBlockComments + "|" + lineComments + "|" + lineCommentsOnLastLine + "|" + literals + "|" + bracketedIdentifiers + "|" + quotedIdentifiers,
            me => {
                if (me.Value.StartsWith("/*") && preservePositions)
                    return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks // return new string(' ', me.Value.Length);
         else if (me.Value.StartsWith("/*") && !preservePositions)
                    return "";
                else if (me.Value.StartsWith("--") && preservePositions)
                    return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks
         else if (me.Value.StartsWith("--") && !preservePositions)
                    return everythingExceptNewLines.Replace(me.Value, ""); // preserve only line-breaks // Environment.NewLine;
         else if (me.Value.StartsWith("[") || me.Value.StartsWith("\""))
                    return me.Value; // do not remove object identifiers ever
         else if (!removeLiterals) // Keep the literal strings
             return me.Value;
                else if (removeLiterals && preservePositions) // remove literals, but preserving positions and line-breaks
         {
                    var literalWithLineBreaks = everythingExceptNewLines.Replace(me.Value, " ");
                    return "'" + literalWithLineBreaks.Substring(1, literalWithLineBreaks.Length - 2) + "'";
                }
                else if (removeLiterals && !preservePositions) // wrap completely all literals
             return "''";
                else
                    throw new System.NotImplementedException();
            },
            System.Text.RegularExpressions.RegexOptions.Singleline | System.Text.RegularExpressions.RegexOptions.IgnorePatternWhitespace);
        return noComments;
    }
    
    0 讨论(0)
提交回复
热议问题