Question:
Can anybody give me a working regex expression (C#/VB.NET) that can remove single line comments from a SQL statement ?
I mean these comments:
I will disappoint all of you. This can't be done with regular expressions. Sure, it's easy to find comments not in a string (that even the OP could do), the real deal is comments in a string. There is a little hope of the look arounds, but that's still not enough. By telling that you have a preceding quote in a line won't guarantee anything. The only thing what guarantees you something is the oddity of quotes. Something you can't find with regular expression. So just simply go with non-regular-expression approach.
EDIT: Here's the c# code:
String sql = "--this is a test\r\nselect stuff where substaff like '--this comment should stay' --this should be removed\r\n";
char[] quotes = { '\'', '"'};
int newCommentLiteral, lastCommentLiteral = 0;
while ((newCommentLiteral = sql.IndexOf("--", lastCommentLiteral)) != -1)
{
int countQuotes = sql.Substring(lastCommentLiteral, newCommentLiteral - lastCommentLiteral).Split(quotes).Length - 1;
if (countQuotes % 2 == 0) //this is a comment, since there's an even number of quotes preceding
{
int eol = sql.IndexOf("\r\n") + 2;
if (eol == -1)
eol = sql.Length; //no more newline, meaning end of the string
sql = sql.Remove(newCommentLiteral, eol - newCommentLiteral);
lastCommentLiteral = newCommentLiteral;
}
else //this is within a string, find string ending and moving to it
{
int singleQuote = sql.IndexOf("'", newCommentLiteral);
if (singleQuote == -1)
singleQuote = sql.Length;
int doubleQuote = sql.IndexOf('"', newCommentLiteral);
if (doubleQuote == -1)
doubleQuote = sql.Length;
lastCommentLiteral = Math.Min(singleQuote, doubleQuote) + 1;
//instead of finding the end of the string you could simply do += 2 but the program will become slightly slower
}
}
Console.WriteLine(sql);
What this does: find every comment literal. For each, check if it's within a comment or not, by counting the number of quotes between the current match and the last one. If this number is even, then it's a comment, thus remove it (find first end of line and remove whats between). If it's odd, this is within a string, find the end of the string and move to it. Rgis snippet is based on a wierd SQL trick: 'this" is a valid string. Even tho the 2 quotes differ. If it's not true for your SQL language, you should try a completely different approach. I'll write a program to that too if that's the case, but this one's faster and more straightforward.
You want something like this for the simple case
-{2,}.*
The -{2,} looks for a dash that happens 2 or more times
The .* gets the rest of the lines up to the newline
*But, for the edge cases, it appears that SinistraD is correct in that you cannot catch everything, however here is an article about how this can be done in C# with a combination of code and regex.
I don't know if C#/VB.net regex is special in some way but traditionally s/--.*//
should work.
This seems to work well for me so far; it even ignores comments within strings, such as SELECT '--not a comment--' FROM ATable
private static string removeComments(string sql)
{
string pattern = @"(?<=^ ([^'""] |['][^']*['] |[""][^""]*[""])*) (--.*$|/\*(.|\n)*?\*/)";
return Regex.Replace(sql, pattern, "", RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
}
Note: it is designed to eliminate both /**/
-style comments as well as --
style. Remove |/\*(.|\n)*?\*/
to get rid of the /**/
checking. Also be sure you are using the RegexOptions.IgnorePatternWhitespace
Regex option!!
I wanted to be able to handle double-quotes too, but since T-SQL doesn't support them, you could get rid of |[""][^""]*[""]
too.
Adapted from here.
Note (Mar 2015): In the end, I wound up using Antlr, a parser generator, for this project. There may have been some edge cases where the regex didn't work. In the end I was much more confident with the results having used Antlr, and it's worked well.
Using System.Text.RegularExpressions;
public static string RemoveSQLCommentCallback(Match SQLLineMatch)
{
System.Text.StringBuilder sb = new System.Text.StringBuilder();
bool open = false; //opening of SQL String found
char prev_ch = ' ';
foreach (char ch in SQLLineMatch.ToString())
{
if (ch == '\'')
{
open = !open;
}
else if ((!open && prev_ch == '-' && ch == '-'))
{
break;
}
sb.Append(ch);
prev_ch = ch;
}
return sb.ToString().Trim('-');
}
The code
public static void Main()
{
string sqlText = "WHERE DEPT_NAME LIKE '--Test--' AND START_DATE < SYSDATE -- Don't go over today";
//for every matching line call callback func
string result = Regex.Replace(sqlText, ".*--.*", RemoveSQLCommentCallback);
}
Let's replace, find all the lines that match dash dash comment and call your parsing function for every match.
As a late solution, the simplest way is to do it using ScriptDom-TSqlParser:
// https://michaeljswart.com/2014/04/removing-comments-from-sql/
// http://web.archive.org/web/*/https://michaeljswart.com/2014/04/removing-comments-from-sql/
public static string StripCommentsFromSQL(string SQL)
{
Microsoft.SqlServer.TransactSql.ScriptDom.TSql150Parser parser =
new Microsoft.SqlServer.TransactSql.ScriptDom.TSql150Parser(true);
System.Collections.Generic.IList<Microsoft.SqlServer.TransactSql.ScriptDom.ParseError> errors;
Microsoft.SqlServer.TransactSql.ScriptDom.TSqlFragment fragments =
parser.Parse(new System.IO.StringReader(SQL), out errors);
// clear comments
string result = string.Join(
string.Empty,
fragments.ScriptTokenStream
.Where(x => x.TokenType != Microsoft.SqlServer.TransactSql.ScriptDom.TSqlTokenType.MultilineComment)
.Where(x => x.TokenType != Microsoft.SqlServer.TransactSql.ScriptDom.TSqlTokenType.SingleLineComment)
.Select(x => x.Text));
return result;
}
or instead of using the Microsoft-Parser, you can use ANTL4 TSqlLexer
or without any parser at all:
private static System.Text.RegularExpressions.Regex everythingExceptNewLines =
new System.Text.RegularExpressions.Regex("[^\r\n]");
// http://drizin.io/Removing-comments-from-SQL-scripts/
// http://web.archive.org/web/*/http://drizin.io/Removing-comments-from-SQL-scripts/
public static string RemoveComments(string input, bool preservePositions, bool removeLiterals = false)
{
//based on http://stackoverflow.com/questions/3524317/regex-to-strip-line-comments-from-c-sharp/3524689#3524689
var lineComments = @"--(.*?)\r?\n";
var lineCommentsOnLastLine = @"--(.*?)$"; // because it's possible that there's no \r\n after the last line comment
// literals ('literals'), bracketedIdentifiers ([object]) and quotedIdentifiers ("object"), they follow the same structure:
// there's the start character, any consecutive pairs of closing characters are considered part of the literal/identifier, and then comes the closing character
var literals = @"('(('')|[^'])*')"; // 'John', 'O''malley''s', etc
var bracketedIdentifiers = @"\[((\]\])|[^\]])* \]"; // [object], [ % object]] ], etc
var quotedIdentifiers = @"(\""((\""\"")|[^""])*\"")"; // "object", "object[]", etc - when QUOTED_IDENTIFIER is set to ON, they are identifiers, else they are literals
//var blockComments = @"/\*(.*?)\*/"; //the original code was for C#, but Microsoft SQL allows a nested block comments // //https://msdn.microsoft.com/en-us/library/ms178623.aspx
//so we should use balancing groups // http://weblogs.asp.net/whaggard/377025
var nestedBlockComments = @"/\*
(?>
/\* (?<LEVEL>) # On opening push level
|
\*/ (?<-LEVEL>) # On closing pop level
|
(?! /\* | \*/ ) . # Match any char unless the opening and closing strings
)+ # /* or */ in the lookahead string
(?(LEVEL)(?!)) # If level exists then fail
\*/";
string noComments = System.Text.RegularExpressions.Regex.Replace(input,
nestedBlockComments + "|" + lineComments + "|" + lineCommentsOnLastLine + "|" + literals + "|" + bracketedIdentifiers + "|" + quotedIdentifiers,
me => {
if (me.Value.StartsWith("/*") && preservePositions)
return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks // return new string(' ', me.Value.Length);
else if (me.Value.StartsWith("/*") && !preservePositions)
return "";
else if (me.Value.StartsWith("--") && preservePositions)
return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks
else if (me.Value.StartsWith("--") && !preservePositions)
return everythingExceptNewLines.Replace(me.Value, ""); // preserve only line-breaks // Environment.NewLine;
else if (me.Value.StartsWith("[") || me.Value.StartsWith("\""))
return me.Value; // do not remove object identifiers ever
else if (!removeLiterals) // Keep the literal strings
return me.Value;
else if (removeLiterals && preservePositions) // remove literals, but preserving positions and line-breaks
{
var literalWithLineBreaks = everythingExceptNewLines.Replace(me.Value, " ");
return "'" + literalWithLineBreaks.Substring(1, literalWithLineBreaks.Length - 2) + "'";
}
else if (removeLiterals && !preservePositions) // wrap completely all literals
return "''";
else
throw new System.NotImplementedException();
},
System.Text.RegularExpressions.RegexOptions.Singleline | System.Text.RegularExpressions.RegexOptions.IgnorePatternWhitespace);
return noComments;
}