I am trying to remove all the comments from a NVARCHAR
value.
I don\'t know which value I will get to the NVARCHAR
variable and I need to r
Using ngramsN4k:
CREATE FUNCTION dbo.NGramsN4K
(
@string nvarchar(4000), -- Input string
@N int -- requested token size
)
/****************************************************************************************
Purpose:
A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
based on an input string (@string). Accepts strings up to 4000 nvarchar characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
Compatibility:
SQL Server 2008+, Azure SQL Database
Syntax:
--===== Autonomous
SELECT position, token FROM dbo.NGramsN4K(@string,@N);
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable s
CROSS APPLY dbo.NGramsN4K(s.SomeValue,@N) ng;
Parameters:
@string = The input string to split into tokens.
@N = The size of each token returned.
Returns:
Position = bigint; the position of the token in the input string
token = nvarchar(4000); a @N-sized character-level N-Gram token
Developer Notes:
1. NGramsN4K is not case sensitive
2. Many functions that use NGramsN4K will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not chose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When @N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL);
4. NGramsN4K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
SELECT position, token FROM dbo.NGramsN4K('abcd',1); -- unigrams (@N=1)
SELECT position, token FROM dbo.NGramsN4K('abcd',2); -- bigrams (@N=2)
SELECT position, token FROM dbo.NGramsN4K('abcd',3); -- trigrams (@N=3)
--===== How many times the substring "AB" appears in each record
DECLARE @table TABLE(stringID int identity primary key, string nvarchar(100));
INSERT @table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');
SELECT string, occurances = COUNT(*)
FROM @table t
CROSS APPLY dbo.NGramsN4K(t.string,2) ng
WHERE ng.token = 'AB'
GROUP BY string;
----------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20170324 - Initial Development - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
SELECT 1 FROM (VALUES -- 64 dummy values to CROSS join for 4096 rows
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($)) t(N)
),
iTally(N) AS
(
SELECT
TOP (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,''))/2)-(ISNULL(@N,1)-1)),0)))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort
FROM L1 a CROSS JOIN L1 b -- cartesian product for 4096 rows (16^2)
)
SELECT
position = N, -- position of the token in the string(s)
token = SUBSTRING(@string,CAST(N AS int),@N) -- the @N-Sized token
FROM iTally
WHERE @N > 0
-- Protection against bad parameter values
AND @N <= (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,''))/2)-(ISNULL(@N,1)-1)),0)));
You can solve it using the solution below. This will be limited to NVARCHAR(4000)
but I can put together an NVARCHAR(max)
version if you need one. Also note that my solution ignores lines that begin with "--" and grabs everything up to "--" where the comment is deeper in. I'm not adressing /* this comment style */
but could be modified to do so.
Solution
-- sample stored proc
declare @storedproc varchar(8000) =
'-- Some Comments
SET NOCOUNT ON;
-- Some Comments
SELECT FirstName -- we only need the first name
FROM dbo.Users WHERE Id = @Id;';
--select @storedproc;
-- Solution
select cleanedProc =
(
select substring(item, 1, isnull(nullif(charindex('--', item),0)-1,nextPos))+br
from
(
select 0 union all
select position from dbo.ngramsN4k(@storedproc,1)
where token = char(10)
) d(position)
cross apply (values (char(10), d.position+1,
isnull(nullif(charindex(char(10), @storedproc, d.position+1),0),8000))
) p(br, startPos, nextPos)
cross apply (values (substring(@storedproc, startPos, nextPos-startPos))) split(item)
where item not like '--%'
order by position
for xml path(''), type
).value('(text())[1]', 'varchar(8000)');
before
-- Some Comments
SET NOCOUNT ON;
-- Some Comments
SELECT FirstName -- we only need the first name
FROM dbo.Users WHERE Id = @Id;
after
SET NOCOUNT ON;
SELECT FirstName
FROM dbo.Users WHERE Id = @Id;