SQL CLR function based on .net ComputeHash is not working with Cyrrilic

问题

I have written the following SQL CLR function in order to hash string values larger then 8000 bytes (the limit of input value of the T-SQL built-it HASHBYTES function):

[SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true)]
public static SqlBinary HashBytes(SqlString algorithm, SqlString value)
{
    HashAlgorithm algorithmType = HashAlgorithm.Create(algorithm.Value);

    if (algorithmType == null || value.IsNull)
    {
        return new SqlBinary();
    }
    else
    {
        byte[] bytes = Encoding.UTF8.GetBytes(value.Value);
        return new SqlBinary(algorithmType.ComputeHash(bytes));
    }
}

It is working fine for Latin strings. For example, the following hashes are the same:

SELECT dbo.fn_Utils_GetHashBytes ('MD5', 'test'); -- 0x098F6BCD4621D373CADE4E832627B4F6
SELECT HASHBYTES ('MD5', 'test');                 -- 0x098F6BCD4621D373CADE4E832627B4F6

The issue is it is not working with Cyrillic strings. For example:

SELECT dbo.fn_Utils_GetHashBytes ('MD5 ', N'даровете на влъхвите') -- NULL
SELECT HashBytes ('MD5 ',N'даровете на влъхвите') -- 0x838B1B625A6074B2BE55CDB7FCEA2832

SELECT dbo.fn_Utils_GetHashBytes ('SHA256', N'даровете на влъхвите') -- 0xA1D65374A0B954F8291E00BC3DD9DF655D8A4A6BF127CFB15BBE794D2A098844
SELECT HashBytes ('SHA2_256',N'даровете на влъхвите') -- 0x375F6993E0ECE1864336E565C8E14848F2A4BAFCF60BC0C8F5636101DD15B25A

I am getting NULL for MD5, although the code returns value if it is executed as console application. Could anyone tell what I am doing wrong?

Also, I've got the function from here and one of the comments says that:

Careful with CLR SP parameters being silently truncated to 8000 bytes - I had to tag the parameter with [SqlFacet(MaxSize = -1)] otherwise bytes after the 8000th would simply be ignored!

but I have tested this and it is working fine. For example, if I generate a hash of 8000 bytes string and a second hash of the same string plus one symbol, I get the hashes are different.

DECLARE @A VARCHAR(MAX) = '8000 bytes string...'
DECLARE @B VARCHAR(MAX) = @A + '1'
SELECT LEN(@A), LEN(@B)

SELECT IIF(dbo.fn_Utils_GetHashBytes ('MD5', @A + '1') = dbo.fn_Utils_GetHashBytes ('MD5', @B), 1, 0) -- 0

Should I worry about this?

回答1:

 Encoding.UTF8.GetBytes(...)

SQL Server has no concept of UTF-8. Use UCS-2 (UTF-16) or ASCII. The encoding used must match what you'd pass to HASHBYTES. You can easily see that HASHBYTES will hash differently VARCHAR vs. NVARCHAR:

select HASHBYTES('MD5', 'Foo')  -- 0x1356C67D7AD1638D816BFB822DD2C25D
select HASHBYTES('MD5', N'Foo') -- 0xB25FF0AD90D09D395090E8A29FF4C63C

Best would be to change the SQLCLR function to accept the bytes, not a string, and deal with the cast to VARBINARY in the caller.

 SELECT dbo.fn_Utils_GetHashBytes ('MD5', CAST(N'даровете на влъхвите' AS VARBINARY(MAX));

FYI SQL Server 2016 has lifted the 8000 bytes restriction on HASHBYTES:

For SQL Server 2014 and earlier, allowed input values are limited to 8000 bytes.

回答2:

For a detailed walk-through that explains why you are seeing the differences, please see my answer to the following Question:

TSQL md5 hash different to C# .NET md5

And for anyone who does not wish to compile and deploy this themselves, this function is available in the Free version of the SQL# library of SQLCLR functions, stored procedures, etc (which I am the creator of, but Util_Hash and Util_HashBinary, among many others, are free). There is one difference between what is shown in the Question the two Util_Hash* functions in SQL#: the function shown in the Question takes a NVARCHAR / SqlString input parameter whereas the SQL# functions take VARBINARY / SqlBinary input. The differences are:

Accepting VARBINARY input also works for binary source data (files, images, encrypted values, etc)
While accepting VARBINARY input does require an extra step of doing a CONVERT(VARBINARY(MAX), source_string) in the call to the function, doing so preserves whatever Code Page is being used for VARCHAR data. While not used that often, this can be handy when working with non-Unicode data.

Regarding the warning from the other post of:

Careful with CLR SP parameters being silently truncated to 8000 bytes - I had to tag the parameter with [SqlFacet(MaxSize = -1)] otherwise bytes after the 8000th would simply be ignored!

and yet you not experiencing the same thing: this due to changes in how SSDT generates the T-SQL wrapper objects for SQLCLR objects. In earlier versions (especially those that came with Visual Studio prior to VS 2013), the default behavior was to use NVARCHAR(MAX) for SqlChars, and NVARCHAR(4000) for SqlString. But then at some point (I don't want to say as of VS 2013 since Visual Studio and SSDT are independent products even though VS comes with SSDT) the default was changed to use NVARCHAR(MAX) for both SqlChars and SqlString. The person who posted the warning (on 2013-02-06) must have been using an earlier version of SSDT. Still, it doesn't hurt (and is even a good practice) to be explicit and use [SqlFacet(MaxSize = -1)].

Regarding the if (algorithmType == null || value.IsNull) logic: since either one being NULL should return a NULL, you might be better off removing that logic and using the WITH RETURNS NULL ON NULL INPUT option of the CREATE FUNCTION statement. This option, however, is unfortunately not supported via any SSDT construct (i.e. no SqlFacet for it). So in order to get this option enabled, you can create a Post-Deployment SQL script (which will automatically deploy after the main script), that issues an ALTER FUNCTION with the desired definition. And it wouldn't hurt to vote for my Connect suggestion to natively support this option: Implement OnNullCall property in SqlFunctionAttribute for RETURNS NULL ON NULL INPUT SQLCLR. On a practical level, the performance gain would be mainly seen in situation where you are passing in large values for the @value parameter but yet somehow @algorithm is NULL and so you don't end up using the value of @value. The reason to use the RETURNS NULL ON NULL INPUT option is that when you call a SQLCLR function passing in either SqlString or SqlBinary, the entire value is copied over to the App Domain's memory. That is time, memory, and CPU you don't need to waste if you know ahead of time that you won't be using it :-). You might also see a gain, even if passing in smaller values, on functions that are called very frequently.

Side note regarding the warning and your test: SQLCLR does not support VARCHAR, only NVARCHAR. Hence, there never was a limit of 8000 since the limit would have been at 4000 characters had SSDT not automatically been using NVARCHAR(MAX). So if there would have been a difference, it would have been seen first testing with only 4000 and 4001 characters.

来源：https://stackoverflow.com/questions/35266938/sql-clr-function-based-on-net-computehash-is-not-working-with-cyrrilic

标签

.net

tsql

hash

sqlclr