SQL Server uses high CPU when searching inside nvarchar strings

孤独总比滥情好 2020-12-08 15:05

Check out the following example. It shows that searching within a Unicode string (nvarchar) is almost eight times as expensive as searching within a varchar string. And on par wit

5 Answers
  • 2020-12-08 15:11

    My guess is that LIKE is implemented using an O(n^2) algorithm as opposed to an O(n) algorithm; it would probably have to be for the leading % to work. Since the Unicode string is twice as long, that seems consistent with your numbers.
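
    Whichever per-string algorithm is used, the leading % has another cost: it rules out an index seek entirely, so every row's string must be examined. A minimal sketch (table and index names are hypothetical):

    create table t (s varchar(100));
    create index ix_t_s on t (s);
    
    -- sargable: the trailing wildcard allows an index range seek
    select count(*) from t where s like 'abcd%';
    -- not sargable: the leading wildcard forces a scan of every row,
    -- with per-row substring matching inside each string
    select count(*) from t where s like '%abcd%';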

  • 2020-12-08 15:12

    As for an explanation:

    Nvarchar is 16-bit (two bytes per character), and Unicode comparison rules are a lot more complicated than ASCII rules: the special characters of the various languages that are supported at the same time require quite a bit more processing.
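
    To make the 16-bit point concrete, DATALENGTH shows the doubled storage that every comparison has to walk through:

    -- varchar stores one byte per character here; nvarchar (UTF-16) stores two
    select DATALENGTH('abcd')  as varchar_bytes,  -- 4
           DATALENGTH(N'abcd') as nvarchar_bytes; -- 8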

  • A LIKE '%term%' search cannot use an index seek; only a trailing-wildcard pattern such as LIKE 'abcd%' can be rewritten as a range with >= and <. So the more rows there are, the more processing time it takes, and SQL Server cannot make effective use of statistics for '%term%' searches.

    Additionally, a Unicode search requires extra storage, and along with the collation complications it will typically not be as efficient as a plain vanilla varchar search. The fastest collation search, as you have observed, is a binary-collation search.
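
    If the data permits it, you can force a binary collation for just the comparison. A sketch (table and column names are hypothetical; note that a binary collation is case- and accent-sensitive, so the pattern must match exactly):

    -- binary collation: cheapest comparison, but case/accent sensitive
    select count(1)
    from some_table
    where nv collate Latin1_General_BIN2 like N'%ABCD%';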

    These kinds of searches are best suited to Full-Text Search, or can be implemented as a fuzzy lookup with an in-memory hash table if you have lots of RAM and a fairly static table; see the sketch below.
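
    For word or prefix matching (full-text indexes cannot match arbitrary '%term%' infixes), a full-text index replaces the scan entirely. A sketch, assuming a table docs(id int primary key, body nvarchar(max)) whose primary-key index is named PK_docs:

    -- one-time setup (names are illustrative)
    create fulltext catalog ft_catalog as default;
    create fulltext index on docs (body) key index PK_docs;
    go
    
    -- word/prefix search via the full-text index instead of a LIKE scan
    select count(1) from docs where contains(body, '"abcd*"');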

    HTH

  • 2020-12-08 15:27

    It's because the sorting rules for Unicode characters are more complicated than the sorting rules for non-Unicode characters.

    But things are not as simple as varchar vs. nvarchar.

    You also have to consider SQL collations vs. Windows collations, as explained here:

    SQL Server performs string comparisons of non-Unicode data defined with a Windows collation by using Unicode sorting rules. Because these rules are much more complex than non-Unicode sorting rules, they are more resource-intensive. So, although Unicode sorting rules are frequently more expensive, there is generally little difference in performance between Unicode data and non-Unicode data defined with a Windows collation.

    As stated, with a Windows collation SQL Server uses Unicode sorting rules even for varchar, so you get no performance gain from choosing varchar.

    Here is an example:

    -- Server default collation is Latin1_General_CI_AS
    create table test
    (
        testid int identity primary key,
        v varchar(36) COLLATE Latin1_General_CI_AS, --windows collation
        v_sql varchar(36) COLLATE SQL_Latin1_General_CP1_CI_AS, --sql collation
        nv nvarchar(36),
        filler char(500)
    )
    go
    
    set nocount on
    set statistics time off
    insert test (v, v_sql, nv)
    select CAST (newid() as varchar(36)),
        CAST (newid() as varchar(36)),
        CAST (newid() as nvarchar(36))
    go 1000000
    
    set statistics time on
    
    -- search varchar column with SQL collation (non-Unicode sorting rules)
    select COUNT(1) from test where v_sql like '%abcd%' option (maxdop 1)
    -- CPU time = 625 ms,  elapsed time = 620 ms.
    
    -- search varchar column with Windows collation (Unicode sorting rules)
    select COUNT(1) from test where v like '%abcd%' option (maxdop 1)
    -- CPU time = 3141 ms,  elapsed time = 3389 ms.
    
    -- search varchar column with a unicode literal (uses CONVERT_IMPLICIT)
    select COUNT(1) from test where v like N'%abcd%' option (maxdop 1)
    -- CPU time = 3203 ms,  elapsed time = 3209 ms.
    
    -- search unicode string
    select COUNT(1) from test where nv like N'%abcd%' option (maxdop 1)
    -- CPU time = 3156 ms,  elapsed time = 3151 ms.
    

    As you can see, there is no difference between varchar and nvarchar when a Windows collation is used.

    Note: SQL collations seem to be included only for legacy purposes and should not be used for new projects, even where, as above, they perform better. You can verify which collations are in effect with the queries below.
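
    -- server-level collation
    select SERVERPROPERTY('Collation') as server_collation;
    -- database-level collation
    select DATABASEPROPERTYEX(DB_NAME(), 'Collation') as database_collation;
    -- per-column collations of the test table above
    select name, collation_name from sys.columns
    where object_id = OBJECT_ID('test');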

  • 2020-12-08 15:33

    I've seen similar problems in SQL Server. There was a case where I was using parameterized queries, my parameter was Unicode (UTF-16, the default for .NET strings), and the field was varchar (so not Unicode). SQL Server ended up converting every index value to Unicode just to do a simple index lookup. This might be related, in that the entire string may be getting translated to another character set to do the comparison. Also, for nvarchar under an accent-insensitive collation, "a" compares equal to "á", meaning there is a lot more work going on to figure out whether two strings are equal in Unicode. Finally, you might want to use full-text indexing, although I'm not sure it solves your problem. A sketch of the parameter-type mismatch follows.
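
    This reuses the test table from the answer above; the variable names are illustrative. The actual-execution plan shows a CONVERT_IMPLICIT on the column side in the first query:

    -- nvarchar parameter vs varchar column: the column gets converted,
    -- which can prevent a plain index seek
    declare @needle_n nvarchar(36) = N'ABCD';
    select count(1) from test where v = @needle_n;
    
    -- parameter type matches the column: no conversion needed
    declare @needle varchar(36) = 'ABCD';
    select count(1) from test where v = @needle;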
