Check out the following example. It shows that searching within a Unicode string (nvarchar) is almost eight times as slow as searching within a varchar string, and on par wit
My guess is that LIKE is implemented using an O(n^2) algorithm as opposed to an O(n) algorithm; it would probably have to be for the leading % to work. Since the Unicode string is twice as long, that seems consistent with your numbers.
Looking for an explanation for this.
nvarchar is 16-bit, and Unicode comparison rules are a lot more complicated than ASCII rules: special characters for the many languages supported at the same time require quite a bit more processing.
A LIKE search with only a trailing wildcard (LIKE 'abc%') can be implemented as a > and < range seek, but a pattern wrapped in wildcards (LIKE '%abc%') cannot. The more rows there are, the more processing time it takes, since SQL Server can't really make effective use of indexes or statistics for %...% searches.
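To make that concrete, here is a small sketch (the table t, column code, and index ix_t_code are hypothetical, not from the question): a trailing-wildcard LIKE can be turned into a range seek, while a leading-wildcard LIKE cannot.

create table t (id int identity constraint pk_t primary key, code varchar(36))
create index ix_t_code on t (code)
go

-- Trailing wildcard: roughly equivalent to code >= 'abcd' and code < 'abce',
-- so the index on code can be seeked and its statistics are useful.
select COUNT(1) from t where code like 'abcd%'

-- Leading wildcard: no range can be derived, every row must be examined,
-- and the optimizer has little statistical help for estimating the row count.
select COUNT(1) from t where code like '%abcd%'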
Additionally, Unicode search requires extra storage, and along with the collation complications it will typically not be as efficient as the plain-vanilla varchar search. The fastest collation search, as you have observed, is the binary collation search.
These kinds of searches are best suited to Full-Text Search, or can be implemented with a Fuzzy Lookup and an in-memory hash table if you have lots of RAM and a fairly static table.
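As a rough sketch of the Full-Text route, reusing the hypothetical table t from above (the catalog name is also made up, and note that full-text matches whole words or word prefixes, not arbitrary substrings the way LIKE '%abcd%' does):

create fulltext catalog ft_demo as default
create fulltext index on dbo.t (code) key index pk_t on ft_demo
go

-- CONTAINS uses the full-text index instead of scanning every row with LIKE;
-- "abcd*" is a prefix term, matching words that start with abcd.
select COUNT(1) from dbo.t where CONTAINS(code, '"abcd*"')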
HTH
It's because the sorting rules for Unicode characters are more complicated than those for non-Unicode characters.
But things are not as simple as varchar vs. nvarchar.
You also have to consider SQL collations vs. Windows collations, as explained here.
SQL Server performs string comparisons of non-Unicode data defined with a Windows collation by using Unicode sorting rules. Because these rules are much more complex than non-Unicode sorting rules, they are more resource-intensive. So, although Unicode sorting rules are frequently more expensive, there is generally little difference in performance between Unicode data and non-Unicode data defined with a Windows collation.
As the documentation states, with a Windows collation SQL Server uses Unicode sorting rules even for varchar, so you get no performance gain from varchar there.
Here is an example:
-- Server default collation is Latin1_General_CI_AS
create table test
(
    testid int identity primary key,
    v varchar(36) COLLATE Latin1_General_CI_AS,               -- Windows collation
    v_sql varchar(36) COLLATE SQL_Latin1_General_CP1_CI_AS,   -- SQL collation
    nv nvarchar(36),
    filler char(500)
)
go

set nocount on
set statistics time off
-- populate all three string columns with random GUID text
insert test (v, v_sql, nv)
select CAST(newid() as varchar(36)),
       CAST(newid() as varchar(36)),
       CAST(newid() as nvarchar(36))
go 1000000
set statistics time on

-- search varchar with the SQL collation
select COUNT(1) from test where v_sql like '%abcd%' option (maxdop 1)
-- CPU time = 625 ms, elapsed time = 620 ms.

-- search varchar with the Windows collation
select COUNT(1) from test where v like '%abcd%' option (maxdop 1)
-- CPU time = 3141 ms, elapsed time = 3389 ms.

-- search varchar with the Windows collation using a unicode pattern (uses CONVERT_IMPLICIT)
select COUNT(1) from test where v like N'%abcd%' option (maxdop 1)
-- CPU time = 3203 ms, elapsed time = 3209 ms.

-- search nvarchar
select COUNT(1) from test where nv like N'%abcd%' option (maxdop 1)
-- CPU time = 3156 ms, elapsed time = 3151 ms.
As you can see, there is no difference between varchar and nvarchar when a Windows collation is used.
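To tie this back to the binary-collation observation in the earlier answer, you can force a binary collation on the same test table at query time. The query below is a sketch rather than a measurement from the original post; note that a binary collation is also case-sensitive, and newid() produces upper-case hex digits, so the pattern is upper-cased here.

-- Binary comparison is a plain byte-by-byte scan with no linguistic rules,
-- which is why it is the cheapest way to run a '%...%' LIKE.
select COUNT(1) from test
where v COLLATE Latin1_General_BIN2 like '%ABCD%'
option (maxdop 1)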
Note: it seems that SQL collations are only included for legacy purposes and should not be used for new projects (even though they appear to perform better here).
I've seen similar problems in SQL Server. There was a case where I was using parameterized queries: my parameter was Unicode (the default in .NET) and the field was varchar (so not Unicode). What I ended up with was every index value being converted to Unicode just to do a simple index lookup. This might be related, in that the entire string may be getting translated to another character set just to do the comparison. Also, with nvarchar "a" can compare equal to "á" (depending on the collation), meaning there is a lot more work going on to figure out whether two strings are equal under Unicode rules. Finally, you might want to use full-text indexing, although I'm not sure whether it solves your problem.
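A minimal T-SQL sketch of the mismatch described above, against the test table from the other answer (variable names are hypothetical; whether a seek survives the conversion depends on the collation, the index, and the plan):

-- Simulating a .NET parameter sent as nvarchar against a varchar column:
declare @p_unicode nvarchar(36) = N'ABCD1234'
select COUNT(1) from test where v = @p_unicode
-- v is widened via CONVERT_IMPLICIT to nvarchar; with a SQL collation and an
-- index on v, this typically turns a seek into a scan of every index value.

-- Declaring the parameter with the column's own type avoids the conversion:
declare @p_ansi varchar(36) = 'ABCD1234'
select COUNT(1) from test where v = @p_ansi
-- matching types, no conversion, and an index on v could be seeked directly.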