It seems that SQL Server uses Unicode UCS-2, a 2-byte fixed-length character encoding, for nchar/nvarchar
fields. Meanwhile, C# uses Unicode
I don't expect that treating the text as UCS-2 would cause many problems.
Case conversions should not be a problem, because (AFAIK) there are no case mappings above the BMP (except the identity mapping, of course!), and, obviously, the surrogate characters are going to map to themselves.
Blanking every other character is just asking for trouble. In reality, doing these sorts of transformations without consideration of the character values is always a dangerous activity. I can see it happening legitimately with string truncations. But if any unmatched surrogates show up in the result, this itself is not a huge problem. Any system that receives such data—and cares—will probably just replace the unmatched surrogate with a replacement character, if it bothers to do anything about it at all.
Obviously, string length is going to be bytes/2 rather than number-of-characters, but number-of-characters is not a very useful value anyway, once you start plumbing the depths of the Unicode code charts. For example, you aren't going to get good results in monospaced display once you leave the ASCII range, because of combining characters, RTL languages, directional control characters, tags, and several kinds of space characters. The high code points are going to be the least of your problems.
Just to be on the safe side, you should probably store your cuneiform texts in a different column than the archeologist's names. :D
UPDATE now with empirical data!
I just ran a test to see what happens with case transformations. I created a string with the English word TEST in uppercase twice—first in Latin script, then in Deseret script. I applied a lower-case transformation to this string in .NET and in SQL Server.
The .NET version correctly lowercased all the letters in both scripts. The SQL Server version only lowercased the Latin characters and left the Deseret characters unchanged. This meets with expectations regarding the handling of UTF-16 verses UCS-2.
using System;
using System.Data.SqlClient;
class Program
{
static void Main(string[] args)
{
string myDeseretText = "TEST\U00010413\U00010407\U0001041D\U00010413";
string dotNetLower = myDeseretText.ToLower();
string dbLower = LowercaseInDb(myDeseretText);
Console.WriteLine(" Original: {0}", DisplayUtf16CodeUnits(myDeseretText));
Console.WriteLine(".NET Lower: {0}", DisplayUtf16CodeUnits(dotNetLower));
Console.WriteLine(" DB Lower: {0}", DisplayUtf16CodeUnits(dbLower));
Console.ReadLine();
}
private static string LowercaseInDb(string value)
{
SqlConnectionStringBuilder connection = new SqlConnectionStringBuilder();
connection.DataSource = "(local)";
connection.IntegratedSecurity = true;
using (SqlConnection conn = new SqlConnection(connection.ToString()))
{
conn.Open();
string commandText = "SELECT LOWER(@myString) as LoweredString";
using (SqlCommand comm = new SqlCommand(commandText, conn))
{
comm.CommandType = System.Data.CommandType.Text;
comm.Parameters.Add("@myString", System.Data.SqlDbType.NVarChar, 100);
comm.Parameters["@myString"].Value = value;
using (SqlDataReader reader = comm.ExecuteReader())
{
reader.Read();
return (string)reader["LoweredString"];
}
}
}
}
private static string DisplayUtf16CodeUnits(string value)
{
System.Text.StringBuilder sb = new System.Text.StringBuilder();
foreach (char c in value)
sb.AppendFormat("{0:X4} ", (int)c);
return sb.ToString();
}
}
Output:
Original: 0054 0045 0053 0054 D801 DC13 D801 DC07 D801 DC1D D801 DC13
.NET Lower: 0074 0065 0073 0074 D801 DC3B D801 DC2F D801 DC45 D801 DC3B
DB Lower: 0074 0065 0073 0074 D801 DC13 D801 DC07 D801 DC1D D801 DC13
Just in case anyone has a Deseret font installed, here are the actual strings for your enjoyment:
Original: TEST
It's all a bit of a fudge really.
First the similarities
nchar
/nvarchar
/ntext
data types store text as a string of 2-byte characters. It doesn't really care what you put in them until you come to do searching and sorting (then it uses the appropriate Unicode collation sequence).String
data type also stores text as a string of 2-byte Char
s. It also doesn't really care what you put in it until you come to do searching and sorting (then it uses the appropriate culture-specific methods).Now the differences
String
, it will always encode the string as UTF-16 (with full multilingual plane support).In short, as long as you treat both CLR and SQL Server string variables as whole blobs of text, then you can freely assign from one to the other with no loss of information. The underlying storage format is exactly the same, even though the abstractions layered on top are slightly different.