This is not a question on how to overcome the "XML parsing: ... illegal xml character" error, but about why it is happening? I know tha
The MSDN guidelines say:
SQLXML 4.0 relies upon the limited support for DTDs provided in SQL Server. SQL Server allows for an internal DTD in xml data type data, which can be used to supply default values and to replace entity references with their expanded contents. SQLXML passes the XML data "as is" (including the internal DTD) to the server. You can convert DTDs to XML Schema (XSD) documents using third-party tools, and load the data with inline XSD schemas into the database.
Please permit me to answer my own question, so that I understand it fully myself. I won't accept this as the answer; it is the combination of the other answers that led me here. If this answer helps you in the future, please upvote the other posts as well.
The basic underlying rule is that XML with Unicode characters should be passed to, and parsed as, Unicode by SQL Server. Therefore C# should generate the XML as UTF-16, which is the default for both SSMS and .NET.
This variable declares XML with UTF-8 encoding, but the literal en-dash cannot be used unless it is actually encoded as UTF-8. The string below is a plain varchar literal, so the en-dash reaches the parser as a single code-page byte (150 on a typical code page 1252 collation) rather than as a UTF-8 sequence. This is wrong:
DECLARE @badxml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 3, character 29, illegal xml character
Another approach that doesn't work is to switch UTF-8 to UTF-16 in the XML declaration. The string literal is still not Unicode, so the implicit conversion fails:
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 1, character 56, unable to switch the encoding
Alternatives that work are:
1) Leave the declaration as UTF-8 but write the en-dash as a hexadecimal character reference:
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option &#x2013; Bar" />
</records>';
2) As above, but with a decimal character reference:
DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option &#8211; Bar" />
</records>';
3) Keep the literal en-dash, but remove the UTF-8 encoding from the declaration (SSMS then applies its default, UTF-16):
DECLARE @xml xml = '<?xml version="1.0" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
4) Retain the UTF-16 declaration, but pass the string as Unicode (note the N prefix on the literal, which makes it nvarchar before it is converted to XML):
DECLARE @xml xml = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option – Bar" />
</records>';
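The code point behind the references in alternatives 1 and 2 can be checked directly in T-SQL. The following is a minimal sketch rather than part of the original answers: the column aliases and the @check variable are illustrative, and it assumes the script reaches the server with the en-dash intact (hence the N prefix).

```sql
-- The N prefix keeps the en-dash as Unicode, so UNICODE() sees U+2013.
SELECT
    UNICODE(N'–') AS DecimalCode,   -- 8211, usable as &#8211;
    CONVERT(varchar(4), CONVERT(varbinary(2), UNICODE(N'–')), 2) AS HexCode;   -- '2013', usable as &#x2013;

-- Round trip: parse a character reference, then read back the code point
-- of the 8th character of the attribute value ("Option " is 7 characters).
DECLARE @check xml = '<r RecordName="Option &#x2013; Bar" />';
SELECT UNICODE(SUBSTRING(@check.value('(/r/@RecordName)[1]', 'nvarchar(100)'), 8, 1)) AS ParsedCode;   -- 8211
```

Because the references use only ASCII characters, they survive any string type and any encoding declaration, which is why alternatives 1 and 2 work even in a varchar literal.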
SQL Server internally uses UTF-16. Either leave the encoding out or cast to Unicode.
The reason you are looking for: with UTF-8 specified, the parser does not recognise this character.
--without your directive, SQL Server picks its default
declare @xml XML =
'<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
select @xml;
--or UNICODE, but you must use UTF-16
declare @xml2 XML =
CAST('<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>' AS NVARCHAR(MAX));
select @xml2
UTF-8 means that information is carried in chunks of 8 bits. The base (ASCII) characters need just one chunk, easy going...
Other characters are encoded as sequences of two to four chunks. Lead bytes such as "c2" and "c3" start two-chunk sequences, while the en-dash (U+2013) needs three chunks (E2 80 93). The internally used UTF-16 works with 2-byte units instead, and there the en-dash is a single unit.
Hope this is clear now...
This code will show you that the hyphen has the ASCII code 45 and your en-dash 150:
DECLARE @x VARCHAR(100)=
'<r RecordName="Option - Foo" /><r RecordName="Option – Bar" />';
WITH RunningNumbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr
FROM sys.objects
)
SELECT SUBSTRING(@x,Nmbr,1), ASCII(SUBSTRING(@x,Nmbr,1)) AS ASCII_Code
FROM RunningNumbers
WHERE ASCII(SUBSTRING(@x,Nmbr,1)) IS NOT NULL;
All characters that fit in 7 bits are "plain" and should encode without problems. The "extended ASCII" range depends on the code page and can vary: 150 might be an en-dash or something else. UTF-8 uses multi-byte sequences to make such characters "legal", and a lone byte 150 is not a valid sequence, so a parser that has been told to expect UTF-8 rejects it (this was new to me too).
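To look at the stored bytes rather than the ASCII() values, the same character can be cast to varbinary. This is a small sketch that assumes a code page 1252 collation (for example Latin1_General_CI_AS); on other collations, including the UTF-8 collations introduced in SQL Server 2019, the varchar result will differ. The column aliases are illustrative.

```sql
SELECT
    CAST('–'  AS varbinary(4)) AS VarcharBytes,    -- 0x96 on code page 1252: one byte, value 150
    CAST(N'–' AS varbinary(4)) AS NvarcharBytes;   -- 0x1320: U+2013 as one little-endian UTF-16 unit
```

Neither result is the three-byte UTF-8 form E2 80 93, so in both cases a utf-8 declaration does not describe the bytes the parser actually receives.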
Can you modify the XML encoding declaration? If so:
declare @xml XML = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?><records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
select @xml
(No column name)
<records><r RecordName="Option - Foo" /><r RecordName="Option – Bar" /></records>
Both of these fail with illegal xml character:
set @xml = '<?xml version="1.0" encoding="utf-8"?><x> – </x>'
set @xml = '<?xml version="1.0" encoding="utf-16"?><x> – </x>'
because they pass a non-Unicode varchar to the XML parser; the string contains Unicode so must be treated as such, i.e. as an nvarchar (UTF-16). Otherwise the 3 bytes comprising the – are misinterpreted as multiple characters, and one or more is not in the acceptable range for XML.
This does pass an nvarchar string to the parser, but fails with unable to switch the encoding:
set @xml = N'<?xml version="1.0" encoding="utf-8"?><x> – </x>'
This is because an nvarchar (UTF-16) string is passed to the XML parser but the XML document states it is UTF-8, and the – is not equivalent in the two encodings.
This works because everything is UTF-16:
set @xml = N'<?xml version="1.0" encoding="utf-16"?><x> – </x>'
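The four combinations above can be compared side by side. This sketch assumes SQL Server 2012 or later and relies on TRY_CAST returning NULL where a plain assignment would raise the parsing error; the column aliases are made up for illustration.

```sql
SELECT
    TRY_CAST( '<?xml version="1.0" encoding="utf-8"?><x> – </x>'  AS xml) AS VarcharUtf8,
    TRY_CAST( '<?xml version="1.0" encoding="utf-16"?><x> – </x>' AS xml) AS VarcharUtf16,
    TRY_CAST(N'<?xml version="1.0" encoding="utf-8"?><x> – </x>'  AS xml) AS NvarcharUtf8,
    TRY_CAST(N'<?xml version="1.0" encoding="utf-16"?><x> – </x>' AS xml) AS NvarcharUtf16;
```

Going by the behaviour described above, only the last column should come back non-NULL. TRY_CAST hides the error text, so use a plain SET or DECLARE when you need to see which message a failing combination produces.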