问题
I'm trying to create a stored procedure in SQL Server 2016 that converts XML that was previously converted into Varbinary
back into XML, but getting an "Illegal XML character" error when converting. I've found a workaround that seems to work, but I can't actually figure out why it works, which makes me uncomfortable.
The stored procedure takes data that was converted to binary in SSIS and inserted into a varbinary(MAX)
column in a table and performs a simple
CAST(Column AS XML)
It worked fine for a long time, and I only began seeing an issue when the initial XML started containing an ® (registered trademark) symbol.
Now, when I attempt to convert the binary to XML I get this error
Msg 9420, Level 16, State 1, Line 23
XML parsing: line 1, character 7, illegal xml character
However, if I first convert the binary to varchar(MAX)
, then convert that to XML, it seems to work fine. I don't understand what is happening when I perform that intermediate CAST that is different than casting directly to XML. My main concern is that I don't want to add it in to account for this scenario and end up with unintended consequences.
Test code:
DECLARE @foo VARBINARY(MAX)
DECLARE @bar VARCHAR(MAX)
DECLARE @Nbar NVARCHAR(MAX)
--SELECT Varbinary
SET @foo = CAST( '<Test>®</Test>' AS VARBINARY(MAX))
SELECT @foo AsBinary
--select as binary as varchar
SET @bar = CAST(@foo AS VARCHAR(MAX))
SELECT @bar BinaryAsVarchar -- Correct string output
--select binary as nvarchar
SET @nbar = CAST(@foo AS NVARCHAR(MAX))
SELECT @nbar BinaryAsNvarchar -- Chinese characters
--select binary as XML
SELECT TRY_CAST(@foo AS XML) BinaryAsXML -- ILLEGAL XML character
-- SELECT CONVERT(xml, @obfoo) BinaryAsXML --ILLEGAL XML Character
--select BinaryAsVarcharAsXML
SELECT TRY_CAST(@bar AS XML) BinaryAsVarcharAsXML -- Correct Output
--select BinaryAsNVarcharAsXML
SELECT TRY_CAST(@nbar AS XML) BinaryAsNvarcharAsXML -- Chinese Characters
回答1:
There are several things to know:
- SQL-Server is rather limited with character encodings. There is
VARCHAR
, which is 1-byte-encoded extended ASCII andNVARCHAR
, which isUCS-2
(almost the same asutf-16
). VARCHAR
uses plain latin for the first set of characters and a codepage-mapping provided by the collation in use for the second set.VARCHAR
is not utf-8.utf-8
works withVARCHAR
, as long as all characters are 1-byte-enocded. Bututf-8
knows a lot of 2-byte-enocded (up to 4-byte-enocded) characters, which would break the internal storage of aVARCHAR
string.NVARCHAR
will work with almost any 2-byte encoded character natively (that means with almost any existing character). But it is not exactlyutf-16
(there are 3-byte encoded characters, which would break SQL-Servers internal storage).- XML is not stored as the XML-string you see, but as an hierarchically organised physical table, based on
NVARCHAR
values. - The natively stored XML is really fast, while any text-based storage will need a very expensive parse-operation in advance (over and over...).
- Storing XML as string is bad, storing XML as
VARCHAR
string is even worse. - Storing a
VARCHAR
-string-XML asVARBINARY
is a cummulation of things you should not do.
Try this:
DECLARE @text1Byte VARCHAR(100)='<test>blah</test>';
DECLARE @text2Byte NVARCHAR(100)=N'<test>blah</test>';
SELECT CAST(@text1Byte AS VARBINARY(MAX)) AS text1Byte_Binary
,CAST(@text2Byte AS VARBINARY(MAX)) AS text2Byte_Binary
,CAST(@text1Byte AS XML) AS text1Byte_XML
,CAST(@text2Byte AS XML) AS text2Byte_XML
,CAST(CAST(@text1Byte AS VARBINARY(MAX)) AS XML) AS text1Byte_XML_via_Binary
,CAST(CAST(@text2Byte AS VARBINARY(MAX)) AS XML) AS text2Byte_XML_via_Binary
The only difference you'll see are the many zeros in 0x3C0074006500730074003E0062006C00610068003C002F0074006500730074003E00
. This is due to the 2-byte-encoding of nvarchar
, each second byte is not needed in this sample. But if you'd need far-east-characters the picture would be completely different.
The reason why it works: SQL-Server is very smart. The cast from the variable to XML is rather easy, as the engine knows, that the underlying variable is varchar
or nvarchar
. But the last two casts are different. The engine has to examine the binary, whether it is a valid nvarchar
and will give it a second try with varchar
if it fails.
Now try to add your registered trademark to the given example. Add it first to the second variable DECLARE @text2Byte NVARCHAR(100)=N'<test>blah®</test>';
and try to run this. Then add it to the first variable and try it again.
What you can try:
Cast your binary to varchar(max)
, then to nvarchar(max)
and finally to xml
.
,CAST(CAST(CAST(CAST(@text1Byte AS VARBINARY(MAX)) AS VARCHAR(MAX)) AS NVARCHAR(MAX)) AS XML) AS text1Byte_XML_via_Binary
This will work, but it won't be fast...
来源:https://stackoverflow.com/questions/53123042/tsql-illegal-xml-character-when-converting-varbinary-to-xml