TSQL “Illegal XML Character” When Converting Varbinary to XML

早过忘川 提交于 2021-01-29 07:20:18

问题


I'm trying to create a stored procedure in SQL Server 2016 that converts XML that was previously converted into Varbinary back into XML, but getting an "Illegal XML character" error when converting. I've found a workaround that seems to work, but I can't actually figure out why it works, which makes me uncomfortable.

The stored procedure takes data that was converted to binary in SSIS and inserted into a varbinary(MAX) column in a table and performs a simple

CAST(Column AS XML)

It worked fine for a long time, and I only began seeing an issue when the initial XML started containing an ® (registered trademark) symbol.

Now, when I attempt to convert the binary to XML I get this error

Msg 9420, Level 16, State 1, Line 23
XML parsing: line 1, character 7, illegal xml character

However, if I first convert the binary to varchar(MAX), then convert that to XML, it seems to work fine. I don't understand what is happening when I perform that intermediate CAST that is different than casting directly to XML. My main concern is that I don't want to add it in to account for this scenario and end up with unintended consequences.

Test code:

DECLARE @foo VARBINARY(MAX)
DECLARE @bar VARCHAR(MAX)
DECLARE @Nbar NVARCHAR(MAX) 

--SELECT Varbinary
SET @foo = CAST( '<Test>®</Test>' AS VARBINARY(MAX)) 
SELECT @foo AsBinary


--select as binary as varchar
SET @bar = CAST(@foo AS VARCHAR(MAX))

SELECT @bar BinaryAsVarchar                             -- Correct string output

--select binary as nvarchar
SET @nbar = CAST(@foo AS NVARCHAR(MAX))
SELECT @nbar BinaryAsNvarchar                           -- Chinese characters 

--select binary as XML
SELECT TRY_CAST(@foo AS XML) BinaryAsXML                -- ILLEGAL XML character
-- SELECT CONVERT(xml, @obfoo) BinaryAsXML                    --ILLEGAL XML Character

--select BinaryAsVarcharAsXML
SELECT TRY_CAST(@bar AS XML) BinaryAsVarcharAsXML       -- Correct Output

--select BinaryAsNVarcharAsXML
SELECT TRY_CAST(@nbar AS XML) BinaryAsNvarcharAsXML     -- Chinese Characters

回答1:


There are several things to know:

  • SQL-Server is rather limited with character encodings. There is VARCHAR, which is 1-byte-encoded extended ASCII and NVARCHAR, which is UCS-2 (almost the same as utf-16).
  • VARCHAR uses plain latin for the first set of characters and a codepage-mapping provided by the collation in use for the second set.
  • VARCHAR is not utf-8. utf-8 works with VARCHAR, as long as all characters are 1-byte-enocded. But utf-8 knows a lot of 2-byte-enocded (up to 4-byte-enocded) characters, which would break the internal storage of a VARCHAR string.
  • NVARCHAR will work with almost any 2-byte encoded character natively (that means with almost any existing character). But it is not exactly utf-16 (there are 3-byte encoded characters, which would break SQL-Servers internal storage).
  • XML is not stored as the XML-string you see, but as an hierarchically organised physical table, based on NVARCHAR values.
  • The natively stored XML is really fast, while any text-based storage will need a very expensive parse-operation in advance (over and over...).
  • Storing XML as string is bad, storing XML as VARCHAR string is even worse.
  • Storing a VARCHAR-string-XML as VARBINARY is a cummulation of things you should not do.

Try this:

DECLARE @text1Byte VARCHAR(100)='<test>blah</test>';
DECLARE @text2Byte NVARCHAR(100)=N'<test>blah</test>';

SELECT CAST(@text1Byte AS VARBINARY(MAX)) AS text1Byte_Binary
      ,CAST(@text2Byte AS VARBINARY(MAX)) AS text2Byte_Binary
      ,CAST(@text1Byte AS XML) AS text1Byte_XML
      ,CAST(@text2Byte AS XML) AS text2Byte_XML
      ,CAST(CAST(@text1Byte AS VARBINARY(MAX)) AS XML) AS text1Byte_XML_via_Binary
      ,CAST(CAST(@text2Byte AS VARBINARY(MAX)) AS XML) AS text2Byte_XML_via_Binary

The only difference you'll see are the many zeros in 0x3C0074006500730074003E0062006C00610068003C002F0074006500730074003E00. This is due to the 2-byte-encoding of nvarchar, each second byte is not needed in this sample. But if you'd need far-east-characters the picture would be completely different.

The reason why it works: SQL-Server is very smart. The cast from the variable to XML is rather easy, as the engine knows, that the underlying variable is varchar or nvarchar. But the last two casts are different. The engine has to examine the binary, whether it is a valid nvarchar and will give it a second try with varchar if it fails.

Now try to add your registered trademark to the given example. Add it first to the second variable DECLARE @text2Byte NVARCHAR(100)=N'<test>blah®</test>'; and try to run this. Then add it to the first variable and try it again.

What you can try:

Cast your binary to varchar(max), then to nvarchar(max) and finally to xml.

,CAST(CAST(CAST(CAST(@text1Byte AS VARBINARY(MAX)) AS VARCHAR(MAX)) AS NVARCHAR(MAX)) AS XML) AS text1Byte_XML_via_Binary

This will work, but it won't be fast...



来源:https://stackoverflow.com/questions/53123042/tsql-illegal-xml-character-when-converting-varbinary-to-xml

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!