Converting accented characters in varchar() to XML causing “illegal XML character”

迷失自我 2021-01-13 14:30

I have a table that is written to by an application. The field is varchar(max). The data looks like XML.

DECLARE @poit VARCHAR(100)
SET @poit = \'

        
2 answers
  • 2021-01-13 14:59

    I would try changing the datatype of your @poit variable from VARCHAR(100) to NVARCHAR(100). Then replace the utf-8 encoding with utf-16 so your code would look something like:

        DECLARE @poit NVARCHAR(100)
        SET @poit = N'<?xml version="1.0" encoding="utf-8"?><test>VÍA</test>'
        SELECT CONVERT(XML,REPLACE(@poit, 'utf-8', 'utf-16'))
    

    As long as you're not calling the conversion with the replace in it in a SELECT that returns oodles of results, the performance should be just fine and it will get the job done.

    Reference: http://xml.silmaril.ie/characters.html <- scroll down and you'll see some info as to the difference between utf-8 & utf-16. Hope this helps!

  • 2021-01-13 15:04

    <TL;DR> If you just want the answer without the full explanation, scroll down to the "Conclusion". But, you really should take a moment to read the explanation 😸 </TL;DR>

    There are a few things going on here:

    1. The encoding= attribute of the XML declaration (<?xml ... ?>) denotes how the underlying bytes of the XML document are to be interpreted. If the characters of the document within the string literal are already correct, then there is no need for the encoding attribute. If some characters are incorrect, then the encoding attribute can remain, as it tells the XML conversion what those characters originally were.

    2. UTF-8 is a Unicode encoding, yet you have the variable and literal as VARCHAR data, not NVARCHAR (which also requires prefixing the string literal with a capital-N). By using VARCHAR and no N-prefix, if there were any characters in the XML document that couldn't fit into the Code Page denoted by the default Collation of whatever database you are in when executing this query, you would have already lost those characters (even if you can see them on screen, they wouldn't be correct in the VARCHAR variable, or if you did a simple SELECT of that literal).

    3. Windows (and .NET, SQL Server, etc) use UTF-16 Little Endian. The Í character, Latin Capital Letter I with Acute, exists in both Code Page 1252 and UTF-16LE as value 205 (e.g. SELECT ASCII('Í'), CHAR(205); ), which is why it works when you remove the encoding="utf-8" and why you didn't "lose" that character by placing it in a VARCHAR literal and variable. HOWEVER, as shown on that linked page, the byte sequence in the UTF-8 encoding is 195, 141 (yes, two bytes). Meaning, that character, if it truly was UTF-8 encoded, would not appear to be that character when placed into a UTF-16LE environment.

      The XML conversion looks at that character's byte value of 205 (a single byte, since it is currently VARCHAR data) and tries to provide the UTF-16LE equivalent of what that sequence is in UTF-8. Except 205 by itself is not a valid UTF-8 sequence: it is the lead byte of a two-byte sequence and must be followed by a continuation byte in the range 128 to 191. The next byte here is the capital "A", which has a value of 65, so 205, 65 is not a valid two-byte sequence. This is why you get the "illegal XML character" error.

    4. Since the text on screen has to be UTF-16LE, if the source really was UTF-8, then the underlying UTF-8 byte sequence would have to be converted into UTF-16LE. The underlying byte sequence of Í is 195, 141. So we can create that sequence out of single-byte characters of Code Page 1252 (since this is, again, currently VARCHAR data) by doing the following:

      DECLARE @poit VARCHAR(100);
      SET @poit = '<?xml version="1.0" encoding="UTF-8"?><test>V'
                    + CHAR(195) + CHAR(141) + 'A</test>';
      SELECT CONVERT(XML, @poit);
      

      Returns:

      <test>VÍA</test>
      

      Data is still VARCHAR and encoding="utf-8" is still in the XML declaration!

    5. If keeping the data as VARCHAR, then the following change of just the encoding= value works:

      DECLARE @poit VARCHAR(100);
      SET @poit = '<?xml version="1.0" encoding="Windows-1252"?><test>VÍA</test>';
      SELECT CONVERT(XML, @poit);
      

      This assumes that the source encoding really was "Windows-1252", which is Microsoft's extended version of Latin-1 (ISO-8859-1) and the Code Page used by the Latin1_General collations.

      BUT, again, there is no need to even specify the "encoding" if it is the same as the Code Page of the current database's default collation, as that is assumed for any VARCHAR data.

    6. Finally, XML data in SQL Server is UTF-16LE, same as NCHAR and NVARCHAR (and NTEXT, but nobody should be using that anymore).
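
    If you want to see the byte values from the points above for yourself, here is a minimal sketch. It assumes the current database's default collation uses Code Page 1252 (e.g. one of the Latin1_General collations); under a different Code Page, or a UTF-8 collation, the VARCHAR bytes will differ. The column aliases are just illustrative names.

    ```sql
    -- Assumption: the database default collation uses Code Page 1252.
    SELECT CONVERT(VARBINARY(4), 'Í')  AS [VarcharBytes],   -- 0xCD   (one byte, Code Page 1252)
           CONVERT(VARBINARY(4), N'Í') AS [NvarcharBytes],  -- 0xCD00 (two bytes, UTF-16LE)
           CONVERT(VARBINARY(4), CHAR(195) + CHAR(141))
                                       AS [Utf8Bytes];      -- 0xC38D (the UTF-8 sequence for Í)
    ```

    The first column is the single byte that the XML conversion is given when the literal is VARCHAR, and the third is what it would need to be given for encoding="utf-8" to be truthful.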

    CONCLUSION

    1. Use a datatype of NVARCHAR(MAX) when working with XML as strings (not VARCHAR).

    2. For strings that do not have any altered characters (i.e. everything looks perfect on screen), simply remove the encoding="utf-8" as you are doing. There is no need to replace it with UTF-16, as that is assumed by the very nature of the value being in an NVARCHAR variable or literal (i.e. a string prefixed with a capital-N).
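
    Putting those two points together might look something like the sketch below. The sample document is the same one used throughout this answer. The REPLACE pattern assumes the attribute is written exactly as encoding="utf-8", preceded by a single space; whether the match is case-sensitive depends on your collation, so adjust it to fit your actual data.

    ```sql
    -- NVARCHAR variable plus an N-prefixed literal keeps the string as UTF-16LE.
    DECLARE @poit NVARCHAR(MAX);
    SET @poit = N'<?xml version="1.0" encoding="utf-8"?>';

    -- Remove only the encoding attribute; the rest of the declaration can stay.
    SELECT CONVERT(XML, REPLACE(@poit, N' encoding="utf-8"', N'')) AS [ParsedXml];
    ```

    The XML type does not keep the declaration, so the result should be just the test element with the Í intact.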


    Regarding the use of VARCHAR(MAX) instead of XML or even NVARCHAR(MAX) in order to save space, please keep in mind that the XML datatype is internally optimized such that element and attribute names only get stored once, in a dictionary, and hence do not have nearly as much overhead as the fully written out string version of the XML. So while the XML type does store strings as UTF-16LE, if the XML document has a lot of repeated element and/or attribute names, then using the XML type might actually result in a smaller footprint than using VARCHAR(MAX):

    DECLARE @ElementBased XML;
    SET @ElementBased = (
                         SELECT * FROM master.sys.all_columns FOR XML PATH('Row')
                        );
    
    DECLARE @AttributeBased XML;
    SET @AttributeBased = (
                           SELECT * FROM master.sys.all_columns FOR XML RAW('Row')
                          );
    
    SELECT @ElementBased AS [ElementBasedXML],
           @AttributeBased AS [AttributeBasedXML],
    
           DATALENGTH(@ElementBased) AS [ElementBasedXmlBytes],
           DATALENGTH(CONVERT(VARCHAR(MAX), @ElementBased)) AS [ElementBasedVarCharBytes],
           ((DATALENGTH(@ElementBased) * 1.0) / DATALENGTH(CONVERT(VARCHAR(MAX), @ElementBased))
                   ) * 100 AS [XmlElementSizeRelativeToVarcharElementSize],
    
           DATALENGTH(@AttributeBased) AS [AttributeBasedXmlBytes],
           DATALENGTH(CONVERT(VARCHAR(MAX), @AttributeBased)) AS [AttributeBasedVarCharBytes],
           ((DATALENGTH(@AttributeBased) * 1.0) /
             DATALENGTH(CONVERT(VARCHAR(MAX), @AttributeBased))) * 100
                   AS [XmlAttributeSizeRelativeToVarCharAttributeSize];
    

    Returns (on my system, at least):

    ElementBasedXmlBytes                              1717896
    ElementBasedVarCharBytes                          5889081
    XmlElementSizeRelativeToVarcharElementSize        29.170867237180130482100
    
    AttributeBasedXmlBytes                            1544661
    AttributeBasedVarCharBytes                        3461864
    XmlAttributeSizeRelativeToVarCharAttributeSize    44.619343798600984902900
    

    As you can see, for element-based XML, the XML datatype is about 29% of the size of the VARCHAR(MAX) version, and for the attribute-based XML, it is about 45% of the size of the VARCHAR(MAX) version.
