I have a table written to by an application. The field is varchar(max). The data looks like XML.
DECLARE @poit VARCHAR(100)
SET @poit = '
There are a few things going on here:
The encoding= attribute of the <?xml ?> element is used to denote how the underlying bytes of the XML document are to be interpreted. If the document within the string literal is correct, then there is no need to have the encoding attribute. If there are characters that are incorrect, then the encoding attribute can remain, as it will inform the XML conversion of what those characters were originally.
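For instance, when the bytes already match what the parser expects, the conversion succeeds whether or not the attribute is present (a purely illustrative pair, assuming the current database's default collation uses Code Page 1252):

SELECT CONVERT(XML, '<?xml version="1.0"?><Row />') AS [NoEncodingAttribute],
       CONVERT(XML, '<?xml version="1.0" encoding="Windows-1252"?><Row />') AS [ExplicitEncodingAttribute];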
UTF-8 is a Unicode encoding, yet you have the variable and literal as VARCHAR data, not NVARCHAR (which also requires prefixing the string literal with a capital-N). By using VARCHAR and no N-prefix, if there were any characters in the XML document that couldn't fit into the Code Page denoted by the default Collation of whatever database you are in when executing this query, you would have already lost those characters (even if you can see them on screen, they wouldn't be correct in the VARCHAR variable, or in a simple SELECT of that literal).
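To see what "lost" means here (a minimal sketch, not from your data, and assuming the current database's default collation is a Windows-1252 based one such as Latin1_General_*), take a character that has no slot in that Code Page, e.g. "ě" (Latin Small Letter E with Caron):

DECLARE @NoPrefix   VARCHAR(10)  = 'ě',    -- forced into the Code Page: best-fit substituted
        @WithPrefix NVARCHAR(10) = N'ě';   -- stored as UTF-16LE: survives intact

SELECT @NoPrefix   AS [VarcharValue],      -- typically comes back as a plain 'e'
       @WithPrefix AS [NvarcharValue];     -- ě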
Windows (and .NET, SQL Server, etc.) use UTF-16 Little Endian. The Í character, Latin Capital Letter I with Acute, exists in both Code Page 1252 and UTF-16LE as value 205 (e.g. SELECT ASCII('Í'), CHAR(205);), which is why it works when you remove the encoding="utf-8" and why you didn't "lose" that character by placing it in a VARCHAR literal and variable. HOWEVER, as shown on that linked page, the byte sequence in the UTF-8 encoding is 195, 141 (yes, two bytes). Meaning, that character, if it truly was UTF-8 encoded, would not appear to be that character when placed into a UTF-16LE environment.
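You can confirm those values directly (a small sketch; the ASCII() / CHAR() results assume the current database's default collation uses Code Page 1252):

SELECT ASCII('Í')    AS [Cp1252Value],    -- 205 (a single byte in the VARCHAR world)
       UNICODE(N'Í') AS [Utf16CodeUnit],  -- 205 (U+00CD in UTF-16LE)
       CHAR(205)     AS [Cp1252Char];     -- Í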
The XML conversion looks at that character's byte value of 205 (a single byte, since it is currently VARCHAR data) and tries to provide the UTF-16LE equivalent of what that sequence means in UTF-8. Except that 205 by itself doesn't exist in UTF-8, so the parser pulls in the next character as well: a capital "A", which has a value of 65. While there are two-byte sequences in UTF-8, none of them is 205, 65. This is why you get the illegal xml character error.
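That failure is easy to reproduce (a sketch of the same situation, not your exact document): byte 205 (0xCD) is a UTF-8 lead byte that must be followed by a continuation byte in the range 0x80 through 0xBF, and "A" (65 / 0x41) is not one:

DECLARE @bad VARCHAR(100);
SET @bad = '<?xml version="1.0" encoding="utf-8"?><V' + CHAR(205) + 'A />';
SELECT CONVERT(XML, @bad);  -- fails with the "illegal xml character" parsing error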
Since the text on screen has to be UTF-16LE, if the source really was UTF-8, then the underlying UTF-8 byte sequence would have to be converted into UTF-16LE. The underlying UTF-8 byte sequence of Í is 195, 141. So we can create that sequence out of regular Code Page 1252 characters (since this is, again, currently VARCHAR data) by doing the following:
DECLARE @poit VARCHAR(100);
SET @poit = '<?xml version="1.0" encoding="utf-8"?><V'
            + CHAR(195) + CHAR(141) + 'A />';
SELECT CONVERT(XML, @poit);
Returns:
<VÍA />
Data is still VARCHAR and encoding="utf-8" is still in the <?xml ?> element!
If keeping the data as VARCHAR, then the following change of just the encoding= value works:
DECLARE @poit VARCHAR(100);
SET @poit = '<?xml version="1.0" encoding="Windows-1252"?><VÍA />';
SELECT CONVERT(XML, @poit);
This assumes that the source encoding really was "Windows-1252", which is Microsoft's variation of Latin-1 (ISO-8859-1) and is the Code Page used by the Latin1_General collations.
BUT, there is again no need to even specify the "encoding" if it is the same as the Code Page of the current database's default collation, as that is assumed for any VARCHAR data.
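If you are not sure which Code Page that is, you can ask for the CodePage property of the current database's default collation (for the Latin1_General_* collations this returns 1252, i.e. Windows-1252):

SELECT CONVERT(NVARCHAR(128), DATABASEPROPERTYEX(DB_NAME(), 'Collation')) AS [DbCollation],
       COLLATIONPROPERTY(CONVERT(NVARCHAR(128), DATABASEPROPERTYEX(DB_NAME(), 'Collation')),
                         'CodePage') AS [DbCodePage];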
Finally, XML data in SQL Server is UTF-16LE, same as NCHAR and NVARCHAR (and NTEXT, but nobody should be using that anymore).
Use a datatype of NVARCHAR(MAX) when working with XML as strings (not VARCHAR).
For strings that do not have any altered characters (i.e. everything looks perfect on screen), simply remove the encoding="utf-8" as you are doing. There is no need to replace it with UTF-16, as that is assumed by the very nature of the value being in an NVARCHAR variable or literal (i.e. a string prefixed with a capital-N).
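For example (a minimal sketch along the lines of your document), once the value is an N-prefixed literal in an NVARCHAR variable, the declaration needs no encoding attribute at all:

DECLARE @poit NVARCHAR(100);
SET @poit = N'<?xml version="1.0"?><VÍA />';  -- no encoding attribute; UTF-16 is assumed
SELECT CONVERT(XML, @poit);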
Regarding the use of VARCHAR(MAX) instead of XML or even NVARCHAR(MAX) in order to save space, please keep in mind that the XML datatype is internally optimized such that element and attribute names only get stored once, in a dictionary, and hence do not have nearly as much overhead as the fully written-out string version of the XML. So while the XML type does store strings as UTF-16LE, if the XML document has a lot of repeated element and/or attribute names, then using the XML type might actually result in a smaller footprint than using VARCHAR(MAX):
DECLARE @ElementBased XML;
SET @ElementBased = (
    SELECT * FROM master.sys.all_columns FOR XML PATH('Row')
);

DECLARE @AttributeBased XML;
SET @AttributeBased = (
    SELECT * FROM master.sys.all_columns FOR XML RAW('Row')
);

SELECT @ElementBased AS [ElementBasedXML],
       @AttributeBased AS [AttributeBasedXML],
       DATALENGTH(@ElementBased) AS [ElementBasedXmlBytes],
       DATALENGTH(CONVERT(VARCHAR(MAX), @ElementBased)) AS [ElementBasedVarCharBytes],
       ((DATALENGTH(@ElementBased) * 1.0)
         / DATALENGTH(CONVERT(VARCHAR(MAX), @ElementBased))) * 100
           AS [XmlElementSizeRelativeToVarcharElementSize],
       DATALENGTH(@AttributeBased) AS [AttributeBasedXmlBytes],
       DATALENGTH(CONVERT(VARCHAR(MAX), @AttributeBased)) AS [AttributeBasedVarCharBytes],
       ((DATALENGTH(@AttributeBased) * 1.0)
         / DATALENGTH(CONVERT(VARCHAR(MAX), @AttributeBased))) * 100
           AS [XmlAttributeSizeRelativeToVarCharAttributeSize];
Returns (on my system, at least):
ElementBasedXmlBytes 1717896
ElementBasedVarCharBytes 5889081
XmlElementSizeRelativeToVarcharElementSize 29.170867237180130482100
AttributeBasedXmlBytes 1544661
AttributeBasedVarCharBytes 3461864
XmlAttributeSizeRelativeToVarCharAttributeSize 44.619343798600984902900
As you can see, for the element-based XML, the XML datatype is 29% the size of the VARCHAR(MAX) version, and for the attribute-based XML, the XML datatype is 44% the size of the VARCHAR(MAX) version.