SQL Server - defining an XML type column with UTF-8 encoding

Asked by 忘掉有多难 — 2020-11-30 15:29

The default encoding for an XML type field defined in SQL Server is UTF-16. I have no trouble inserting UTF-16 encoded XML streams into that field.

But if I t

4 Answers
  • 2020-11-30 15:40

    As you correctly found out, XML will be stored as Unicode (UTF-16). There is no other format.

    Within SQL Server there is VARCHAR(MAX) for extended ASCII (1 byte per character) and NVARCHAR(MAX) for UTF-16. Both can be cast to XML directly (as long as the string is valid XML). Be aware that VARCHAR(MAX) might not be able to deal with special characters... So - if this is an issue - you should stick with Unicode anyway.
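As a byte-level cross-check (done here in Python, outside SQL Server): Latin-1 stands in for a 1-byte VARCHAR collation code page, and UTF-16-LE for NVARCHAR storage.

```python
text = "test"

# VARCHAR-style storage: one byte per character (extended ASCII)
one_byte = text.encode("latin-1")
# NVARCHAR/XML-style storage: UTF-16, two bytes per BMP character
two_byte = text.encode("utf-16-le")

assert len(one_byte) == 4
assert len(two_byte) == 8
print(two_byte)  # b't\x00e\x00s\x00t\x00'
```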

    The problem occurs when an encoding declaration is included within <?xml ...?>:

    This works:

    DECLARE @xml XML =
    '<?xml version="1.0" encoding="utf-8"?>
     <root>test</root>';
    
    SELECT @xml;
    

    This produces an error:

    DECLARE @xml XML =
    '<?xml version="1.0" encoding="utf-16"?>
     <root>test</root>';
    
    SELECT @xml;
    

    But this works again (see the leading N before the string literal):

    DECLARE @xml XML =
    N'<?xml version="1.0" encoding="utf-16"?>
     <root>test</root>';
    
    SELECT @xml;
    

    Conclusion

    If you pass a string that is 1-byte encoded but declared as utf-16 (or vice versa), you will run into trouble. Best is to pass your XML without the <?xml ...?> declaration.
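This declared-versus-actual mismatch is not specific to SQL Server; a strict XML parser behaves the same way. A small sketch using Python's expat-based ElementTree (an outside illustration, not SQL Server itself):

```python
import xml.etree.ElementTree as ET

# Declaration matches the actual bytes: parses fine.
ok = ET.fromstring(
    '<?xml version="1.0" encoding="utf-8"?><root>test</root>'.encode("utf-8"))
assert ok.text == "test"

# UTF-8 bytes that *claim* to be UTF-16: rejected,
# just like the second T-SQL example above.
try:
    ET.fromstring(
        '<?xml version="1.0" encoding="utf-16"?><root>test</root>'.encode("utf-8"))
    raised = False
except ET.ParseError:
    raised = True
assert raised
```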

    UPDATE

    You are mixing up two things:

    Encoding

    From your comment:

    UTF-8 is flexi-length unicode, that varies from 1 byte to 4 bytes in length. Whereas, UTF-16 is fixed length 2 byte unicode. UTF-8 seems the defacto unicode std now...

    Yes, it is correct that UTF-8 and UTF-16 are two flavours of Unicode. But it is not correct to call UTF-8 the new de facto standard. That depends heavily on your needs. If you live in an English-speaking country and deal with plain Latin text, UTF-8 will save you some bytes. If you live somewhere in the Far East, it will bloat your text considerably, because many characters need 3- or 4-byte codes.

    And - this is more important in terms of databases - a fixed width is enormously easier to handle. Just imagine a WHERE SUBSTRING(SomeUTF8Column,100,1)='A'. With UTF-16 the engine can cut out bytes 199 and 200 without looking; with UTF-8 the full string up to character 100 must be scanned to find out where the 100th character actually sits. (Strictly speaking, UTF-16 is fixed-width only for BMP characters; supplementary characters take a 4-byte surrogate pair.) I would prefer UTF-8 only in cases where bandwidth or storage space is the decisive factor. SQL Server actually uses a fixed-width 1-byte encoding rather than UTF-8: extended ASCII in combination with a collation.
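The indexing argument can be illustrated outside SQL Server (a Python sketch; for BMP characters, UTF-16 is a fixed 2 bytes each, while UTF-8 varies between 1 and 3 bytes):

```python
latin = "a" * 100
cjk = "漢" * 100  # CJK: 3 bytes per character in UTF-8

# UTF-8: the byte length (and hence the byte offset of any given
# character) depends on the content...
assert len(latin.encode("utf-8")) == 100
assert len(cjk.encode("utf-8")) == 300

# ...while UTF-16-LE is a fixed 2 bytes per BMP character either way.
assert len(latin.encode("utf-16-le")) == 200
assert len(cjk.encode("utf-16-le")) == 200
```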

    From your comment:

    I've had no problems inserting utf-8 encoded streams with utf-8 header into SQL Server 2013 NVARCHAR column

    And - this is even more important in terms of XML - XML is not stored as the text you see, but as a hierarchy tree. You can store literally anything in (N)VARCHAR:

    DECLARE @s VARCHAR(MAX)='Don''t store me, I''m UTF-16. Your machine will explode!';
    

    This works with any combination. You can declare NVARCHAR and/or put an N in front of the literal. No problem due to implicit conversions.

    But plain VARCHAR cannot deal with characters beyond its code page! Try this:

     DECLARE @s NVARCHAR(MAX)=N'слов в тексте';
     SELECT @s
    

    This will work with NVARCHAR and N'Your string' only!
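The same distinction can be reproduced byte-wise (a Python sketch; cp1252 stands in here for a typical Latin VARCHAR collation code page):

```python
s = "слов в тексте"

# UTF-16 (NVARCHAR-style) round-trips the Cyrillic text losslessly...
assert s.encode("utf-16-le").decode("utf-16-le") == s

# ...but a 1-byte Latin code page simply has no slots for Cyrillic.
try:
    s.encode("cp1252")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```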

    XML-storage

    As said before, XML is not stored as the text you see, but as a tree. Everything is optimized for performance, hence the fixed-width UTF-16. The XML declaration is omitted in any case...

    The problem occurs, when you pass in a string which is physically encoded as utf-8 but declared as something else (or vice versa). You can pass in a real UTF-16 with a declared encoding of utf-16 (same with utf-8) without problems.

    Conclusion

    If you have the slightest chance to include 3 or 4 byte UTF-8 codes you should stick to UTF-16.

  • 2020-11-30 15:43

    Is there a way to define a SQL Server column/field as having UTF-8 encoding?

    No, the only Unicode encoding in SQL Server is UTF-16 Little Endian, which is how the NCHAR, NVARCHAR, NTEXT (deprecated as of SQL Server 2005, so don't use it in new development; besides, it sucks compared to NVARCHAR(MAX) anyway), and XML datatypes are handled. You do not get a choice of Unicode encodings like some other RDBMSs allow.

    You can insert UTF-8 encoded XML into SQL Server, provided you follow these three rules:

    1. The incoming string has to be of datatype VARCHAR, not NVARCHAR (as NVARCHAR is always UTF-16 Little Endian, hence the error about not being able to switch the encoding).
    2. The XML has an XML declaration that explicitly states that the encoding of the XML is indeed UTF-8: <?xml version="1.0" encoding="UTF-8" ?>.
    3. The byte sequence needs to be the actual UTF-8 bytes.

    For example, we can import a UTF-8 encoded XML document containing the screaming face emoji (and we can get the UTF-8 byte sequence for that Supplementary Character by following that link):

    SET NOCOUNT ON;
    DECLARE @XML XML = '<?xml version="1.0" encoding="utf-8"?><root><test>'
                        + CHAR(0xF0) + CHAR(0x9F) + CHAR(0x98) + CHAR(0xB1)
                        + '</test></root>';
    
    SELECT @XML;
    PRINT CONVERT(NVARCHAR(MAX), @XML);
    

    Returns (in both "Results" and "Messages" tabs):

    <root><test>😱</test></root>
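The byte sequence used above can be double-checked in Python (an outside cross-check, not T-SQL): encoding the screaming face emoji U+1F631 as UTF-8 yields exactly those four bytes, and as a Supplementary Character it also needs four bytes (a surrogate pair) in UTF-16/NVARCHAR:

```python
emoji = "\U0001F631"  # screaming face

# The four CHAR(...) values concatenated in the T-SQL example above:
assert emoji.encode("utf-8") == bytes([0xF0, 0x9F, 0x98, 0xB1])

# Supplementary Character: a UTF-16 surrogate pair, i.e. 4 bytes.
assert len(emoji.encode("utf-16-le")) == 4
```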
  • 2020-11-30 15:45

    The "Type Casting String and Binary Instances" section of the MSDN document

    Create Instances of XML Data

    explains how incoming XML data is interpreted. Essentially,

    • if the SQL Server receives the XML data as nvarchar then it "assumes a two-byte unicode encoding such as UTF-16 or UCS-2",

    • if the SQL Server receives the XML data as varchar then by default it will use the (single-byte character set) code page defined for the SQL Server instance,

    • if the SQL Server receives the XML data as varbinary then it "is treated as a codepoint stream that is passed directly to the XML parser", and "an instance without BOM and without a declaration encoding will be interpreted as UTF-8".

    If your marshalling code is spitting out a Java String to be sent to the SQL Server then it is very likely being sent as nvarchar since a Java String is always a Unicode string. That would explain why the SQL Server assumes UTF-16 encoding.

    If you really need to send the XML data to the SQL Server with UTF-8 encoding (though I can't imagine why) then your marshalling code probably needs to produce a stream of (UTF-8 encoded) bytes that will be sent to the SQL Server as varbinary.
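A sketch of that marshalling step (in Python rather than Java, purely for illustration): the point is to hand the driver a byte stream, not a string, since a string would be bound as nvarchar.

```python
xml_text = '<?xml version="1.0" encoding="utf-8"?><root>test</root>'

# Encode to UTF-8 *bytes* before sending; these bytes would then be
# bound as a varbinary parameter rather than a (UTF-16) string.
payload = xml_text.encode("utf-8")

assert isinstance(payload, bytes)
assert payload.decode("utf-8") == xml_text
```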

  • 2020-11-30 15:49

    A two-step conversion works: first convert your UTF-8 string to text (deprecated) or varchar(MAX), and then to xml.

    convert(xml, convert(text, '<your UTF-8 xml>'))
    