What are invalid characters in XML

时光说笑 2020-11-22 03:23

I am working with some XML that holds strings like:

This is a string

Some of the strings that I am passing to the

15 answers
  • 2020-11-22 04:07

    OK, let's separate the question of the characters that:

    1. aren't valid at all in any XML document.
    2. need to be escaped.

    The answer provided by @dolmen in "What are invalid characters in XML" is still valid but needs to be updated with the XML 1.1 specification.

    1. Invalid characters

    The ranges given here list every character that is allowed in an XML document; anything outside them is invalid.

    1.1. In XML 1.0

    • Reference: see XML recommendation 1.0, §2.2 Characters

    The global list of allowed characters is:

    [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

    Basically, control characters (other than tab, line feed, and carriage return) and code points outside these Unicode ranges are not allowed. This also means that a character reference to such a character, for example &#x3;, is forbidden.
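
    As a rough illustration of the Char production above, here is a minimal Python sketch that checks a string against the XML 1.0 ranges and strips anything outside them. The names `is_valid_xml10` and `strip_invalid_xml10` are made up for this example and come from no library.

```python
import re

# Complement of the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_valid_xml10(text: str) -> bool:
    """Return True if every character of text matches the Char production."""
    return _INVALID_XML10.search(text) is None


def strip_invalid_xml10(text: str) -> str:
    """Return text with every character outside the Char production removed."""
    return _INVALID_XML10.sub("", text)
```

    Note that this only covers which characters may appear at all; the result still has to be escaped (&, <, and so on) before it is placed in a document.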

    1.2. In XML 1.1

    • Reference: see XML recommendation 1.1, §2.2 Characters, and 1.3 Rationale and list of changes for XML 1.1

    The global list of allowed characters is:

    [2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

    [2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

    This revision of the XML recommendation extends the set of allowed characters to include most control characters and takes a newer revision of the Unicode standard into account. Some characters are still not allowed: NUL (x00), the surrogate blocks, xFFFE, and xFFFF. The characters matched by RestrictedChar may appear only as character references, not literally.

    However, the use of control characters and of undefined Unicode characters is discouraged.

    Note also that not all parsers take this into account, so XML documents containing control characters may be rejected.
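
    To make the two XML 1.1 productions easier to compare, here is a small Python sketch that classifies a code point as allowed, restricted, or invalid. The function name `classify_xml11` and its three labels are invented for this example; the ranges are copied from productions [2] and [2a] above.

```python
def classify_xml11(cp: int) -> str:
    """Classify a code point against XML 1.1 productions [2] and [2a]."""
    # Production [2] Char
    is_char = (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )
    if not is_char:
        return "invalid"
    # Production [2a] RestrictedChar
    is_restricted = (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )
    return "restricted" if is_restricted else "allowed"
```

    For example, `classify_xml11(0x3)` gives "restricted", while `classify_xml11(0x0)` gives "invalid" because NUL is excluded from Char in both XML versions.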

    2. Characters that need to be escaped (to obtain a well-formed document):

    The < must be escaped with a &lt; entity, since it is assumed to be the beginning of a tag.

    The & must be escaped with a &amp; entity, since it is assumed to be the beginning of an entity reference.

    The > should be escaped with a &gt; entity. It is not mandatory -- it depends on the context -- but it is strongly advised to escape it.

    The ' should be escaped with a &apos; entity -- mandatory in attributes defined within single quotes but it is strongly advised to always escape it.

    The " should be escaped with a &quot; entity -- mandatory in attributes defined within double quotes but it is strongly advised to always escape it.
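
    The five escapes above can be sketched in Python with the standard library's `xml.sax.saxutils.escape`, which handles &, < and > by default and accepts extra replacements for the two quote characters. The wrapper name `escape_xml` is made up for this example.

```python
from xml.sax.saxutils import escape


def escape_xml(value: str) -> str:
    """Escape &, <, > and both quote characters with the predefined entities."""
    # escape() replaces & first, so the ampersands introduced by the
    # quote entities below are not escaped a second time.
    return escape(value, {'"': "&quot;", "'": "&apos;"})
```

    Escaping all five characters everywhere is more than the specification strictly requires, but it lets the same function be used for element content and for attribute values with either kind of quote.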

  • 2020-11-22 04:07

    In summary, valid characters in the text are:

    • tab, line-feed and carriage-return.
    • all non-control characters are valid except & and <.
    • > is not valid if following ]].

    Sections 2.2 and 2.4 of the XML specification provide the answer in detail:

    Characters

    Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646

    Character data

    The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
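
    One way to see these rules in action is to feed small documents to a parser. The sketch below uses Python's `xml.etree.ElementTree`; the helper name `is_well_formed` is invented for this example, and other parsers may word their errors differently.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Literal & and < in character data are rejected ...
raw_ampersand = is_well_formed("<r>a & b</r>")
raw_less_than = is_well_formed("<r>a < b</r>")
# ... but are accepted when escaped, or inside a CDATA section or a comment.
escaped = is_well_formed("<r>a &amp; b &lt; c</r>")
in_cdata = is_well_formed("<r><![CDATA[a & b < c]]></r>")
in_comment = is_well_formed("<r><!-- a & b < c --></r>")
```

    The sketch leaves out the "]]>" rule, which can be tested the same way.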

  • 2020-11-22 04:09

    Has anyone tried System.Security.SecurityElement.Escape(yourstring)? It replaces the characters that must be escaped in XML (<, >, &, " and ') with their entity equivalents.

  • 2020-11-22 04:18

    This C# code removes the invalid XML characters from a string and returns a new, valid string.

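    // Requires: using System.Text.RegularExpressions;
    public static string CleanInvalidXmlChars(string text)
    {
        // From the XML 1.0 spec, valid chars are:
        // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        // i.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
        // .NET strings are UTF-16, so [#x10000-#x10FFFF] shows up as surrogate pairs
        // (\u10000 would be read as \u1000 followed by '0'). The first alternative
        // keeps well-formed pairs; the second removes every other invalid char.
        string re = @"([\uD800-\uDBFF][\uDC00-\uDFFF])|[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]";
        return Regex.Replace(text, re, "$1");
    }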
    
  • 2020-11-22 04:18

    For Java folks, Apache Commons Lang has a utility class (StringEscapeUtils) with a helper method, escapeXml, that escapes the characters in a string using XML entities.

  • 2020-11-22 04:23

    The characters that have predeclared entities are:

    & < > " '
    

    See "What are the special characters in XML?" for more information.
