IDE: Embarcadero XE5 c++ builder.
I'm trying to dump UnicodeStrings into XML CDATA sections.
Small extract of such a string:
For my situation I created a function to trim a string to just the set of valid XML Characters.
Pseudocode:
// Code released into public domain. No attribution required.
function TrimToXmlText(const xmlText: string): string;
var
  i, o: Integer;
begin
  {
    http://www.w3.org/TR/xml/#NT-Char
    Regardless of entity encoding, the only valid characters allowed are:
      Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    I.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
    This means that a string such as
      "Line one"#31#10"Line two"
    is invalid (because of the #31 aka 0x1F).
    This means we need to manually strip them out, because the xml library certainly won't do it for us.
    Note: this checks individual UTF-16 code units, so surrogate pairs
    (characters in #x10000-#x10FFFF) are stripped as well.
  }
  // Allocate for the worst case (nothing removed), then shrink at the end.
  SetLength(Result, Length(xmlText));
  o := 0;
  for i := 1 to Length(xmlText) do
  begin
    case Ord(xmlText[i]) of
      $9, $A, $D,
      $20..$D7FF,
      $E000..$FFFD:
        begin
          Inc(o);
          Result[o] := xmlText[i];
        end;
    end;
  end;
  SetLength(Result, o);
end;
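Since the question is about C++ Builder, here is a rough standard C++ sketch of the same filter. It is only an illustration under a few assumptions: the text is held in a std::u16string (UTF-16 code units, like UnicodeString), and the names isXmlBmpChar and trimToXmlText are made up for this example. Unlike the code above, it keeps well-formed surrogate pairs, since those encode the #x10000-#x10FFFF range that the Char production allows.

```cpp
#include <cstddef>
#include <string>

// True if a single UTF-16 code unit is, on its own, a legal XML 1.0 Char.
static bool isXmlBmpChar(char16_t c) {
    return c == 0x9 || c == 0xA || c == 0xD ||
           (c >= 0x20 && c <= 0xD7FF) ||
           (c >= 0xE000 && c <= 0xFFFD);
}

// Hypothetical helper: returns a copy of `text` with every code unit that is
// not part of a legal XML 1.0 character removed.
std::u16string trimToXmlText(const std::u16string& text) {
    std::u16string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char16_t c = text[i];
        if (isXmlBmpChar(c)) {
            out.push_back(c);
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < text.size() &&
                   text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
            // High surrogate followed by a low surrogate: a code point in
            // #x10000-#x10FFFF, which is allowed, so keep both units.
            out.push_back(c);
            out.push_back(text[++i]);
        }
        // Anything else (control characters, lone surrogates, FFFE, FFFF)
        // is dropped.
    }
    return out;
}
```

Calling trimToXmlText on "Line one", 0x1F, 0x0A, "Line two" should give back the same text with only the 0x1F removed.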
It turns out the problem was indeed all the control characters present in the raw data string, as suspected.
I solved it by Base64-encoding the entire string before creating the XML CDATA sections.
RAD Studio functions: EncodeBase64, DecodeBase64
Header: Soap.EncdDecd.hpp
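For anyone without the RAD Studio units at hand, this is roughly what the encoding step does. It is a minimal portable sketch, not the EncodeBase64 implementation from Soap.EncdDecd; the function name base64Encode and the choice of std::string as a byte container are assumptions made for the example. It uses the standard Base64 alphabet with '=' padding.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not the RAD Studio API): Base64-encodes raw bytes
// using the standard alphabet with '=' padding.
std::string base64Encode(const std::string& bytes) {
    static const char kAlphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((bytes.size() + 2) / 3) * 4);

    for (std::size_t i = 0; i < bytes.size(); i += 3) {
        const std::size_t remaining = bytes.size() - i;
        // Pack up to three bytes into a 24-bit group, zero-filling any
        // bytes that are missing at the end of the input.
        std::uint32_t group = static_cast<unsigned char>(bytes[i]) << 16;
        if (remaining > 1) group |= static_cast<unsigned char>(bytes[i + 1]) << 8;
        if (remaining > 2) group |= static_cast<unsigned char>(bytes[i + 2]);

        // Emit four 6-bit symbols, replacing unused ones with padding.
        out += kAlphabet[(group >> 18) & 0x3F];
        out += kAlphabet[(group >> 12) & 0x3F];
        out += remaining > 1 ? kAlphabet[(group >> 6) & 0x3F] : '=';
        out += remaining > 2 ? kAlphabet[group & 0x3F] : '=';
    }
    return out;
}
```

The output only ever contains letters, digits, '+', '/' and '=', so it cannot contain an illegal XML character or the "]]>" terminator, whatever bytes went in.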
If you read Section 2.7 of the XML specification, it describes the format of a CDATA section:
CDATA Sections
[18] CDSect ::= CDStart CData CDEnd
[19] CDStart ::= '<![CDATA['
[20] CData ::= (Char* - (Char* ']]>' Char*))
[21] CDEnd ::= ']]>'
Char is defined in Section 2.2:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
If you look at your raw data, it contains over a dozen character values that are excluded from that range (specifically #x0, #x1, #x2, #x4, #x5, #x6, #x8, #xB, #xE, #x18, #x19, #x1A, and #x1C). That is why you are getting errors about illegal characters: you really do have illegal characters.
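If you want to check a string yourself before writing it out, something along these lines would list the offending values. This is just a sketch in standard C++: the name findIllegalXmlUnits and the use of std::u16string are assumptions for the example, and for simplicity it treats every surrogate code unit as acceptable without checking that surrogates are correctly paired.

```cpp
#include <set>
#include <string>

// Hypothetical diagnostic helper: returns the distinct UTF-16 code unit
// values in `text` that fall outside the XML 1.0 Char production.
// Simplification: surrogate code units are accepted without validating
// that they form proper high/low pairs.
std::set<unsigned> findIllegalXmlUnits(const std::u16string& text) {
    std::set<unsigned> bad;
    for (char16_t c : text) {
        const bool ok = c == 0x9 || c == 0xA || c == 0xD ||
                        (c >= 0x20 && c <= 0xD7FF) ||
                        (c >= 0xD800 && c <= 0xDFFF) ||  // surrogates, see note above
                        (c >= 0xE000 && c <= 0xFFFD);
        if (!ok) {
            bad.insert(static_cast<unsigned>(c));
        }
    }
    return bad;
}
```

An empty result means every code unit is individually acceptable; anything in the set has to be stripped or the data encoded before it goes into the document.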
A CDATA section does not give you permission to put arbitrary binary data into an XML document. A CDATA section is meant to be used when text content contains characters that are normally reserved for XML markup, so that they do not have to be escaped or encoded as entities. The only way to put binary data into an XML document is to encode it in an XML-compatible (typically 7-bit ASCII) format, such as Base64 (but there are other formats available that you can use, such as yEnc).
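For legitimate text, the one sequence the CData production rules out is "]]>". A common workaround is to split the text across two adjacent CDATA sections wherever that sequence occurs. The following is a small sketch of that idea in standard C++; the name wrapInCData is made up for the example, and it assumes the input has already been reduced to legal XML characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: wraps `text` in a CDATA section. Any "]]>" inside the
// text is split so that "]]" ends one section and ">" starts the next.
// Assumes `text` contains only legal XML characters.
std::string wrapInCData(const std::string& text) {
    static const std::string terminator = "]]>";
    std::string out = "<![CDATA[";
    std::size_t pos = 0;
    for (;;) {
        const std::size_t hit = text.find(terminator, pos);
        if (hit == std::string::npos) {
            // No further terminator: copy the rest of the text as-is.
            out.append(text, pos, std::string::npos);
            break;
        }
        // Copy up to and including "]]", close the section, and reopen a
        // new one so that the ">" lands at its start.
        out.append(text, pos, hit - pos + 2);
        out += "]]><![CDATA[";
        pos = hit + 2;
    }
    out += "]]>";
    return out;
}
```

For example, wrapInCData("a]]>b") produces <![CDATA[a]]]]><![CDATA[>b]]>, which a parser reads back as the original a]]>b.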