Why is MarkLogic Definitely Storing Invalid Characters in XML Documents?

Question

I have found that it is possible to store invalid XML characters in XML documents in the MarkLogic database. This causes problems when I try to update the text in a document and the update requires quoting and unquoting the XML data.

I now have example code that proves that invalid data can be stored. You can run it from Query Console, and you will get an error when it tries to unquote the quoted string, because that string, which was produced from the XML stored in the database, contains the serialized form of invalid character 14.

let $Doc := <TEST>Here is invalid character 14: {fn:codepoints-to-string((14))}</TEST>
return
  xdmp:document-insert("/Test.xml", $Doc)

;

let $Quoted := xdmp:quote(/TEST)
let $Unquoted := xdmp:unquote($Quoted)
return
  $Unquoted

Answer 1:

MarkLogic is a document database, not just an XML database, so it makes no assumptions about the data you are inserting, even if the document URI has an .xml extension or you are inserting a node into an existing XML document.

This also means that it will accept XML with invalid characters. xdmp:node-insert-child() can be used with both XML and JSON, so it is up to you either to clean up or validate the data on ingest, or to handle exceptions on retrieval.
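As a minimal sketch of the "clean up on ingest" option, the query below drops every code point that falls outside the XML 1.0 Char production before the document is inserted. The function name local:strip-invalid is hypothetical and introduced only for illustration; the URI and element name are taken from the question.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper (not a MarkLogic built-in): keeps only the code points
   allowed by the XML 1.0 Char production:
   #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF :)
declare function local:strip-invalid($text as xs:string) as xs:string
{
  fn:codepoints-to-string(
    fn:string-to-codepoints($text)[
      . = (9, 10, 13)
      or (. ge 32 and . le 55295)
      or (. ge 57344 and . le 65533)
      or (. ge 65536 and . le 1114111)
    ]
  )
};

(: Same text as in the question, including code point 14 :)
let $raw := fn:concat("Here is invalid character 14: ", fn:codepoints-to-string(14))
(: Sanitize the text before building the element that gets stored :)
let $doc := <TEST>{local:strip-invalid($raw)}</TEST>
return
  xdmp:document-insert("/Test.xml", $doc)
```

Filtering on code points rather than with a regular expression avoids having to write the invalid characters themselves inside the query. Whether silently dropping characters is acceptable depends on the data; rejecting the document instead may be the safer choice.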

Schemas are one method that can be used for document validation.

Alternatively you can explicitly specify the XML version in a document:

Changes to Accepted XML Character Set

As of MarkLogic 9.0-6, parsing of XML documents with an XML declaration that explicitly specifies XML version 1.1 (version="1.1") enforces the XML 1.1 character set. Consequently, you can now create content containing characters not accepted by XML 1.0.

Characters in the XML 1.1 restricted character ranges must be given as character entities. This enforcement applies to the following character ranges:

0x1-0x8 0xB-0xC 0xE-0x1F 0x7F-0x84 0x86-0x9F

The following character ranges that were previously disallowed are now accepted:

0x1-0x8 0xB-0xC 0xE-0x1F
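A small sketch for trying the XML 1.1 behaviour described above, assuming MarkLogic 9.0-6 or later. It parses a string that carries an explicit XML 1.1 declaration and gives code point 14 as a character reference, then lists any control characters found in the parsed text. The sample string is made up for illustration, and the result has not been verified here against a running server.

```xquery
xquery version "1.0-ml";

(: The XML 1.1 declaration asks the parser to apply the XML 1.1 character set.
   The "&amp;" in this XQuery string literal becomes a plain ampersand, so the
   parser receives a character reference for code point 14. :)
let $source := '<?xml version="1.1"?><TEST>Here is character 14: &amp;#14;</TEST>'
let $parsed := xdmp:unquote($source)
(: List the code points below 32 that survived parsing :)
return
  fn:string-to-codepoints(fn:string($parsed/TEST))[. lt 32]
```

Writing the reference without the escaped ampersand would make the XQuery processor resolve it before the XML parser ever sees it, which is not what the release note describes.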

Source: https://stackoverflow.com/questions/57449020/why-is-marklogic-definitely-storing-invalid-characters-in-xml-documents

Tags

marklogic