I have an XML file from a client that has greater-than (>) and less-than (<) signs in it, and it fails an XML format check.
Is there a way to get around this without asking the client to fix the file?
The direct answer to your question, "Is there a way to get around this without asking the client to fix the file?", is "no". The data you are getting is not well-formed XML, and you are correct in rejecting it. I highly recommend going back to the client and saying that they must provide valid XML, using Character Entity References as mentioned by David and Rahul.
To answer your question plainly: no, you cannot have an XML file with an unescaped < in any of its value fields (and > should be escaped as well), because the XML format uses these characters to delimit the parent and child elements, e.g. <note>, <to>, <from>, etc.
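To see this for yourself, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The sample strings and the helper name is_well_formed are made up for illustration; they are not taken from the client's file.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples: a raw "<" in the text versus the escaped form.
bad = "<note><body>a < b</body></note>"
good = "<note><body>a &lt; b</body></note>"

print(is_well_formed(bad))    # False
print(is_well_formed(good))   # True
print(ET.fromstring(good).find("body").text)   # a < b
```

The escaped form is decoded back to the plain character when the file is read, so nothing is lost by escaping.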
Expanding on my answer: when a Python script writes < or > using the XML library, the library translates them to &lt; or &gt;, respectively. I don't believe it is possible to write them unescaped with that library, since it escapes the < and > characters as well as the & that starts a Character Entity Reference. This makes sense: the XML library is preventing you from disrupting the syntax used for the parent xml.etree.ElementTree.Element or any child xml.etree.ElementTree.SubElement object fields. For example, use the code block in this great answer to experiment:
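# xml.etree.cElementTree was removed in Python 3.9; xml.etree.ElementTree is the drop-in replacement.
import xml.etree.ElementTree as ET

root = ET.Element("root")
doc = ET.SubElement(root, "doc")
# The raw < and > in the text are escaped automatically when the tree is written.
ET.SubElement(doc, "field1", name="blah").text = "some <value>"
ET.SubElement(doc, "field2", name="asdfasd").text = "some <other value>"
tree = ET.ElementTree(root)
tree.write("filename.xml")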
This yields <root><doc><field1 name="blah">some &lt;value&gt;</field1><field2 name="asdfasd">some &lt;other value&gt;</field2></doc></root>.
Prettifying it:
<root>
  <doc>
    <field1 name="blah">
      some &lt;value&gt;
    </field1>
    <field2 name="asdfasd">
      some &lt;other value&gt;
    </field2>
  </doc>
</root>
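If you are worried that the escaping changes your data, a quick round trip shows it does not. This is only a sketch reusing the field1 element from the example; it serializes to a string with ET.tostring instead of writing a file.

```python
import xml.etree.ElementTree as ET

root = ET.Element("root")
ET.SubElement(root, "field1", name="blah").text = "some <value>"

# Serialize to a string; the library escapes < and > in the text.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)

# Parse it back; the entity references are decoded to the original characters.
parsed = ET.fromstring(serialized)
print(parsed.find("field1").text)   # some <value>
```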
However, there's nothing stopping you from adding these characters manually: read in the XML file and re-write it, adding text, even if it contains < or >. If you want a proper XML file though, just be sure that these characters are only used unescaped within comments (or CDATA sections).
For your particular problem, you could read in the lines from the client's XML files, then either remove the < and > characters or, if the client requires them, move them to a commented portion of the line. Part of the challenge is that you have to leave in the <note>, etc. portions of the file. This is challenging, but it would be possible!
The following is what I'd expect the result to look like.
<?xml version="1.0" encoding="UTF-8"?>
<note Name="PrintPgmInfo VDD"> <!-- PrintPgmInfo <> VDD -->
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
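As a rough sketch of that kind of pre-processing, the snippet below escapes the stray characters instead of removing them or moving them into a comment, which is a variation on the approach above. It assumes the offending < and > only ever appear inside double-quoted attribute values on a single line, as in the Name attribute; the sample input and the function name escape_in_attribute_values are hypothetical, and a file that breaks that assumption would need something more careful.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical malformed input, modelled on the Name attribute above.
broken = '<note Name="PrintPgmInfo <> VDD"><to>Tove</to></note>'

def escape_in_attribute_values(xml_text):
    """Escape < and > found between a pair of double quotes on one line."""
    def fix(match):
        return match.group(0).replace("<", "&lt;").replace(">", "&gt;")
    # Assumption: every double-quoted run on a line is an attribute value.
    return re.sub(r'"[^"\n]*"', fix, xml_text)

repaired = escape_in_attribute_values(broken)
print(repaired)

# The repaired text now parses, and the attribute keeps its original value.
note = ET.fromstring(repaired)
print(note.get("Name"))   # PrintPgmInfo <> VDD
```

Existing entity references are left alone because the function never touches &.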
You will have to use XML escape characters:
" to &quot;
' to &apos;
< to &lt;
> to &gt;
& to &amp;
Google escaping characters in XML for more information.
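If the file is generated with Python, the standard library can apply these replacements for you. A small sketch with xml.sax.saxutils follows; the sample strings are made up for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

text = 'Tom & "Jerry" <cat>'

# escape() handles &, < and > by default.
print(escape(text))   # Tom &amp; "Jerry" &lt;cat&gt;

# Quotes are only escaped if you pass them in as extra entities.
escaped_all = escape(text, {'"': "&quot;", "'": "&apos;"})
print(escaped_all)    # Tom &amp; &quot;Jerry&quot; &lt;cat&gt;

# quoteattr() escapes a value and wraps it in quotes for use as an attribute.
attr = quoteattr("PrintPgmInfo <> VDD")
print(attr)           # "PrintPgmInfo &lt;&gt; VDD"
```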
You can use the similar-looking full-width less-than (U+FF1C) and full-width greater-than (U+FF1E) signs: ＜＞
These Unicode characters do not require special encoding.
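A minimal sketch of that substitution in Python, assuming the text is available as a str before it goes into the XML (the FULLWIDTH table and to_fullwidth helper are names made up here):

```python
import xml.etree.ElementTree as ET

# Map the ASCII angle brackets to their full-width look-alikes.
FULLWIDTH = str.maketrans({"<": "\uFF1C", ">": "\uFF1E"})

def to_fullwidth(text):
    return text.translate(FULLWIDTH)

value = to_fullwidth("PrintPgmInfo <> VDD")

# The substituted value can sit in an attribute without any escaping.
note = ET.fromstring('<note Name="{}"/>'.format(value))
print(note.get("Name"))

# Caveat: the result is no longer equal to the ASCII original.
print(note.get("Name") == "PrintPgmInfo <> VDD")   # False
```

Keep in mind that anything comparing or searching the value for the ASCII characters will no longer match, so this only suits data that is meant for display.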
You may try to use it like this:
&lt; = <
&gt; = >
These are known as Character Entity References.