Question
I am trying to parse XML data with Python that uses prefixes, but not every file has the declaration of the prefix. Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>
I have been using xml.etree.ElementTree to parse these files, but whenever the prefix is not properly declared, ElementTree throws a parse error (unbound prefix, right at the start of <abc:thing2>).
Searching for this error leads me to solutions that suggest I fix the namespace declaration. However, I do not control the XML that I need to work with, so modifying the input files is not a viable option.
Searching for namespace parsing in general leads me to many questions about searching in namespace-agnostic way, which is not what I need.
I am looking for some way to automatically parse these files, even if the namespace declaration is broken. I have thought about doing the following:
- tell ElementTree what namespaces to expect beforehand, because I do know which ones can occur. I found register_namespace, but that does not seem to work.
- have the full DTD read in before parsing, and see if that solves it. I could not find a way to do this with ElementTree.
- tell ElementTree to not bother about namespaces at all. It should not cause issues with my data, but I found no way to do this
- use some other parsing library that can handle this issue - though I prefer not to need installation of extra libraries. I have difficulty seeing from the documentation if any others would be able to solve my issue.
- some other route that I am currently not seeing?
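For the third idea (not bothering about namespaces at all), one standard-library route is to bypass ElementTree's own parser and drive xml.parsers.expat directly. The following is a minimal sketch, assuming it is acceptable for prefixed names to remain literal tag names such as "abc:thing2". An expat parser created without a namespace_separator does no namespace processing, so an undeclared prefix is not treated as an error. The helper name parse_without_namespaces is hypothetical and not part of ElementTree.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


def parse_without_namespaces(data):
    """Parse XML bytes into an Element tree without resolving prefixes."""
    builder = ET.TreeBuilder()
    # No namespace_separator is passed, so expat does not process
    # namespaces and leaves "abc:thing2" as a plain element name.
    parser = expat.ParserCreate()
    parser.StartElementHandler = builder.start
    parser.EndElementHandler = builder.end
    parser.CharacterDataHandler = builder.data
    parser.Parse(data, True)
    return builder.close()


# The sample document from the question, as bytes.
broken = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""

root = parse_without_namespaces(broken)
for child in root:
    print(child.tag, child.text)
```

Other well-formedness errors still raise an exception here, since only namespace processing is switched off. Path expressions such as root.find("abc:thing2") would interpret the prefix as a namespace, so comparing child.tag directly is probably the safer way to look elements up in the resulting tree.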
UPDATE:
After Har07 put me on the path of lxml, I tried to see if this would let me perform the different solutions I had thought of, and what the result would be:
- telling the parser what namespaces to expect beforehand: I still could not find any 'official' way to do this, but in my searches before I had found the suggestion to simply add the requisite declaration to the data programmatically (for a different programming situation; unfortunately I can't find the link anymore). It seemed terribly hacky to me, but I tried it anyway. It involves loading the data as a string, changing the enclosing element to have the right xmlns declarations, and then handing it off to lxml.etree's fromstring method. Unfortunately, that also requires removing all reference to the encoding declaration from the string. It works, though.
- Read in the DTD before parsing: it is possible with lxml (through attribute_defaults, dtd_validation, or load_dtd), but unfortunately does not solve the namespace issue.
- Telling lxml not to bother about namespaces: possible through the recover option. Unfortunately, that also ignores other ways in which the XML may be broken (see Har07's answer for details).
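The first approach in the list above (adding the missing declaration programmatically) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the exact code used: the helper declare_missing_prefixes, the mapping KNOWN_NAMESPACES, and the URI http://example.com/abc are all hypothetical placeholders. It also assumes the data is in an ASCII-compatible encoding such as UTF-8 and that no comment precedes the root element.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical prefix-to-URI mapping; the URI is a placeholder.
KNOWN_NAMESPACES = {"abc": "http://example.com/abc"}


def declare_missing_prefixes(data, namespaces):
    """Add xmlns declarations for known prefixes to the root start tag."""
    # First start tag that is not a declaration (<?...), a comment or
    # doctype (<!...), or a closing tag (</...).
    match = re.search(rb"<(?![?!/])[^\s/>]+", data)
    if match is None:
        return data
    extra = b""
    for prefix, uri in namespaces.items():
        # Simplistic check: skip prefixes that are declared anywhere.
        if ("xmlns:%s=" % prefix).encode("ascii") not in data:
            extra += (' xmlns:%s="%s"' % (prefix, uri)).encode("ascii")
    return data[:match.end()] + extra + data[match.end():]


broken = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""

fixed = declare_missing_prefixes(broken, KNOWN_NAMESPACES)
root = ET.fromstring(fixed)
print(root.find("{http://example.com/abc}thing2").text)
```

Working on bytes rather than a decoded string means the encoding declaration can stay in place for this standard-library sketch. Unlike the recover option, other kinds of broken XML would still raise a parse error.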
Answer 1:
One possible way is to use an ElementTree-compatible library, lxml. For example:
from lxml import etree as ElementTree
# bytes literal: on Python 3, lxml may reject a str that carries an encoding declaration
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)
thing = tree.xpath("//thing")[0]
print(ElementTree.tostring(thing))
All you need to do to parse non-well-formed XML with lxml is pass recover=True to the XMLParser constructor. lxml also has full support for XPath 1.0, which is very useful when you need to select part of an XML document using more complex criteria.
UPDATE :
I don't know all the types of XML error that the recover=True option can tolerate. But here is another type of error that I know of besides an unbound namespace prefix: an unclosed tag. lxml will fix, rather than ignore, an unclosed tag by adding the corresponding closing tag automatically. For example, given the following broken XML:
xml = """<item subtype="bla">
<thing>Word</thing>
<bad>
<abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)
print(ElementTree.tostring(tree))
The final XML output after parsing with lxml is as follows:
<item subtype="bla">
<thing>Word</thing>
<bad>
<abc:thing2>Another Word</abc:thing2>
</bad></item>
Source: https://stackoverflow.com/questions/30597100/parsing-xml-with-undeclared-prefixes-in-python