lxml.etree fromstring() and tostring() are not returning the same data

99封情书 submitted on 2019-12-06 01:15:40

The "missing the end of the file at some arbitrary point" problem is hard to explain without a complete reproducible example.

But I suspect that what you refer to as "a bunch of crap" are CDATA sections. You have several of those in your example (which is not a single well-formed XML document, btw).

In general, an XML parser is not obliged to preserve CDATA sections intact. Markup such as

<Answer><![CDATA[confirm]]></Answer>

is equivalent to

<Answer>confirm</Answer>    
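As a quick check of that equivalence, here is a minimal Python 3 sketch using lxml's default parser. The sample strings are made up for illustration. Both forms should yield the same text, and a CDATA section containing a special character should come back as an escaped entity:

```python
from lxml import etree

# Same content, once wrapped in CDATA and once as plain text
with_cdata = etree.fromstring("<Answer><![CDATA[confirm]]></Answer>")
plain = etree.fromstring("<Answer>confirm</Answer>")

# Both elements expose the same text content
same_text = with_cdata.text == plain.text
print(with_cdata.text, plain.text, same_text)

# With the default parser, "<" inside CDATA is serialized as an entity
escaped = etree.fromstring("<Answer><![CDATA[a < b]]></Answer>")
escaped_out = etree.tostring(escaped, encoding="unicode")
print(escaped_out)
```

With the default parser, the CDATA wrapper is dropped and only the text survives.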

However, the lxml.etree.XMLParser class takes a strip_cdata parameter (True by default); passing strip_cdata=False preserves CDATA sections. An instance of the parser can be passed to etree.fromstring(). Here is an example:

from lxml import etree

XML = '<QuestionIndex Id="Perm"><Answer><![CDATA[confirm]]></Answer></QuestionIndex>'

print("Original size:", len(XML))

# Default parser: CDATA sections are replaced by plain text
tree1 = etree.fromstring(XML)
out = etree.tostring(tree1, encoding="unicode")
print("With CDATA stripped:", len(out))
print(out)

# strip_cdata=False keeps CDATA sections in the tree
parser = etree.XMLParser(strip_cdata=False)
tree2 = etree.fromstring(XML, parser)
out = etree.tostring(tree2, encoding="unicode")
print("With CDATA kept:", len(out))
print(out)

=>

Original size: 77
With CDATA stripped: 65
<QuestionIndex Id="Perm"><Answer>confirm</Answer></QuestionIndex>
With CDATA kept: 77
<QuestionIndex Id="Perm"><Answer><![CDATA[confirm]]></Answer></QuestionIndex>

This problem turned out to be way simpler than it appears, and the answer is hidden in the code I provided.

f.close

should have been

f.close()

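Without the parentheses, f.close only references the method and never calls it, so the file was never explicitly closed. The difference is the remaining buffer of a few dozen characters that had not been flushed to the file I was checking results in with Notepad++. Closing the file for real made all the difference, and the code works.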
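To illustrate the difference, here is a small Python 3 sketch. The temporary path and the sample data are hypothetical and not taken from the original code. It shows that a bare f.close leaves the file open, and that a with block avoids the problem entirely by closing the file automatically:

```python
import os
import tempfile

# Hypothetical output file and data, for illustration only
path = os.path.join(tempfile.mkdtemp(), "out.xml")
data = "<Answer>confirm</Answer>"

f = open(path, "w")
f.write(data)
f.close                    # no parentheses: the method is referenced, not called
still_open = not f.closed  # the file is still open at this point
f.close()                  # this call flushes the buffer and closes the file

# A with block closes (and flushes) the file when the block exits
with open(path, "w") as f:
    f.write(data)

with open(path) as f:
    written = f.read()

print(still_open, written == data)
```

Using with is generally the safer habit, since there is no close call to forget or mistype.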
