lxml.etree fromstring() and tostring() are not returning the same data

Submitted by 家住魔仙堡 on 2019-12-07 14:26:37

Question


I'm learning lxml (after using ElementTree) and I'm baffled why .fromstring and .tostring do not appear to be reversible. Here's my example:

import lxml.etree as ET
f = open('somefile.xml','r')
data = f.read()
tree_in = ET.fromstring(data)
tree_out = ET.tostring(tree_in)
f2 = open('samefile.xml','w')
f2.write(tree_out)
f2.close

'somefile.xml' was 132 KB. 'samefile.xml' (the output) was 113 KB, and it is missing the end of the file from some arbitrary point onward. The closing tags of the overall tree and a few pieces of the final element are just gone.

Is there something wrong with my code, or must there be something wrong with the nesting in the original XML file? If so, am I forced to use BeautifulSoup or ElementTree again (without xpath)?

One note: The text inside many elements had a bunch of crap that was converted to text, but is that what's causing this problem?

Example:

<QuestionIndex Id="Perm"><Answer><![CDATA[confirm]]></Answer><Answer><![CDATA[NotConfirm]]></Answer></QuestionIndex>
<QuestionIndex Id="Actor"><Answer><![CDATA[GirlLt16]]></Answer><Answer><![CDATA[Fem17to25]]></Answer><Answer><![CDATA[BoyLt16]]></Answer><Answer><![CDATA[Mal17to25]]></Answer><Answer><![CDATA[Moth]]></Answer><Answer><![CDATA[Fath]]></Answer><Answer><![CDATA[Elder]]></Answer><Answer><![CDATA[RelLead]]></Answer><Answer><![CDATA[Auth]]></Answer><Answer><![CDATA[Teach]]></Answer><Answer><![CDATA[Oth]]></Answer></QuestionIndex>

Answer 1:


The "missing the end of the file at some arbitrary point" problem is hard to explain without a complete reproducible example.

But I suspect that what you refer to as "a bunch of crap" are CDATA sections. You have several of those in your example (which is not a single well-formed XML document, btw).

In general, an XML parser is not obliged to preserve CDATA sections intact. Markup such as

<Answer><![CDATA[confirm]]></Answer>

is equivalent to

<Answer>confirm</Answer>    
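
The equivalence can be checked quickly. The following is only a minimal sketch, and it assumes the standard library's xml.etree.ElementTree rather than lxml, so that it runs without extra packages; the variable names are illustrative:

```python
import xml.etree.ElementTree as ET

# Parse the same element once with a CDATA section and once with plain text.
with_cdata = ET.fromstring("<Answer><![CDATA[confirm]]></Answer>")
plain = ET.fromstring("<Answer>confirm</Answer>")

# Both elements expose the same text content.
same_text = with_cdata.text == plain.text
print(same_text)

# Serializing the CDATA version yields ordinary text; the CDATA wrapper is gone.
serialized = ET.tostring(with_cdata, encoding="unicode")
print(serialized)
```

The serialized form is shorter than the input even though no data was lost, which is why comparing file sizes alone can be misleading.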

However, the lxml.etree.XMLParser class takes a strip_cdata parameter, which is True by default. Passing strip_cdata=False preserves CDATA sections, and the resulting parser instance can be passed to etree.fromstring(). Here is an example:

from lxml import etree

XML = '<QuestionIndex Id="Perm"><Answer><![CDATA[confirm]]></Answer></QuestionIndex>'

print("Original size:", len(XML))
tree1 = etree.fromstring(XML)  # default parser: CDATA is replaced by plain text

# encoding="unicode" makes tostring() return str rather than bytes
out = etree.tostring(tree1, encoding="unicode")
print("With CDATA stripped:", len(out))
print(out)

parser = etree.XMLParser(strip_cdata=False)  # keep CDATA sections as written
tree2 = etree.fromstring(XML, parser)

out = etree.tostring(tree2, encoding="unicode")
print("With CDATA kept:", len(out))
print(out)

=>

Original size: 77
With CDATA stripped: 65
<QuestionIndex Id="Perm"><Answer>confirm</Answer></QuestionIndex>
With CDATA kept: 77
<QuestionIndex Id="Perm"><Answer><![CDATA[confirm]]></Answer></QuestionIndex>



Answer 2:


This problem turned out to be way simpler than it appears, and the answer is hidden in the code I provided.

f2.close

should have been

f2.close()

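Without the parentheses, close is only referenced and never called, so the file is not closed and the write buffer is not flushed. The difference is the remaining buffer of a few dozen characters that never made it into the file I was checking in Notepad++. Closing the file for real made all the difference, and the code works.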
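
For anyone hitting the same thing, here is a small sketch of the difference, and of how a with block avoids it. It uses a made-up payload and a temporary file in place of the real XML output, and writes bytes because lxml's tostring() returns bytes by default on Python 3; all names here are illustrative:

```python
import os
import tempfile

# Stand-in for the serialized XML (hypothetical data, not the original file).
payload = b"<root>" + b"<item>x</item>" * 1000 + b"</root>"
path = os.path.join(tempfile.mkdtemp(), "samefile.xml")

f2 = open(path, "wb")
f2.write(payload)
f2.close                       # no parentheses: only a reference, nothing is called
still_open = not f2.closed     # the file is still open at this point
f2.close()                     # the actual call flushes the buffer and closes the file

# A with block closes the file when the block exits, so the call cannot be forgotten.
with open(path, "wb") as f2:
    f2.write(payload)

with open(path, "rb") as f:
    round_trip = f.read()

print(still_open, round_trip == payload)
```

With the with form there is no separate close step to mistype, which is why it is the usual way to write files in Python.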



Source: https://stackoverflow.com/questions/9027081/lxml-etree-fromsting-and-tostring-are-not-returning-the-same-data
