lxml.etree fromstring() and tostring() are not returning the same data

I'm learning lxml (after using ElementTree) and I'm baffled why .fromstring and .tostring do not appear to be reversible. Here's my example:

import lxml.etree as ET
f = open('somefile.xml','r')
data = f.read()
tree_in = ET.fromstring(data)
tree_out = ET.tostring(tree_in)
f2 = open('samefile.xml','w')
f2.write(tree_out)
f2.close

'somefile.xml' was 132 KB. 'samefile.xml' (the output) was 113 KB, and it is missing the end of the file from some arbitrary point onward. The closing tags of the overall tree and a few pieces of the final element are just gone.

Is there something wrong with my code, or must there be something wrong with the nesting in the original XML file? If so, am I forced to use BeautifulSoup or ElementTree again (without XPath)?

One note: The text inside many elements had a bunch of crap that was converted to text, but is that what's causing this problem?

Example:

<QuestionIndex Id="Perm"><Answer><![CDATA[confirm]]></Answer><Answer><![CDATA[NotConfirm]]></Answer></QuestionIndex>
<QuestionIndex Id="Actor"><Answer><![CDATA[GirlLt16]]></Answer><Answer><![CDATA[Fem17to25]]></Answer><Answer><![CDATA[BoyLt16]]></Answer><Answer><![CDATA[Mal17to25]]></Answer><Answer><![CDATA[Moth]]></Answer><Answer><![CDATA[Fath]]></Answer><Answer><![CDATA[Elder]]></Answer><Answer><![CDATA[RelLead]]></Answer><Answer><![CDATA[Auth]]></Answer><Answer><![CDATA[Teach]]></Answer><Answer><![CDATA[Oth]]></Answer></QuestionIndex>

The "missing the end of the file at some arbitrary point" problem is hard to explain without a complete reproducible example.

But I suspect that what you refer to as "a bunch of crap" are CDATA sections. You have several of those in your example (which is not a single well-formed XML document, btw).

In general, an XML parser is not obliged to preserve CDATA sections intact. Markup such as

<Answer><![CDATA[confirm]]></Answer>

is equivalent to

<Answer>confirm</Answer>

However, the lxml.etree.XMLParser class takes a strip_cdata parameter that can be used to preserve CDATA sections. An instance of the parser can be passed to etree.fromstring(). Here is an example:

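from lxml import etree

XML = '<QuestionIndex Id="Perm"><Answer><![CDATA[confirm]]></Answer></QuestionIndex>'

print("Original size:", len(XML))

# Default parser: CDATA sections are replaced by plain text.
# encoding="unicode" makes tostring() return str instead of bytes (Python 3).
tree1 = etree.fromstring(XML)
out = etree.tostring(tree1, encoding="unicode")
print("With CDATA stripped:", len(out))
print(out)

# strip_cdata=False keeps the CDATA sections in the parsed tree.
parser = etree.XMLParser(strip_cdata=False)
tree2 = etree.fromstring(XML, parser)
out = etree.tostring(tree2, encoding="unicode")
print("With CDATA kept:", len(out))
print(out)

Output: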

Original size: 77
With CDATA stripped: 65
<QuestionIndex Id="Perm"><Answer>confirm</Answer></QuestionIndex>
With CDATA kept: 77
<QuestionIndex Id="Perm"><Answer><![CDATA[confirm]]></Answer></QuestionIndex>
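As a quick check of the equivalence claimed above, the following sketch parses both forms of the element and compares the resulting text. It assumes lxml is installed, but falls back to the standard library's xml.etree.ElementTree (which offers the same fromstring() call) if it is not. The variable names are illustrative only.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module is used only as a stand-in where lxml is missing.
    import xml.etree.ElementTree as etree

# The same element, once with a CDATA section and once with plain text.
with_cdata = etree.fromstring('<Answer><![CDATA[confirm]]></Answer>')
plain = etree.fromstring('<Answer>confirm</Answer>')

# With the default parser, both elements carry identical text content.
same_text = with_cdata.text == plain.text
print(with_cdata.text, plain.text, same_text)
```

So the size difference caused by dropping CDATA markers changes the bytes on disk, but not the data an XML consumer sees.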

This problem turned out to be way simpler than it appears, and the answer is hidden in the code I provided.

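f2.close

should have been

f2.close()

Without the parentheses, the method is only referenced and never called, so the file is not closed and the last buffered characters were not yet flushed to disk. That remaining buffer of a few dozen characters never made it into the file I was inspecting in Notepad++. Actually closing the file made all the difference, and the code works.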
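One way to avoid this class of bug is to let a with block close the files. The sketch below is not the original poster's code: round_trip is a hypothetical helper, the input is a small stand-in document written to a temporary directory, and the files are opened in binary mode on the assumption of Python 3, where tostring() returns bytes by default. It falls back to the standard library's ElementTree if lxml is not installed.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib fallback, which provides the same two calls used here.
    import xml.etree.ElementTree as etree


def round_trip(src_path, dst_path):
    """Parse src_path and write the serialized tree to dst_path."""
    with open(src_path, "rb") as src:
        tree = etree.fromstring(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(etree.tostring(tree))
    # Leaving each with block closes (and therefore flushes) the file,
    # even if an exception was raised inside it.


# Hypothetical stand-in for somefile.xml from the question.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "somefile.xml")
dst_path = os.path.join(workdir, "samefile.xml")
with open(src_path, "wb") as f:
    f.write(b'<QuestionIndex Id="Perm"><Answer>confirm</Answer></QuestionIndex>')

round_trip(src_path, dst_path)

# The output file now ends with the closing tag of the root element.
with open(dst_path, "rb") as f:
    result = f.read()
print(result.endswith(b"</QuestionIndex>"))
```

Because the with statement calls close() automatically, there is no method call to forget or mistype.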

Source: https://stackoverflow.com/questions/9027081/lxml-etree-fromsting-and-tostring-are-not-returning-the-same-data

Tags

python

lxml

tostring