Question
I have an XML file with nodes that look like this:
<trkpt lat="-37.7944415" lon="144.9616159">
  <ele>41.3681107</ele>
  <time>2015-04-11T03:52:33.000Z</time>
  <speed>3.9598</speed>
</trkpt>
I am using lxml.etree.iterparse() to parse the tree iteratively. I loop over the children of each trkpt element and want to print each child's text value, for example:
for event, element in etree.iterparse(infile, events=("start", "end")):
    if element.tag == NAMESPACE + 'trkpt':
        for child in list(element):
            print(child.text)
The problem is that at this stage the node has no text, so the output of the print is 'None'.
I verified this by printing etree.tostring(child) instead of child.text, and the output looks like this:
<ele/>
<time/>
<speed/>
According to the documentation, "Note that the text, tail, and children of an Element are not necessarily present yet when receiving the start event. Only the end event guarantees that the Element has been parsed completely."
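The ordering behind that note can be observed directly. Below is a minimal sketch, assuming a made-up namespace URI (http://example.com/gpx) and a hypothetical helper named event_sequence; neither comes from the original file. It falls back to the standard library's xml.etree.ElementTree if lxml is not installed, on the assumption that both yield start/end events in the same order for this input.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib module yields the same event order for this demo.
    import xml.etree.ElementTree as etree

# Hypothetical namespace, used only for illustration.
NAMESPACE = "{http://example.com/gpx}"

SAMPLE = b"""<gpx xmlns="http://example.com/gpx">
  <trkpt lat="-37.7944415" lon="144.9616159">
    <ele>41.3681107</ele>
    <time>2015-04-11T03:52:33.000Z</time>
    <speed>3.9598</speed>
  </trkpt>
</gpx>"""


def event_sequence(xml_bytes):
    """Return (event, local tag name) pairs in the order iterparse yields them."""
    sequence = []
    source = io.BytesIO(xml_bytes)
    for event, element in etree.iterparse(source, events=("start", "end")):
        # Strip the namespace prefix so the output is easier to read.
        sequence.append((event, element.tag.replace(NAMESPACE, "")))
    return sequence


for event, tag in event_sequence(SAMPLE):
    print(event, tag)
```

The "end" events for ele, time and speed all arrive before the "end" event for trkpt, so anything done to the children in the loop body happens before the parent is reported as complete.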
So I changed my for loop to the following (note the 'if event == "end":' statement):
for event, element in etree.iterparse(infile, events=("start", "end")):
    if element.tag == NAMESPACE + 'trkpt':
        if event == "end":
            for child in list(element):
                print(child.text)
But I am still getting the same results. Any help would be greatly appreciated.
Answer 1:
Do you specifically need iterparse, or can you use other methods? For example:
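from lxml import etree

tree = etree.parse('/path/to/file')
root = tree.getroot()
# iter() walks the whole tree; the tag needs the NAMESPACE prefix from the question,
# otherwise nothing matches in a namespaced document
for trkpt in root.iter(NAMESPACE + 'trkpt'):
    for child in trkpt:
        print(child.text)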
lxml is pretty good at parsing without taking up too much memory. I'm not sure whether this solves your problem, or whether you specifically need the method above.
Answer 2:
Are you sure that you don't call e.g. element.clear() after your conditional statement, like this?
for event, element in etree.iterparse(infile, events=("start", "end")):
    if element.tag == NAMESPACE + 'trkpt' and event == 'end':
        for child in list(element):
            print(child.text)
    # outside the if: runs on every event, including those of the child elements
    element.clear()
The problem is that the parser issues the events for the child elements before it sends the end event for trkpt (because it encounters the end tags of the nested elements first). If you modify the parsed elements in any way before the end event is issued for the outer element, the behaviour you describe may occur.
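To make that concrete, here is a reduced sketch of the effect. It is not the code from the question: it listens only for "end" events so that the result does not depend on how much input the parser has already buffered, and the namespace URI and both function names are made up for illustration. It falls back to the standard library's xml.etree.ElementTree if lxml is missing, assuming clear() resets an element's text the same way in both.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: iterparse() and clear() behave the same here in the stdlib.
    import xml.etree.ElementTree as etree

# Hypothetical namespace, used only for illustration.
NAMESPACE = "{http://example.com/gpx}"

SAMPLE = b"""<gpx xmlns="http://example.com/gpx">
  <trkpt lat="-37.7944415" lon="144.9616159">
    <ele>41.3681107</ele>
    <time>2015-04-11T03:52:33.000Z</time>
    <speed>3.9598</speed>
  </trkpt>
</gpx>"""


def texts_clearing_everything(xml_bytes):
    """Clear every element as soon as its end event arrives."""
    texts = []
    for _event, element in etree.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if element.tag == NAMESPACE + "trkpt":
            texts.extend(child.text for child in element)
        # Runs for <ele>, <time> and <speed> too, before <trkpt> ends,
        # so their text is already gone when the parent is inspected.
        element.clear()
    return texts


def texts_clearing_trkpt_only(xml_bytes):
    """Clear an element only after it has been fully processed."""
    texts = []
    for _event, element in etree.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if element.tag == NAMESPACE + "trkpt":
            texts.extend(child.text for child in element)
            element.clear()  # safe: the children have been read already
    return texts


print(texts_clearing_everything(SAMPLE))
print(texts_clearing_trkpt_only(SAMPLE))
```

The first call prints three None values, matching the empty <ele/>, <time/> and <speed/> nodes from the question, while the second prints the three text values.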
Consider the following alternative:
for event, element in etree.iterparse(infile, events=('end',),
                                      tag=NAMESPACE + 'trkpt'):
    for child in element:
        print(child.text)
    # only trkpt end events reach this loop, so clearing here is safe
    element.clear()
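If the goal is to collect the values rather than print them, the same pattern can be wrapped in a generator. This is only a sketch under stated assumptions: iter_trackpoints is a hypothetical helper, the namespace URI is made up, and the second track point in the sample contains invented values. The tag check inside the loop stands in for lxml's tag= argument, so the code also runs on the standard library's xml.etree.ElementTree, whose iterparse does not accept that argument.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib module is an adequate stand-in for this sketch.
    import xml.etree.ElementTree as etree

# Hypothetical namespace, used only for illustration.
NAMESPACE = "{http://example.com/gpx}"

# The second <trkpt> holds invented values, added only to show streaming.
SAMPLE = b"""<gpx xmlns="http://example.com/gpx">
  <trkpt lat="-37.7944415" lon="144.9616159">
    <ele>41.3681107</ele>
    <time>2015-04-11T03:52:33.000Z</time>
    <speed>3.9598</speed>
  </trkpt>
  <trkpt lat="-37.7944500" lon="144.9616200">
    <ele>41.4</ele>
    <time>2015-04-11T03:52:34.000Z</time>
    <speed>4.0</speed>
  </trkpt>
</gpx>"""


def iter_trackpoints(source):
    """Yield one dict per <trkpt>, built from its attributes and child texts."""
    for _event, element in etree.iterparse(source, events=("end",)):
        if element.tag != NAMESPACE + "trkpt":
            continue  # stands in for lxml's tag= filter
        point = {"lat": element.get("lat"), "lon": element.get("lon")}
        for child in element:
            # Key each value by the child's local name, e.g. "ele".
            point[child.tag.replace(NAMESPACE, "")] = child.text
        # The values are copied into the dict, so the element can be freed.
        element.clear()
        yield point


for point in iter_trackpoints(io.BytesIO(SAMPLE)):
    print(point)
```

Because each dict is built before element.clear() runs, the yielded records keep their values even though the parsed elements are emptied as the loop advances.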
Source: https://stackoverflow.com/questions/29689256/lxml-etree-iterparse-and-parsing-element-completely