lxml.etree iterparse() and parsing element completely

落爺英雄遲暮 submitted on 2019-12-11 10:36:52

Question


I have an XML file with nodes that look like this:

<trkpt lat="-37.7944415" lon="144.9616159">
  <ele>41.3681107</ele>
  <time>2015-04-11T03:52:33.000Z</time>
  <speed>3.9598</speed>
</trkpt>

I am using lxml.etree.iterparse() to iteratively parse the tree. I loop over each trkpt element's children and want to print the text value of each child node, e.g.:

for event, element in etree.iterparse(infile, events=("start", "end")):
    if element.tag == NAMESPACE + 'trkpt':
        for child in list(element):
            print(child.text)

The problem is that at this stage the node has no text, so the output of the print is 'None'.

I verified this by printing etree.tostring(child) instead of child.text; the output looks like this:

<ele/>
<time/>
<speed/>    

According to the documentation, "Note that the text, tail, and children of an Element are not necessarily present yet when receiving the start event. Only the end event guarantees that the Element has been parsed completely."

So I changed my for loop to the following (note the 'if event == "end":' check):

for event, element in etree.iterparse(infile, events=("start", "end")):
    if element.tag == NAMESPACE + 'trkpt':
        if event == "end":
            for child in list(element):
                print(child.text)

But I am still getting the same results. Any help would be greatly appreciated.


Answer 1:


Do you need to use iterparse specifically, or can you use another method?

For example:

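from lxml import etree

tree = etree.parse('/path/to/file')
root = tree.getroot()
# iter() searches the whole tree; findall('trkpt') would only match direct,
# un-namespaced children of the root. NAMESPACE is the '{uri}' prefix from the question.
for element in root.iter(NAMESPACE + 'trkpt'):
    for child in element:
        print(child.text)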

lxml parses quickly and is fairly memory-efficient, although etree.parse() does load the whole tree into memory. I'm not sure whether this solves your problem or whether you need the specific method above.




Answer 2:


Are you sure that you don't call e.g. element.clear() after your conditional statement, like this?

for event, element in etree.iterparse(infile, events=("start", "end")):
    if element.tag == NAMESPACE + 'trkpt' and event == 'end':
        for child in list(element):
            print(child.text)
    # Runs for every event, including the children's own end events.
    element.clear()

The problem is that the parser issues the events for the child elements before it issues the end event for trkpt (because it encounters the end tags of the nested elements first). If you modify the parsed elements before the end event of the outer element arrives, for example by calling clear() on them (which removes their text), you may see the behaviour you describe.
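To make the ordering concrete, here is a minimal sketch rather than a definitive diagnosis of the asker's code. The sample document, the GPX 1.1 namespace URI, and the helper name child_texts are assumptions made for illustration; the fallback import only exists so the sketch also runs where lxml is not installed. It listens to end events only and compares clearing every element with clearing only the trkpt elements:

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library if lxml is missing
    import xml.etree.ElementTree as etree

# Hypothetical sample data modelled on the node shown in the question.
GPX = b"""<gpx xmlns="http://www.topografix.com/GPX/1/1">
  <trkpt lat="-37.7944415" lon="144.9616159">
    <ele>41.3681107</ele>
    <time>2015-04-11T03:52:33.000Z</time>
    <speed>3.9598</speed>
  </trkpt>
</gpx>"""
NAMESPACE = "{http://www.topografix.com/GPX/1/1}"  # assumed namespace


def child_texts(clear_every_element):
    """Collect the child texts visible at each trkpt end event."""
    texts = []
    for _event, element in etree.iterparse(io.BytesIO(GPX), events=("end",)):
        is_trkpt = element.tag == NAMESPACE + "trkpt"
        if is_trkpt:
            texts.extend(child.text for child in element)
        # The end events of ele, time and speed arrive before trkpt's, so
        # clearing them here wipes their text before trkpt is processed.
        if clear_every_element or is_trkpt:
            element.clear()
    return texts


broken = child_texts(clear_every_element=True)
fixed = child_texts(clear_every_element=False)
print(broken)
print(fixed)
```

In the first call the children are still attached to trkpt but their text is already gone, which matches the empty <ele/>, <time/> and <speed/> output described in the question.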

Consider the following alternative:

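# tag= restricts the events to trkpt elements, so clear() never touches
# a child before its parent has been processed.
for event, element in etree.iterparse(infile, events=('end',),
                                      tag=NAMESPACE + 'trkpt'):
    for child in element:
        print(child.text)
    element.clear()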
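If you need the values after the loop rather than just printing them, one option is to copy them out before calling clear(). The following is only a sketch under assumptions: the generator name iter_trackpoints, the sample data, and the GPX 1.1 namespace URI are hypothetical, and the manual tag check stands in for lxml's tag= keyword so that the code also runs on the standard library's ElementTree.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library if lxml is missing
    import xml.etree.ElementTree as etree

NAMESPACE = "{http://www.topografix.com/GPX/1/1}"  # assumed namespace


def iter_trackpoints(source):
    """Yield one plain dict per trkpt, then release the parsed element."""
    for _event, element in etree.iterparse(source, events=("end",)):
        if element.tag != NAMESPACE + "trkpt":
            continue
        point = dict(element.attrib)  # lat / lon attributes
        for child in element:
            if isinstance(child.tag, str):  # skip comments and processing instructions
                # Use the local name (without the '{uri}' prefix) as the key.
                point[child.tag.rsplit("}", 1)[-1]] = child.text
        yield point
        element.clear()  # safe here: the values were copied into the dict


# Hypothetical sample data modelled on the node shown in the question.
SAMPLE = b"""<gpx xmlns="http://www.topografix.com/GPX/1/1">
  <trkpt lat="-37.7944415" lon="144.9616159">
    <ele>41.3681107</ele>
    <time>2015-04-11T03:52:33.000Z</time>
    <speed>3.9598</speed>
  </trkpt>
</gpx>"""

points = list(iter_trackpoints(io.BytesIO(SAMPLE)))
print(points)
```

Because each dict holds ordinary strings, it stays valid after the element has been cleared, which keeps memory use low on large files.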


Source: https://stackoverflow.com/questions/29689256/lxml-etree-iterparse-and-parsing-element-completely
