iterparse | 易学教程

Why is elementtree.ElementTree.iterparse using so much memory?

I am using elementtree.ElementTree.iterparse to parse a large (371 MB) XML file. My code is basically this:

    outf = open('out.txt', 'w')
    context = iterparse('copyright.xml')
    context = iter(context)
    dummy, root = next(context)
    for event, elem in context:
        if elem.tag == 'foo':
            author = elem.text
        elif elem.tag == 'bar':
            if elem.text is not None and 'bat' in elem.text.lower():
                outf.write(elem.text + '\n')
        elem.clear()  # line A
        root.clear()  # line B

My question is two-fold. First: do I need both A and B (see the comments in the code snippet)? I was told that root.clear() clears unnecessary children so memory ...
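
A minimal sketch of how lines A and B differ, written against the standard-library xml.etree.ElementTree (not the old standalone elementtree package) in Python 3. The sample data, the `rec` wrapper tag and the `parse` helper are assumptions made for illustration; only the `foo` and `bar` tags and the `'bat'` test come from the question. The root element is only delivered first when "start" events are requested, so the sketch asks for them explicitly.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the large file: 1000 <rec> children under one root.
SAMPLE = (
    b"<root>"
    + b"<rec><foo>a</foo><bar>some bat text</bar></rec>" * 1000
    + b"</root>"
)


def parse(source, clear_root):
    """Return (matching texts, number of children still attached to the root)."""
    matches = []
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first "start" event carries the root element
    for event, elem in context:
        if event != "end":
            continue
        if elem.tag == "bar" and elem.text and "bat" in elem.text.lower():
            matches.append(elem.text)
        if elem.tag == "rec":
            elem.clear()  # line A: empty the finished record
            if clear_root:
                root.clear()  # line B: detach finished records from the root
    return matches, len(root)


matches_a, left_a = parse(io.BytesIO(SAMPLE), clear_root=False)
matches_b, left_b = parse(io.BytesIO(SAMPLE), clear_root=True)
print(left_a, left_b)  # 1000 0
```

With elem.clear() alone, each finished record is emptied but stays attached to the root, so the root's child list keeps growing with the file. Adding root.clear() drops those references as well.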

Parsing Large XML file with Python lxml and Iterparse

I'm attempting to write a parser using lxml and the iterparse method to step through a very large XML file containing many items. My file is of the format:

    <item>
      <title>Item 1</title>
      <desc>Description 1</desc>
      <url>
        <item>http://www.url1.com</item>
      </url>
    </item>
    <item>
      <title>Item 2</title>
      <desc>Description 2</desc>
      <url>
        <item>http://www.url2.com</item>
      </url>
    </item>

and so far my solution is:

    from lxml import etree

    context = etree.iterparse(MYFILE, tag='item')
    for event, elem in context:
        print(elem.xpath('description/text()'))
        elem.clear()
        while elem.getprevious() is not None: ...
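
Two details of the sample are worth noting: the XPath `description/text()` does not match the `<desc>` element shown in the file, and the `<item>` nested inside `<url>` has the same tag name as the records themselves. The sketch below is one possible way to handle only the top-level records. It swaps in the standard-library xml.etree.ElementTree instead of lxml, and the `<items>` wrapper root and the `iter_items` helper are assumptions added for illustration.

```python
import io
import xml.etree.ElementTree as ET

# The question's sample, wrapped in a hypothetical <items> root so it is well-formed.
SAMPLE = b"""<items>
<item><title>Item 1</title><desc>Description 1</desc>
  <url><item>http://www.url1.com</item></url></item>
<item><title>Item 2</title><desc>Description 2</desc>
  <url><item>http://www.url2.com</item></url></item>
</items>"""


def iter_items(source):
    """Yield (title, desc, url) for each top-level <item>, then free it."""
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first element to start is the root
            depth += 1
            continue
        depth -= 1
        # After the decrement, depth == 1 means elem is a direct child of the
        # root; the <item> nested inside <url> ends at depth 3 and is skipped.
        if elem.tag == "item" and depth == 1:
            yield (
                elem.findtext("title"),
                elem.findtext("desc"),
                elem.findtext("url/item"),
            )
            root.clear()  # drop the finished record from the tree


rows = list(iter_items(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

In lxml itself, the idiom the question is heading towards (clearing the element and then deleting its earlier siblings) serves the same purpose as root.clear() here.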

Parsing huge, badly encoded XML files in Python

I have been working on code that parses external XML files. Some of these files are huge, up to gigabytes of data. Needless to say, they need to be parsed as a stream, because loading them into memory is far too inefficient and often leads to out-of-memory errors. I have used the libraries minidom, ElementTree and cElementTree, and I am currently using lxml. Right now I have a working, fairly memory-efficient script using lxml.etree.iterparse. The problem is that some of the XML files I ...

Ignore encoding errors in Python (iterparse)?

I've been fighting with this for an hour now. I'm parsing an XML string with iterparse. However, the data is not encoded properly, and I am not its provider, so I can't fix the encoding. Here's the error I get:

    lxml.etree.XMLSyntaxError: line 8167: Input is not proper UTF-8, indicate encoding ! Bytes: 0xEA 0x76 0x65 0x73

How can I simply ignore this error and continue parsing? I don't mind if one character is not saved properly; I just need the data. Here's what I've tried, all picked up from the internet:

    data = data.encode('UTF-8','ignore')
    data = unicode(data,errors='ignore')
    data ...
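
One possible workaround, sketched with the standard library in Python 3 rather than lxml: decode the raw bytes as UTF-8 with `errors="replace"`, re-encode, and stream-parse the cleaned bytes. The sample input and the `iterparse_lenient` helper are assumptions for illustration. The byte 0xEA from the error message is "ê" in Latin-1, so the sample embeds it in an otherwise UTF-8 document.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input: declared as UTF-8 but containing a stray Latin-1 byte (0xEA).
RAW = b'<?xml version="1.0" encoding="UTF-8"?><doc><w>r\xeaves</w><w>ok</w></doc>'


def iterparse_lenient(raw_bytes):
    """Replace undecodable bytes with U+FFFD, then stream-parse the result."""
    cleaned = raw_bytes.decode("utf-8", errors="replace").encode("utf-8")
    for _, elem in ET.iterparse(io.BytesIO(cleaned), events=("end",)):
        yield elem


words = [elem.text for elem in iterparse_lenient(RAW) if elem.tag == "w"]
print(words)
```

The damaged character comes out as U+FFFD and the rest of the data survives. If the input is known to be Latin-1 throughout, decoding it as "latin-1" instead would keep the character. With lxml, passing `recover=True` to `etree.iterparse` may also be worth trying, since it asks the parser to continue past errors.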

Iteratively parsing HTML (with lxml?)

I'm currently trying to iteratively parse a very large HTML document (I know... yuck) to reduce the amount of memory used. The problem I'm having is that I'm getting XML syntax errors such as:

    lxml.etree.XMLSyntaxError: Attribute name redefined, line 134, column 59

This then causes everything to stop. Is there a way to iteratively parse HTML without choking on syntax errors? At the moment I'm extracting the line number from the XML syntax error exception, removing that line from the document, ...
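
As an illustration of tolerant, incremental HTML parsing, here is a sketch that swaps in the standard-library `html.parser.HTMLParser` for lxml. It is fed in chunks, keeps no tree in memory, and does not reject repeated attributes. The `LinkCollector` class, the `collect_links` helper and the sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the first href of every <a> tag without building a tree."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; repeated names are kept as-is.
        for name, value in attrs:
            if name == "href":
                self.links.append(value)
                break


def collect_links(chunks):
    """Feed the document piece by piece, e.g. from repeated file.read(size) calls."""
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.links


# Hypothetical malformed input: a redefined attribute and an unclosed <p>.
BAD_HTML = [
    '<html><body><a href="/one" href="/dup">one</a>',
    '<p>unclosed <a href="/two">two</a></body>',
]
links = collect_links(BAD_HTML)
print(links)
```

If staying with lxml is preferred, `etree.iterparse` accepts an `html=True` keyword that switches it to the HTML parser, which is generally more forgiving than the XML one.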

Iterparse big XML, with low memory footprint, and get all, even nested, Sequence Elements

I have written a small Python script to parse XML data, based on Liza Daly's blog. However, my code does not parse all the nodes. For example, when a person has had multiple addresses, it takes only the first available address. The XML tree looks like this:

    - lgs
      - entities
        - entity
          - id
          - name
          - addressess
            - address
              - address1
            - address
              - address1
        - entity
          - id
          (...)

and this is the Python script:

    import os
    import time
    from datetime import datetime
    import lxml.etree as ET
    import pandas as pd

    xml_file = '.\\FILE.XML'
    file_name, file_extension = os.path.splitext(os
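
The script is cut off above, so the actual cause cannot be seen here. One common reason for getting only the first address is that `find()` and `findtext()` return the first match only. The sketch below shows one way to emit a row per address with `findall()`, using the standard-library xml.etree.ElementTree instead of lxml. The sample values and the `iter_entity_rows` helper are assumptions; the tag names (including `addressess`) follow the tree in the question.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical data following the tree described in the question.
SAMPLE = b"""<lgs><entities>
<entity><id>1</id><name>Ann</name>
  <addressess>
    <address><address1>1 Main St</address1></address>
    <address><address1>2 Side St</address1></address>
  </addressess></entity>
<entity><id>2</id><name>Bob</name>
  <addressess>
    <address><address1>3 High St</address1></address>
  </addressess></entity>
</entities></lgs>"""


def iter_entity_rows(source):
    """Yield one dict per (entity, address) pair, then empty the entity."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "entity":
            continue
        base = {"id": elem.findtext("id"), "name": elem.findtext("name")}
        addresses = elem.findall("addressess/address")  # all matches, not just the first
        for addr in addresses or [None]:  # still emit a row if there is no address
            row = dict(base)
            row["address1"] = addr.findtext("address1") if addr is not None else None
            yield row
        elem.clear()  # free the entity's children once its rows are emitted


rows = list(iter_entity_rows(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)` if a table is needed. Note that elem.clear() leaves the emptied `entity` elements attached to `entities`, so very large files may also need the parent cleared.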

ElementTree iterparse strategy

I have to handle fairly big XML documents (up to 1 GB) and parse them with Python. I am using the iterparse() function (SAX-style parsing). My concern is the following. Imagine you have an XML document like this:

    <?xml version="1.0" encoding="UTF-8" ?>
    <families>
      <family>
        <name>Simpson</name>
        <members>
          <name>Homer</name>
          <name>Marge</name>
          <name>Bart</name>
        </members>
      </family>
      <family>
        <name>Griffin</name>
        <members>
          <name>Peter</name>
          <name>Brian</name>
          <name>Meg</name>
        </members>
      </family>
    </families>

The problem is, of course, to know when I am getting a family name (such as Simpson) and when I am ...
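
One possible strategy, sketched here with the standard-library xml.etree.ElementTree, is to request both "start" and "end" events and keep a stack of open tag names. The full path then tells a family `<name>` apart from a member `<name>`. The `iter_families` helper is an assumption made for illustration; the XML is the sample from the question.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members><name>Homer</name><name>Marge</name><name>Bart</name></members>
  </family>
  <family>
    <name>Griffin</name>
    <members><name>Peter</name><name>Brian</name><name>Meg</name></members>
  </family>
</families>"""


def iter_families(source):
    """Yield (family_name, [member names]) using a stack of open tags."""
    path = []
    family, members = None, []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue
        # On "end", the stack still ends with the tag that is closing.
        if path == ["families", "family", "name"]:
            family = elem.text
        elif path == ["families", "family", "members", "name"]:
            members.append(elem.text)
        elif path == ["families", "family"]:
            yield family, members
            family, members = None, []
            elem.clear()  # free the finished family's children
        path.pop()


families = list(iter_families(io.BytesIO(SAMPLE)))
print(families)
```

A simpler alternative, when one family fits comfortably in memory, is to wait for the "end" event of `<family>` and query it with `findtext("name")` and `findall("members/name")`.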

Why is lxml.etree.iterparse() eating up all my memory?

This eventually consumes all my available memory, and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags, but that didn't make a difference. What am I doing wrong, and how can I process this large file with iterparse()?

    import lxml.etree

    for schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
        print("why does this consume all my memory?")

I can easily cut it up and process it in smaller chunks, but that's uglier than I'd like.

As iterparse iterates over the entire file, a tree is built and no elements are freed. The advantage of doing this is ...