iterparse

Why is elementtree.ElementTree.iterparse using so much memory?

落花浮王杯 posted on 2019-12-01 13:39:25
I am using elementtree.ElementTree.iterparse to parse a large (371 MB) XML file. My code is basically this:

    outf = open('out.txt', 'w')
    context = iterparse('copyright.xml')
    context = iter(context)
    dummy, root = context.next()
    for event, elem in context:
        if elem.tag == 'foo':
            author = elem.text
        elif elem.tag == 'bar':
            if elem.text is not None and 'bat' in elem.text.lower():
                outf.write(elem.text + '\n')
        elem.clear()  #line A
        root.clear()  #line B

My question is two-fold. First: do I need both A and B (see the comments in the code snippet)? I was told that root.clear() clears unnecessary children so memory
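
Editor's sketch of the pattern the question describes, ported to Python 3 and the standard-library xml.etree.ElementTree. The function name matching_bar_texts, the needle parameter, and the in-memory sample are illustrative assumptions, not part of the original post. It keeps both clear() calls: elem.clear() empties the element just finished, and root.clear() drops the emptied children that would otherwise stay attached to the root.

```python
import io
import xml.etree.ElementTree as ET


def matching_bar_texts(source, needle="bat"):
    """Collect the text of <bar> elements containing `needle` (hypothetical helper)."""
    # Ask for "start" events too, so the first event hands us the root element.
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)

    matches = []
    for event, elem in context:
        if event != "end":
            continue  # element content is only complete at its "end" event
        if elem.tag == "bar" and elem.text and needle in elem.text.lower():
            matches.append(elem.text)
        elem.clear()  # line A: free this element's text, attributes, children
        root.clear()  # line B: detach already-processed children from the root
    return matches


sample = io.BytesIO(b"<root><foo>A</foo><bar>Batman</bar><bar>none</bar><bar/></root>")
print(matching_bar_texts(sample))
```

Clearing the root on every event is the simplest form of the idea; for deeply nested records it may be preferable to clear only when a whole record has ended.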

Parsing Large XML file with Python lxml and Iterparse

倖福魔咒の posted on 2019-12-01 09:54:44
I'm attempting to write a parser using lxml and the iterparse method to step through a very large XML file containing many items. My file is of the format:

    <item>
      <title>Item 1</title>
      <desc>Description 1</desc>
      <url>
        <item>http://www.url1.com</item>
      </url>
    </item>
    <item>
      <title>Item 2</title>
      <desc>Description 2</desc>
      <url>
        <item>http://www.url2.com</item>
      </url>
    </item>

and so far my solution is:

    from lxml import etree

    context = etree.iterparse( MYFILE, tag='item' )

    for event, elem in context :
        print elem.xpath( 'description/text( )' )

        elem.clear( )
        while elem.getprevious( ) is not None :

Parsing Large XML file with Python lxml and Iterparse

戏子无情 posted on 2019-12-01 07:28:47
Question: I'm attempting to write a parser using lxml and the iterparse method to step through a very large XML file containing many items. My file is of the format:

    <item>
      <title>Item 1</title>
      <desc>Description 1</desc>
      <url>
        <item>http://www.url1.com</item>
      </url>
    </item>
    <item>
      <title>Item 2</title>
      <desc>Description 2</desc>
      <url>
        <item>http://www.url2.com</item>
      </url>
    </item>

and so far my solution is:

    from lxml import etree

    context = etree.iterparse( MYFILE, tag='item' )

    for event, elem in

Parsing huge, badly encoded XML files in Python

青春壹個敷衍的年華 posted on 2019-12-01 00:06:53
Question: I have been working on code that parses external XML files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream, because loading them into memory is much too inefficient and often leads to out-of-memory failures. I have used the libraries minidom, ElementTree, and cElementTree, and I am currently using lxml. Right now I have a working, fairly memory-efficient script using lxml.etree.iterparse. The problem is that some of the XML files I

Ignore encoding errors in Python (iterparse)?

邮差的信 posted on 2019-11-30 23:59:37
I've been fighting with this for an hour now. I'm parsing an XML string with iterparse. However, the data is not encoded properly, and I am not the provider of it, so I can't fix the encoding. Here's the error I get:

    lxml.etree.XMLSyntaxError: line 8167: Input is not proper UTF-8, indicate encoding ! Bytes: 0xEA 0x76 0x65 0x73

How can I simply ignore this error and continue parsing? I don't mind if one character is not saved properly; I just need the data. Here's what I've tried, all picked up from the internet:

    data = data.encode('UTF-8','ignore')
    data = unicode(data,errors='ignore')
    data
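
Editor's sketch of one way to get past the invalid bytes, assuming the whole document is already in memory as a byte string (as the question implies) and that losing the damaged characters is acceptable. It uses Python 3 and the standard-library parser rather than lxml; the helper name item_texts_lossy and the tag name item are hypothetical. The idea is to decode with errors="replace", which turns each undecodable byte into U+FFFD, and re-encode to clean UTF-8 before parsing. The bytes quoted in the error (0xEA 0x76 0x65 0x73) read as "êves" in Latin-1, so the data may really be Latin-1; that is a guess, not something the post confirms.

```python
import io
import xml.etree.ElementTree as ET


def item_texts_lossy(raw_bytes, tag="item"):
    """Parse possibly mis-encoded UTF-8 XML, replacing undecodable bytes."""
    # Invalid byte sequences become U+FFFD instead of aborting the parse.
    cleaned = raw_bytes.decode("utf-8", errors="replace").encode("utf-8")

    texts = []
    for _event, elem in ET.iterparse(io.BytesIO(cleaned), events=("end",)):
        if elem.tag == tag:
            texts.append(elem.text)
            elem.clear()  # release the element once its text is captured
    return texts


dirty = b"<root><item>r\xeaves</item><item>ok</item></root>"
print(item_texts_lossy(dirty))
```

If the true encoding is known, decoding with that codec instead of "utf-8" keeps the characters intact rather than replacing them.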

Iteratively parsing HTML (with lxml?)

佐手、 posted on 2019-11-30 08:40:34
Question: I'm currently trying to iteratively parse a very large HTML document (I know... yuck) to reduce the amount of memory used. The problem I'm having is that I'm getting XML syntax errors such as:

    lxml.etree.XMLSyntaxError: Attribute name redefined, line 134, column 59

This then causes everything to stop. Is there a way to iteratively parse HTML without choking on syntax errors? At the moment I'm extracting the line number from the XML syntax error exception, removing that line from the document,
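
Editor's sketch of an alternative technique, not the approach from the post: the standard-library html.parser.HTMLParser accepts input in chunks through feed() and does not raise on malformed markup such as a repeated attribute or an unclosed tag. The class LinkCollector and the function collect_links are hypothetical names, and collecting href values is only an example task.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the first href of every <a> tag seen (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; duplicates are kept in order.
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)
                    break


def collect_links(chunks):
    """Feed an iterable of text chunks to the parser, one at a time."""
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.links


chunks = [
    '<html><body><a href="one" href="two">x</a>',   # attribute redefined
    '<p>unclosed<a href="three">y</body>',           # missing closing tags
]
print(collect_links(chunks))
```

Because this is an event-style parser, nothing is retained beyond what the handler stores, so memory use depends on the handler rather than the document size.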

Iterparse big XML, with low memory footprint, and get all, even nested, Sequence Elements

|▌冷眼眸甩不掉的悲伤 posted on 2019-11-29 18:03:26
I have written a small Python script to parse XML data, based on Liza Daly's blog post. However, my code does not parse all the nodes. For example, when a person has had multiple addresses, it takes only the first available address. The XML tree looks like this:

    - lgs
      - entities
        - entity
          - id
          - name
          - addressess
            - address
              - address1
            - address
              - address1
        - entity
          - id
          (...)

and this is the Python script:

    import os
    import time
    from datetime import datetime
    import lxml.etree as ET
    import pandas as pd

    xml_file = '.\\FILE.XML'
    file_name, file_extension = os.path.splitext(os
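
Editor's sketch of how every nested address could be collected per entity, assuming the tree outlined in the question. It uses the standard-library xml.etree.ElementTree instead of lxml, and the generator name entity_addresses is hypothetical. The point is to wait for the end event of each entity, at which time all of its children are present, and to iterate over all address descendants rather than taking the first match.

```python
import io
import xml.etree.ElementTree as ET


def entity_addresses(source):
    """Yield (id, name, [address1, ...]) for each <entity> (hypothetical helper)."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "entity":
            continue  # children stay intact until their entity is complete
        # iter() walks every <address> below the entity, not just the first one.
        addresses = [addr.findtext("address1") for addr in elem.iter("address")]
        yield elem.findtext("id"), elem.findtext("name"), addresses
        elem.clear()  # free the finished entity before moving on


sample = io.BytesIO(
    b"<lgs><entities>"
    b"<entity><id>1</id><name>Ann</name><addressess>"
    b"<address><address1>First St</address1></address>"
    b"<address><address1>Second St</address1></address>"
    b"</addressess></entity>"
    b"<entity><id>2</id><name>Bob</name><addressess/></entity>"
    b"</entities></lgs>"
)
for row in entity_addresses(sample):
    print(row)
```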

Iterparse big XML, with low memory footprint, and get all, even nested, Sequence Elements

左心房为你撑大大i posted on 2019-11-28 11:37:51
Question: I have written a small Python script to parse XML data, based on Liza Daly's blog post. However, my code does not parse all the nodes. For example, when a person has had multiple addresses, it takes only the first available address. The XML tree looks like this:

    - lgs
      - entities
        - entity
          - id
          - name
          - addressess
            - address
              - address1
            - address
              - address1
        - entity
          - id
          (...)

and this is the Python script:

    import os
    import time
    from datetime import datetime
    import lxml.etree

ElementTree iterparse strategy

谁说我不能喝 posted on 2019-11-28 04:24:32
I have to handle XML documents that are fairly large (up to 1 GB) and parse them with Python. I am using the iterparse() function (SAX-style parsing). My concern is the following. Imagine you have XML like this:

    <?xml version="1.0" encoding="UTF-8" ?>
    <families>
      <family>
        <name>Simpson</name>
        <members>
          <name>Homer</name>
          <name>Marge</name>
          <name>Bart</name>
        </members>
      </family>
      <family>
        <name>Griffin</name>
        <members>
          <name>Peter</name>
          <name>Brian</name>
          <name>Meg</name>
        </members>
      </family>
    </families>

The problem is, of course, knowing when I am getting a family name (such as Simpson) and when I am
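
Editor's sketch of one way to tell the two kinds of name element apart: listen for both start and end events and keep a stack of open tags, so the parent of any element is always known. The function name family_members is hypothetical, and the sketch assumes each family's own name appears before its members, as in the sample above. It uses the standard-library xml.etree.ElementTree.

```python
import io
import xml.etree.ElementTree as ET


def family_members(source):
    """Map each family name to its member names using a stack of open tags."""
    path = []        # tags of the elements currently open, outermost first
    families = {}
    current = None   # name of the family being read

    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue

        # "end" event: path[-1] is this element, path[-2] is its parent.
        if elem.tag == "name":
            parent = path[-2]
            if parent == "family":
                current = elem.text
                families[current] = []
            elif parent == "members":
                families[current].append(elem.text)
        elif elem.tag == "family":
            elem.clear()  # a whole family is done, so its subtree can go
        path.pop()
    return families


sample = io.BytesIO(
    b'<?xml version="1.0" encoding="UTF-8" ?>'
    b"<families>"
    b"<family><name>Simpson</name><members>"
    b"<name>Homer</name><name>Marge</name><name>Bart</name></members></family>"
    b"<family><name>Griffin</name><members>"
    b"<name>Peter</name><name>Brian</name><name>Meg</name></members></family>"
    b"</families>"
)
print(family_members(sample))
```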

Why is lxml.etree.iterparse() eating up all my memory?

左心房为你撑大大i posted on 2019-11-27 07:39:16
This eventually consumes all my available memory, and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags, but that didn't make a difference. What am I doing wrong, and how can I process this large file with iterparse()?

    import lxml.etree

    for schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
        print "why does this consume all my memory?"

I can easily cut it up and process it in smaller chunks, but that's uglier than I'd like.

As iterparse iterates over the entire file, a tree is built and no elements are freed. The advantage of doing this is
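
Editor's sketch of the usual way to free elements while iterating with lxml (a third-party package, assumed to be installed). The function name count_schedules and the in-memory sample are hypothetical. After each matched element is handled, it is cleared, and the siblings that precede it are deleted from the parent so the partially built tree does not keep growing.

```python
import io

from lxml import etree  # third-party; assumed available


def count_schedules(source, tag="schedule"):
    """Count matching elements while releasing each one after use."""
    count = 0
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        count += 1        # real per-element processing would go here
        elem.clear()      # drop the element's own text, attributes, children
        # Delete earlier siblings, including elements the tag filter skipped.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return count


sample = io.BytesIO(b"<root><schedule id='1'/><other/><schedule id='2'/></root>")
print(count_schedules(sample))
```

Ancestors of the matched element are not removed here; with very deep documents they may need similar treatment.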