iterparse

ElementTree iterparse strategy

Submitted by 杀马特。学长 韩版系。学妹 on 2019-11-27 00:18:12
Question: I have to handle large XML documents (up to 1 GB) and parse them with Python. I am using the iterparse() function (SAX-style parsing). My concern is the following. Imagine you have an XML document like this:

    <?xml version="1.0" encoding="UTF-8" ?>
    <families>
      <family>
        <name>Simpson</name>
        <members>
          <name>Homer</name>
          <name>Marge</name>
          <name>Bart</name>
        </members>
      </family>
      <family>
        <name>Griffin</name>
        <members>
          <name>Peter</name>
          <name>Brian</name>
          <name>Meg</name>
        </members>
      </family>
      <
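One possible strategy for a document shaped like this, sketched under assumptions: wait for the "end" event of each `<family>`, read its name and members from the completed subtree, then discard it. The sketch uses the standard-library `xml.etree.ElementTree` rather than any particular library the asker had in mind, the helper name `iter_families` is hypothetical, and because the sample in the question is cut off, the closing tags in `sample` below are assumed.

```python
import io
import xml.etree.ElementTree as ET


def iter_families(source):
    """Yield (family_name, [member_names]) for each <family> element."""
    # Request "start" events as well, so the first event hands us the root.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "family":
            # At the "end" event the whole <family> subtree is available.
            family_name = elem.findtext("name")            # direct child only
            members = [m.text for m in elem.findall("members/name")]
            yield family_name, members
            # Drop the finished subtree so memory use stays roughly flat.
            root.clear()


# Assumed sample: the question's XML with its truncated tail closed off.
sample = b"""<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members><name>Homer</name><name>Marge</name><name>Bart</name></members>
  </family>
  <family>
    <name>Griffin</name>
    <members><name>Peter</name><name>Brian</name><name>Meg</name></members>
  </family>
</families>"""

for name, members in iter_families(io.BytesIO(sample)):
    print(name, members)
```

Using the paths `name` and `members/name` relative to each `<family>` keeps the family name separate from the member names, even though both use the same tag.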

Why is lxml.etree.iterparse() eating up all my memory?

Submitted by 半世苍凉 on 2019-11-26 13:44:54
Question: This eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags, but that didn't make a difference. What am I doing wrong, and how can I process this large file with iterparse()?

    import lxml.etree

    for schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
        print("why does this consume all my memory?")

I can easily cut it up and process it in smaller chunks, but that's uglier than I'd like.

Answer 1: As
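The answer is cut off here, so the following is not a reconstruction of it. A commonly cited cause of this behaviour is that iterparse() still builds the whole tree unless processed elements are released. The sketch below shows that idea with the standard-library `xml.etree.ElementTree` (an assumption made so the example runs without third-party packages; the stdlib version has no `tag=` keyword, so the tag is filtered by hand). The function name `count_elements` and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count elements named `tag`, discarding each one once it is handled."""
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1    # stand-in for real per-element processing
            elem.clear()  # release the element's children, text and attributes
            root.clear()  # detach finished elements the root still references
    return count


# Hypothetical miniature of 'really-big-file.xml'.
doc = b"<root><schedule/><schedule/><other/><schedule/></root>"
print(count_elements(io.BytesIO(doc), "schedule"))
```

With lxml, the same idea is usually expressed by calling `elem.clear()` and then deleting already-processed siblings through `elem.getprevious()` and `elem.getparent()`, since clearing an element alone leaves its empty shell attached to the parent.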

using lxml and iterparse() to parse a big (+- 1Gb) XML file

Submitted by 橙三吉。 on 2019-11-26 09:01:21
Question: I have to parse a 1 GB XML file with a structure such as the one below and extract the text within the tags "Author" and "Content":

    <Database>
      <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
      </BlogPost>
      <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.<
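A minimal sketch of one way to pull the "Author" and "Content" text out of such a file without holding it all in memory, streaming the pairs straight to CSV. It assumes the standard-library `xml.etree.ElementTree` instead of lxml so that it runs as-is; the function name `export_posts`, the CSV output, and the shortened sample document are illustrative assumptions, not part of the question.

```python
import csv
import io
import xml.etree.ElementTree as ET


def export_posts(source, out):
    """Write one (Author, Content) CSV row per <BlogPost>; return the row count."""
    writer = csv.writer(out)
    writer.writerow(["Author", "Content"])
    rows = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # keep a handle on <Database> for cleanup
    for event, elem in context:
        if event == "end" and elem.tag == "BlogPost":
            # Both children are complete once the <BlogPost> end tag is seen.
            writer.writerow([elem.findtext("Author"), elem.findtext("Content")])
            rows += 1
            root.clear()  # discard the processed post
    return rows


# Assumed sample with the question's structure and shortened content.
xml_doc = b"""<Database>
  <BlogPost>
    <Date>MM/DD/YY</Date>
    <Author>Last Name, Name</Author>
    <Content>Lorem ipsum</Content>
  </BlogPost>
  <BlogPost>
    <Date>MM/DD/YY</Date>
    <Author>Other, Author</Author>
    <Content>Dolor sit amet</Content>
  </BlogPost>
</Database>"""

buffer = io.StringIO()
written = export_posts(io.BytesIO(xml_doc), buffer)
print(written)
print(buffer.getvalue())
```

Writing each row as soon as its `<BlogPost>` closes means nothing but the current post needs to stay in memory; for a real file, `out` would be a file opened with `newline=""` as the `csv` module recommends.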