I need to parse a very large (~40GB) XML file, remove certain elements from it, and write the result to a new xml file. I\'ve been trying to use iterparse from python\'s Ele
Perhaps the answer to my similar question can help you out.
As for how to write this back to an .xml file, I ended up doing this at the bottom of my script:
with open('File.xml', 'w') as t: # I'd suggest using a different file name here than your original
for line in ET.tostring(doc):
t.write(line)
t.close
print('File.xml Complete') # Console message that file wrote successfully, can be omitted
The variable doc
is from earlier on in my script, comparable to where you have tree = ET.iterparse("sample.xml")
I have this:
doc = ET.parse(filename)
I've been using lxml instead of ElementTree but I think the write out part should still work (I think it's mainly just xpath stuff that ElementTree can't handle.) I'm using lxml imported with this line:
from lxml import etree as ET
Hopefully this (along with my linked question for some additional code context if you need it) can help you out!
If you have a large xml that doesn't fit in memory then you could try to serialize it one element at a time. For example, assuming <root><page/><page/><page/>...</root>
document structure and ignoring possible namespace issues:
import xml.etree.cElementTree as etree
def getelements(filename_or_file, tag):
context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
_, root = next(context) # get root element
for event, elem in context:
if event == 'end' and elem.tag == tag:
yield elem
root.clear() # free memory
with open('output.xml', 'wb') as file:
# start root
file.write(b'<root>')
for page in getelements('sample.xml', 'page'):
if keep(page):
file.write(etree.tostring(page, encoding='utf-8'))
# close root
file.write(b'</root>')
where keep(page)
returns True
if page
should be kept e.g.:
import re
def keep(page):
# all <revision> elements must have 20xx in them
return all(re.search(r'20\d\d', rev.text)
for rev in page.iterfind('revision'))
For comparison, to modify a small xml file, you could:
# parse small xml
tree = etree.parse('sample.xml')
# remove some root/page elements from xml
root = tree.getroot()
for page in root.findall('page'):
if not keep(page):
root.remove(page) # modify inplace
# write to a file modified xml tree
tree.write('output.xml', encoding='utf-8')