Using Python ElementTree's iterparse function and writing the modified tree to an output file

萌比男神i 2021-01-05 00:25

I need to parse a very large (~40GB) XML file, remove certain elements from it, and write the result to a new XML file. I've been trying to use iterparse from Python's Ele

2 Answers
  • 2021-01-05 00:38

    Perhaps the answer to my similar question can help you out.

    As for how to write this back to an .xml file, I ended up doing this at the bottom of my script:

    with open('File.xml', 'wb') as t:  # use a different file name than your original
        t.write(ET.tostring(doc))  # tostring() returns bytes, hence binary mode
    print('File.xml Complete')  # console message that the file was written; can be omitted
    

    The variable doc comes from earlier in my script. Where you have tree = ET.iterparse("sample.xml"), I have this:

    doc = ET.parse(filename)
    

    I've been using lxml instead of ElementTree, but I think the write-out part should still work (I think it's mainly full XPath support that ElementTree lacks). I import lxml with this line:

    from lxml import etree as ET
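
    As a rough check of the claim that the write-out part also works without lxml, here is a minimal sketch using only the standard library. The sample document and the roundtrip helper are made up for illustration. Note that ET.parse loads the whole tree into memory, so this suits small files, not a 40GB one.

```python
import os
import tempfile
import xml.etree.ElementTree as ET  # standard library, not lxml

# Hypothetical sample document, used only for this illustration.
SAMPLE = "<root><page>a</page><page>b</page></root>"


def roundtrip(xml_text):
    """Write xml_text to a temp file, parse it, and write it back out."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.xml")
        dst = os.path.join(tmp, "File.xml")
        with open(src, "w", encoding="utf-8") as f:
            f.write(xml_text)

        doc = ET.parse(src)  # builds the full tree in memory
        # ElementTree.write() serializes straight to a file,
        # so no tostring() call is needed here.
        doc.write(dst, encoding="utf-8", xml_declaration=True)

        with open(dst, "rb") as f:
            return f.read()


result = roundtrip(SAMPLE)
print(result.decode("utf-8"))
```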
    

    Hopefully this (along with my linked question for some additional code context if you need it) can help you out!

  • 2021-01-05 00:51

    If you have a large XML file that doesn't fit in memory, you could try to serialize it one element at a time. For example, assuming a <root><page/><page/><page/>...</root> document structure and ignoring possible namespace issues:

    import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9
    
    def getelements(filename_or_file, tag):
        context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
        _, root = next(context) # get root element
        for event, elem in context:
            if event == 'end' and elem.tag == tag:
                yield elem
                root.clear() # free memory
    
    with open('output.xml', 'wb') as file:
        # start root
        file.write(b'<root>')
    
        for page in getelements('sample.xml', 'page'):
            if keep(page):
                file.write(etree.tostring(page, encoding='utf-8'))
    
        # close root
        file.write(b'</root>')
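
    The code above ignores namespaces. If the document declares one, ElementTree reports tags in the form {uri}page, so a plain comparison with 'page' never matches. A minimal sketch of one way around that, comparing local names only; the namespace URI and the helper names local_name and getelements_ns are made up for this example:

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical namespaced input; the URI is invented for this example.
SAMPLE = (
    b'<root xmlns="http://example.com/ns">'
    b"<page>one</page><page>two</page>"
    b"</root>"
)


def local_name(tag):
    """Return the tag without its '{uri}' prefix, if it has one."""
    return tag.rsplit("}", 1)[-1]


def getelements_ns(source, tag):
    """Like getelements(), but matches on the local tag name only."""
    context = iter(etree.iterparse(source, events=("start", "end")))
    _, root = next(context)  # get root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == tag:
            yield elem
            root.clear()  # free memory


tags = [page.tag for page in getelements_ns(io.BytesIO(SAMPLE), "page")]
print(tags)
```

    When writing such elements back out, etree.tostring emits a generated prefix such as ns0: unless the namespace has been registered with etree.register_namespace, and the hard-coded <root> tag would need the same xmlns declaration.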
    

    where keep(page) returns True if the page should be kept, e.g.:

    import re
    
    def keep(page):
        # all <revision> elements must have 20xx in them
        return all(re.search(r'20\d\d', rev.text or '')  # rev.text is None for empty elements
                   for rev in page.iterfind('revision'))
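
    To try the streaming approach without a 40GB file, the pieces can be put together in one runnable sketch. The sample data and the filter_pages wrapper are made up for illustration, and in-memory buffers stand in for the input and output files:

```python
import io
import re
import xml.etree.ElementTree as etree

# Hypothetical sample input; a real file would be passed by name instead.
SAMPLE = (
    b"<root>"
    b"<page><revision>2009-01-01</revision></page>"
    b"<page><revision>1999-12-31</revision></page>"
    b"<page><revision>2015-06-30</revision><revision>2016-01-01</revision></page>"
    b"</root>"
)


def getelements(source, tag):
    """Yield each finished <tag> element, then drop it from the root."""
    context = iter(etree.iterparse(source, events=("start", "end")))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            root.clear()  # free memory held by already-processed pages


def keep(page):
    """Keep a page only if every <revision> contains 20xx."""
    return all(re.search(r"20\d\d", rev.text or "")
               for rev in page.iterfind("revision"))


def filter_pages(source, out):
    """Copy the pages that pass keep() from source to out, one at a time."""
    out.write(b"<root>")
    for page in getelements(source, "page"):
        if keep(page):
            out.write(etree.tostring(page, encoding="utf-8"))
    out.write(b"</root>")


out = io.BytesIO()
filter_pages(io.BytesIO(SAMPLE), out)
result = out.getvalue()
print(result.decode("utf-8"))
```

    Only one page is held in memory at a time, which is why the output root tags are written by hand instead of through a tree object.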
    

    For comparison, to modify a small xml file, you could:

    # parse small xml
    tree = etree.parse('sample.xml')
    
    # remove some root/page elements from xml
    root = tree.getroot()
    for page in root.findall('page'):
        if not keep(page):
            root.remove(page) # modify inplace
    
    # write the modified XML tree to a file
    tree.write('output.xml', encoding='utf-8')
    