lxml memory usage when parsing huge xml in python

北海茫月 2020-12-29 17:33

I am a Python newbie. I am trying to parse a huge XML file in my Python module using lxml. In spite of clearing the elements at the end of each loop, my memory shoots up and

1 answer
  • 2020-12-29 18:01

    Welcome to Python and Stack Overflow!

    It looks like you've followed some good advice in looking at lxml, and especially etree.iterparse(..), but I think your implementation approaches the problem from the wrong angle. The idea of iterparse(..) is to get away from collecting and storing data, and instead to process tags as they are read in. Your readAllChildren(..) function saves everything to rowList, which grows and grows until it covers the whole document tree. I made a few changes to show what's going on:

    from lxml import etree
    def parseXml(context,attribList):
        for event, element in context:
            print "%s element: %s" % (event, element)
            fieldMap = {}
            rowList = []
            readAttribs(element, fieldMap, attribList)
            readAllChildren(element, fieldMap, attribList, rowList)
            for row in rowList:
                yield row
            element.clear()
    
    def readAttribs(element, fieldMap, attribList):
        for attrib in attribList:
            fieldMap[attrib] = element.get(attrib,'')
        print "fieldMap:", fieldMap
    
    def readAllChildren(element, fieldMap, attribList, rowList):
        for childElem in element:
            print "Found child:", childElem
            readAttribs(childElem, fieldMap, attribList)
            if len(childElem) > 0:
                readAllChildren(childElem, fieldMap, attribList, rowList)
            rowList.append(fieldMap.copy())
            print "len(rowList) =", len(rowList)
            childElem.clear()
    
    def process_xml_original(xml_file):
        attribList=['name','age','id']
        context=etree.iterparse(xml_file, events=("start",))
        for row in parseXml(context,attribList):
            print "Row:", row
    

    Running with some dummy data:

    >>> from cStringIO import StringIO
    >>> test_xml = """\
    ... <family>
    ...     <person name="somebody" id="5" />
    ...     <person age="45" />
    ...     <person name="Grandma" age="62">
    ...         <child age="35" id="10" name="Mom">
    ...             <grandchild age="7 and 3/4" />
    ...             <grandchild id="12345" />
    ...         </child>
    ...     </person>
    ...     <something-completely-different />
    ... </family>
    ... """
    >>> process_xml_original(StringIO(test_xml))
    start element: <Element family at 0x105ca58>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    Found child: <Element person at 0x105ca80>
    fieldMap: {'age': '', 'name': 'somebody', 'id': '5'}
    len(rowList) = 1
    Found child: <Element person at 0x105c468>
    fieldMap: {'age': '45', 'name': '', 'id': ''}
    len(rowList) = 2
    Found child: <Element person at 0x105c7b0>
    fieldMap: {'age': '62', 'name': 'Grandma', 'id': ''}
    Found child: <Element child at 0x106e468>
    fieldMap: {'age': '35', 'name': 'Mom', 'id': '10'}
    Found child: <Element grandchild at 0x106e148>
    fieldMap: {'age': '7 and 3/4', 'name': '', 'id': ''}
    len(rowList) = 3
    Found child: <Element grandchild at 0x106e490>
    fieldMap: {'age': '', 'name': '', 'id': '12345'}
    len(rowList) = 4
    len(rowList) = 5
    len(rowList) = 6
    Found child: <Element something-completely-different at 0x106e4b8>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    len(rowList) = 7
    Row: {'age': '', 'name': 'somebody', 'id': '5'}
    Row: {'age': '45', 'name': '', 'id': ''}
    Row: {'age': '7 and 3/4', 'name': '', 'id': ''}
    Row: {'age': '', 'name': '', 'id': '12345'}
    Row: {'age': '', 'name': '', 'id': '12345'}
    Row: {'age': '', 'name': '', 'id': '12345'}
    Row: {'age': '', 'name': '', 'id': ''}
    start element: <Element person at 0x105ca80>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    start element: <Element person at 0x105c468>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    start element: <Element person at 0x105c7b0>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    start element: <Element child at 0x106e468>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    start element: <Element grandchild at 0x106e148>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    start element: <Element grandchild at 0x106e490>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    start element: <Element something-completely-different at 0x106e4b8>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    

    It's a little hard to read, but you can see that on the first pass it walks the whole tree down from the root tag, building up rowList for every element in the entire document. It doesn't even stop there: because the element.clear() call comes after the yield statement in parseXml(..), it isn't executed until the generator is resumed for the second iteration (i.e. the next element in the tree).
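
    To see why code placed after a yield runs later than you might expect, here is a minimal sketch in Python 3 syntax. The names rows_with_cleanup and log are made up for illustration and are not part of your code; the "cleanup" line stands in for element.clear().

```python
def rows_with_cleanup(items, log):
    """Yield each item, then record a 'cleanup' step placed after the yield."""
    for item in items:
        yield item
        # This line runs only when the consumer asks for the NEXT item
        # (or exhausts the generator), not when the current item is handed out.
        log.append("cleaned %s" % item)


log = []
gen = rows_with_cleanup(["a", "b"], log)

first = next(gen)
log_after_first = list(log)   # cleanup for "a" has not run yet

second = next(gen)
log_after_second = list(log)  # cleanup for "a" ran when the generator resumed
```

    After the first next(..) call the log is still empty; the cleanup for "a" only shows up once the second item is requested.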

    Incremental processing FTW

    A simple fix is to let iterparse(..) do its job: parse iteratively! The following will pull the same information and process it incrementally instead:

    def do_something_with_data(data):
        """This just prints it out. Yours will probably be more interesting."""
        print "Got data: ", data
    
    def process_xml_iterative(xml_file):
        # by using the default 'end' event, you start at the _bottom_ of the tree
        ATTRS = ('name', 'age', 'id')
        for event, element in etree.iterparse(xml_file):
            print "%s element: %s" % (event, element)
            data = {}
            for attr in ATTRS:
                data[attr] = element.get(attr, u"")
            do_something_with_data(data)
            element.clear()
            del element # for extra insurance
    

    Running on the same dummy XML:

    >>> print test_xml
    <family>
        <person name="somebody" id="5" />
        <person age="45" />
        <person name="Grandma" age="62">
            <child age="35" id="10" name="Mom">
                <grandchild age="7 and 3/4" />
                <grandchild id="12345" />
            </child>
        </person>
        <something-completely-different />
    </family>
    >>> process_xml_iterative(StringIO(test_xml))
    end element: <Element person at 0x105cc10>
    Got data:  {'age': u'', 'name': 'somebody', 'id': '5'}
    end element: <Element person at 0x106e468>
    Got data:  {'age': '45', 'name': u'', 'id': u''}
    end element: <Element grandchild at 0x106e148>
    Got data:  {'age': '7 and 3/4', 'name': u'', 'id': u''}
    end element: <Element grandchild at 0x106e490>
    Got data:  {'age': u'', 'name': u'', 'id': '12345'}
    end element: <Element child at 0x106e508>
    Got data:  {'age': '35', 'name': 'Mom', 'id': '10'}
    end element: <Element person at 0x106e530>
    Got data:  {'age': '62', 'name': 'Grandma', 'id': u''}
    end element: <Element something-completely-different at 0x106e558>
    Got data:  {'age': u'', 'name': u'', 'id': u''}
    end element: <Element family at 0x105c6e8>
    Got data:  {'age': u'', 'name': u'', 'id': u''}
    

    This should greatly improve both the speed and memory performance of your script. Also, because an 'end' event fires only once an element and all of its children have been fully parsed, you're free to clear and delete each element as soon as you've handled it.

    Depending on your dataset, it might be a good idea to process only certain types of elements. The root element, for one, probably isn't very meaningful, and other nested elements may also fill your dataset with a lot of empty rows like {'age': u'', 'id': u'', 'name': u''}.
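
    A rough sketch of that filtering, in Python 3 and again with xml.etree.ElementTree standing in for lxml.etree. The set of wanted tags is a hypothetical choice for the dummy XML, and iter_rows is a made-up name; adjust both to your data.

```python
import io
import xml.etree.ElementTree as etree  # stand-in for lxml.etree in this sketch

ATTRS = ("name", "age", "id")
WANTED_TAGS = {"person", "child"}  # hypothetical: the tags worth turning into rows

SAMPLE_XML = b"""<family>
    <person name="somebody" id="5" />
    <person age="45" />
    <person name="Grandma" age="62">
        <child age="35" id="10" name="Mom">
            <grandchild age="7 and 3/4" />
        </child>
    </person>
    <something-completely-different />
</family>"""


def iter_rows(xml_file, wanted_tags=WANTED_TAGS):
    """Yield one dict per wanted element, skipping rows with no attribute values."""
    for event, element in etree.iterparse(xml_file):  # default: 'end' events only
        row = None
        if element.tag in wanted_tags:
            row = {attr: element.get(attr, "") for attr in ATTRS}
        # Clear before yielding so the cleanup is not deferred to the next iteration.
        element.clear()
        if row and any(row.values()):
            yield row


rows = list(iter_rows(io.BytesIO(SAMPLE_XML)))
```

    With the sample above this produces four rows (two plain person elements, then Mom, then Grandma) and silently drops the root, the grandchild and the unrelated element.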


    Or, with SAX

    As an aside, when I read "XML" and "low-memory" my mind always jumps straight to SAX, which is another way you could attack this problem. Using the built-in xml.sax module:

    import xml.sax
    
    class AttributeGrabber(xml.sax.handler.ContentHandler):
        """SAX Handler which will store selected attribute values."""
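        def __init__(self, target_attrs=()):
            # ContentHandler is an old-style class in Python 2, so call its __init__ directly
            xml.sax.handler.ContentHandler.__init__(self)
            self.target_attrs = target_attrs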
    
        def startElement(self, name, attrs):
            print "Found element: ", name
            data = {}
            for target_attr in self.target_attrs:
                data[target_attr] = attrs.get(target_attr, u"")
    
            # (no xml trees or elements created at all)
            do_something_with_data(data)
    
    def process_xml_sax(xml_file):
        grabber = AttributeGrabber(target_attrs=('name', 'age', 'id'))
        xml.sax.parse(xml_file, grabber)
    

    You'll have to evaluate both options to see which works best in your situation (and maybe run a couple of benchmarks, if this is something you'll be doing often).
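
    If you do want to benchmark, something along these lines could be a starting point. It is a Python 3 sketch that times both approaches on a synthetic document; the data is made up, xml.etree.ElementTree stands in for lxml.etree, and all helper names (make_test_xml, time_call and so on) are mine. Real numbers will depend heavily on your files, so I'm not quoting any.

```python
import io
import time
import xml.sax
import xml.etree.ElementTree as etree  # stand-in for lxml.etree in this sketch
from xml.sax.handler import ContentHandler


def make_test_xml(n_people):
    """Build a synthetic document with n_people <person> elements (made-up data)."""
    people = "".join(
        '<person name="p%d" age="%d" id="%d" />' % (i, i % 90, i)
        for i in range(n_people)
    )
    return ("<family>%s</family>" % people).encode("utf-8")


def count_with_iterparse(xml_bytes):
    """Count elements carrying a name attribute, clearing each one after use."""
    count = 0
    for event, element in etree.iterparse(io.BytesIO(xml_bytes)):
        if element.get("name"):
            count += 1
        element.clear()
    return count


class NameCounter(ContentHandler):
    """SAX handler that counts elements carrying a name attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if attrs.get("name"):
            self.count += 1


def count_with_sax(xml_bytes):
    handler = NameCounter()
    xml.sax.parse(io.BytesIO(xml_bytes), handler)
    return handler.count


def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


xml_bytes = make_test_xml(20000)
iterparse_count, iterparse_seconds = time_call(count_with_iterparse, xml_bytes)
sax_count, sax_seconds = time_call(count_with_sax, xml_bytes)
print("iterparse: %d rows in %.3fs" % (iterparse_count, iterparse_seconds))
print("sax:       %d rows in %.3fs" % (sax_count, sax_seconds))
```

    A single timed call is noisy, so for anything you plan to rely on, repeat the runs and test against a file that looks like your real input.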


    Be sure to follow up with how things work out!


    Edit based on follow-up comments

    Implementing either of the above solutions may require some changes to the overall structure of your code, but anything you have now should still be doable. For instance, to process "rows" in batches, you could use:

    def process_xml_batch(xml_file, batch_size=10):
        ATTRS = ('name', 'age', 'id')
        batch = []
        for event, element in etree.iterparse(xml_file):
            data = {}
            for attr in ATTRS:
                data[attr] = element.get(attr, u"")
            batch.append(data)
            element.clear()
            del element
    
            if len(batch) == batch_size:
                do_something_with_batch(batch)
                # Or, if you want this to be a generator:
                # yield batch
                batch = []
        if batch:
            # there are leftover items
            do_something_with_batch(batch) # Or, yield batch
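
    If you go the generator route, it may be cleaner to split parsing from batching. Here is a minimal Python 3 sketch of that idea, with xml.etree.ElementTree standing in for lxml.etree; iter_rows and iter_batches are names I made up, and the tiny sample document is only there to make the sketch runnable.

```python
import io
import xml.etree.ElementTree as etree  # stand-in for lxml.etree in this sketch

ATTRS = ("name", "age", "id")


def iter_rows(xml_file):
    """Yield one attribute dict per element, clearing each element first."""
    for event, element in etree.iterparse(xml_file):
        row = {attr: element.get(attr, "") for attr in ATTRS}
        element.clear()
        yield row


def iter_batches(rows, batch_size=10):
    """Group any iterable of rows into lists of at most batch_size items."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # there are leftover items
        yield batch


sample = b'<family><person id="1"/><person id="2"/><person id="3"/></family>'
batches = list(iter_batches(iter_rows(io.BytesIO(sample)), batch_size=3))
```

    The sample yields four rows (three person elements plus the root), so with batch_size=3 you get one full batch and one leftover batch holding the empty root row.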
    