Iterparse big XML, with low memory footprint, and get all, even nested, Sequence Elements

|▌冷眼眸甩不掉的悲伤 提交于 2019-11-29 18:03:26

Comment: As it now only outputs results

Outputing results are only for demonstration, tracing and debuging.
To write a record and addresses into a SQL database, for example using sqlite3, do:

c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)
addresses = []
for addr in record['addresses']:
    addr[1].update({'id': record['id']})
    addresses.append(addr[1])
c.executemany("INSERT INTO adresses(id, address, city) VALUES(:id, :address, :city)", addresses)

To flatten for pandas
Preconditon outside the loop: df = pd.DataFrame()

from copy import copy

addresses = copy(record['addresses'])
del record['addresses']

df_records = []
for addr in addresses:
    record.update(addr[1])
    df_records.append(record)

df = df.append(df_records, ignore_index=True)

Question: Use etree.iterparse to include all nodes in XML file

The following class Entity do:

  • Parse the XML File using lxml.etree.iterparse.
  • There is no File size limit, as the <entity>...</entity> Element Tree are deleted after processing.
  • Builds from every <entity>...</entity> Tree a dict {tag, value, ...}.
  • Using of generator objects to yield the dict.
  • Sequence Elements, e.g. <addresses>/<address> are List of Tuple [(address, {tag, text})....

ToDo:

  • To flatten into many Records, loop record['addresses']
  • To equal different tag names: address and address1
  • To flatten, Sequence tags, e.g. <titels>, <probs> and <dobs>

from lxml import etree

class Entity:
    def __init__(self, fh):
        """
        Initialize 'iterparse' to only generate 'end' events on tag '<entity>'

        :param fh: File Handle from the XML File to parse
        """
        self.context = etree.iterparse(fh, events=("end",), tag=['entity'])

    def _parse(self):
        """
        Parse the XML File for all '<entity>...</entity>' Elements
        Clear/Delete the Element Tree after processing

        :return: Yield the current '<entity>...</entity>' Element Tree
        """
        for event, elem in self.context:
            yield elem

            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

    def sequence(self, elements):
        """
        Expand a Sequence Element, e.g. <titels> to a Tuple ('titel', text).
        If found a nested Sequence Element, e.g. <address>,
          to a Tuple ('address', {tag, text})

        :param elements: The Sequence Element
        :return: List of Tuple [(tag1, value), (tag2, value), ... ,(tagn, value))
        """
        _elements = []
        for elem in elements:
            if len(elem):
                _elements.append((elem.tag, dict(self.sequence(elem))))
            else:
                _elements.append((elem.tag, elem.text))

        return _elements

    def __iter__(self):
        """
        Iterate all '<entity>...</entity>' Element Trees yielded from self._parse()

        :return: Dict var 'entity' {tag1, value, tag2, value, ... ,tagn, value}}
        """
        for xml_entity in self._parse():
            entity = {'id': xml_entity.attrib['id']}

            for elem in xml_entity:
                # if elem is Sequence
                if len(elem):
                    # Append tuple(tag, value)
                    entity[elem.tag] = self.sequence(elem)
                else:
                    entity[elem.tag] = elem.text

            yield entity

if __name__ == "__main__":
    with open('.\\FILE.XML', 'rb') as in_xml_
        for record in Entity(in_xml):
            print("record:{}".format(record))

            for key, value in record.items():
                if isinstance(value, (list)):
                    #print_list(key, value)
                    print("{}:{}".format(key, value))
                else:
                    print("{}:{}".format(key, value))

Output: Shows only the first Record and only 4 fields.
Note: There is a pitfall with unique tag names: address and address1

record:{'id': '1124353', 'titles': {'title': 'Foot... (omitted for brevity)
id:1124353
name:DAVID, Beckham
titles:[('title', 'Football player')]
addresses:
    address:{'city': 'London', 'address': None, 'post... (omitted for brevity)
    address:{'city': 'London', 'address1': '35-37 Par... (omitted for brevity)

Tested with Python: 3.5 - lxml.etree: 3.7.1

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!