Iterparse big XML, with low memory footprint, and get all, even nested, Sequence Elements

攒了一身酷 2020-12-22 10:15

I have written a small Python script to parse XML data, based on Liza Daly's blog post. However, my code does not parse all the nodes. So for example w

1 Answer
  • 2020-12-22 10:32

    Comment: As it now only outputs results

    Printing the results is only for demonstration, tracing and debugging.
    To write a record and its addresses into a SQL database, for example using sqlite3, do:

    c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)
    addresses = []
    for addr in record['addresses']:
        addr[1].update({'id': record['id']})
        addresses.append(addr[1])
    c.executemany("INSERT INTO addresses(id, address, city) VALUES(:id, :address, :city)", addresses)
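
The snippet above assumes an open cursor `c` and existing tables. A minimal, self-contained sketch of the same idea follows. The sample record, the in-memory database and the table schema are assumptions made for illustration; only the column names come from the INSERT statements. It uses `.get()` so that an address without an `address` key is stored as NULL instead of raising an error.

```python
import sqlite3

# Hypothetical sample record, shaped like the dicts yielded by Entity.
record = {
    "id": "1",
    "name": "Example, Name",
    "addresses": [
        ("address", {"address": "1 Main St", "city": "London"}),
        ("address", {"address1": "2 Side St", "city": "Paris"}),
    ],
}

conn = sqlite3.connect(":memory:")
c = conn.cursor()
# Assumed schema; adjust the column types to the real data.
c.execute("CREATE TABLE entity(id TEXT PRIMARY KEY, name TEXT)")
c.execute("CREATE TABLE addresses(id TEXT, address TEXT, city TEXT)")

# Bind only the columns the statement names.
c.execute(
    "INSERT INTO entity(id, name) VALUES(:id, :name)",
    {"id": record["id"], "name": record["name"]},
)

# One row per address; a missing key becomes None, which sqlite3 stores as NULL.
rows = [
    {"id": record["id"], "address": fields.get("address"), "city": fields.get("city")}
    for _tag, fields in record["addresses"]
]
c.executemany(
    "INSERT INTO addresses(id, address, city) VALUES(:id, :address, :city)", rows
)
conn.commit()

stored = c.execute(
    "SELECT id, address, city FROM addresses ORDER BY city"
).fetchall()
```

For a large file, committing once after many records rather than after every record is usually faster.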
    

    To flatten for pandas
    Precondition outside the loop: import pandas as pd and df = pd.DataFrame()

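    # Take the nested list out of the record so only flat fields remain.
    addresses = record.pop('addresses')
    
    df_records = []
    for addr in addresses:
        # Build a new dict per address; appending `record` itself
        # would make every row point to the same dict.
        row = dict(record)
        row.update(addr[1])
        df_records.append(row)
    
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    df = pd.concat([df, pd.DataFrame(df_records)], ignore_index=True)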
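
The flattening step itself does not depend on pandas. Below is a minimal sketch in plain Python, assuming records shaped like the ones yielded by Entity; the helper name `flatten_record` and the sample data are hypothetical.

```python
def flatten_record(record, sequence_key="addresses"):
    """Return one flat dict per nested item found under `sequence_key`."""
    # Everything except the nested list is repeated on each row.
    base = {k: v for k, v in record.items() if k != sequence_key}
    nested = record.get(sequence_key) or []
    if not nested:
        return [base]

    rows = []
    for _tag, fields in nested:
        row = dict(base)      # fresh copy, so rows do not share state
        row.update(fields)
        rows.append(row)
    return rows


# Hypothetical sample record.
record = {
    "id": "1",
    "name": "Example, Name",
    "addresses": [
        ("address", {"city": "London", "address": "1 Main St"}),
        ("address", {"city": "Paris", "address": "2 Side St"}),
    ],
}

rows = flatten_record(record)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)`. Collecting the rows of all records in one list and building the DataFrame once at the end avoids growing a DataFrame inside the loop.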
    

    Question: Use etree.iterparse to include all nodes in XML file

    The following class Entity does the following:

    • Parses the XML File using lxml.etree.iterparse.
    • Imposes no File size limit, as each <entity>...</entity> Element Tree is deleted after processing.
    • Builds a dict {tag: value, ...} from every <entity>...</entity> Tree.
    • Uses generator objects to yield the dict.
    • Turns Sequence Elements, e.g. <addresses>/<address>, into a List of Tuple [(address, {tag: text, ...}), ...].

    ToDo:

    • To flatten into many Records, loop over record['addresses']
    • To unify differing tag names: address and address1
    • To flatten Sequence tags, e.g. <titles>, <probs> and <dobs>
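
The second and third ToDo items could be approached as in the following sketch. The alias table, the helper names `unify_tags` and `join_sequence`, and the "; " separator are assumptions for illustration, not part of the class below.

```python
# Hypothetical alias table: maps variant tag names to one canonical name.
TAG_ALIASES = {"address1": "address"}


def unify_tags(fields, aliases=TAG_ALIASES):
    """Rename aliased keys; the first non-None value for a name is kept."""
    unified = {}
    for tag, value in fields.items():
        canonical = aliases.get(tag, tag)
        if unified.get(canonical) is None:
            unified[canonical] = value
    return unified


def join_sequence(items, sep="; "):
    """Collapse [(tag, text), ...] into one string, skipping empty texts."""
    return sep.join(text for _tag, text in items if text)


# Hypothetical sample values, shaped like the Entity output.
first = unify_tags({"city": "London", "address": None, "address1": "1 Main St"})
titles = join_sequence([("title", "Football player"), ("title", "Coach")])
```

Joining into one string suits a single database column; if each title needs its own row, the same loop used for record['addresses'] applies.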

    from lxml import etree
    
    class Entity:
        def __init__(self, fh):
            """
            Initialize 'iterparse' to only generate 'end' events on tag '<entity>'
    
            :param fh: File Handle from the XML File to parse
            """
            self.context = etree.iterparse(fh, events=("end",), tag=['entity'])
    
        def _parse(self):
            """
            Parse the XML File for all '<entity>...</entity>' Elements
            Clear/Delete the Element Tree after processing
    
            :return: Yield the current '<entity>...</entity>' Element Tree
            """
            for event, elem in self.context:
                yield elem
    
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]
    
        def sequence(self, elements):
            """
            Expand a Sequence Element, e.g. <titles>, to a Tuple ('title', text).
            A nested Sequence Element, e.g. <address>, becomes
              a Tuple ('address', {tag: text, ...})
    
            :param elements: The Sequence Element
            :return: List of Tuple [(tag1, value), (tag2, value), ... ,(tagn, value)]
            """
            _elements = []
            for elem in elements:
                if len(elem):
                    _elements.append((elem.tag, dict(self.sequence(elem))))
                else:
                    _elements.append((elem.tag, elem.text))
    
            return _elements
    
        def __iter__(self):
            """
            Iterate all '<entity>...</entity>' Element Trees yielded from self._parse()
    
            :return: Dict var 'entity' {tag1: value, tag2: value, ... ,tagn: value}
            """
            for xml_entity in self._parse():
                entity = {'id': xml_entity.attrib['id']}
    
                for elem in xml_entity:
                    # if elem is Sequence
                    if len(elem):
                        # Append tuple(tag, value)
                        entity[elem.tag] = self.sequence(elem)
                    else:
                        entity[elem.tag] = elem.text
    
                yield entity
    
    if __name__ == "__main__":
        with open('.\\FILE.XML', 'rb') as in_xml:
            for record in Entity(in_xml):
                print("record:{}".format(record))
    
                for key, value in record.items():
                    if isinstance(value, list):
                        #print_list(key, value)
                        print("{}:{}".format(key, value))
                    else:
                        print("{}:{}".format(key, value))
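
To try the streaming idea without lxml or a real input file, here is a reduced sketch that swaps in the standard library's xml.etree.ElementTree.iterparse. This is not the lxml code above: the stdlib version has no `tag=` filter and no getprevious()/getparent(), so the sketch filters on the tag name itself and clears the root element instead. The sample XML and the function name `iter_entities` are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_entities(source):
    """Yield one dict per <entity>, discarding parsed elements as it goes."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)              # first 'start' event is the root
    for event, elem in context:
        if event == "end" and elem.tag == "entity":
            entity = {"id": elem.get("id")}
            for child in elem:
                if len(child):                # Sequence Element with children
                    entity[child.tag] = [
                        (item.tag,
                         {sub.tag: sub.text for sub in item} if len(item) else item.text)
                        for item in child
                    ]
                else:
                    entity[child.tag] = child.text
            yield entity
            root.clear()                      # drop processed entities from the root


# Hypothetical sample document.
SAMPLE = b"""<entities>
  <entity id="1">
    <name>Example One</name>
    <titles><title>Engineer</title></titles>
    <addresses>
      <address><city>London</city><address1>1 Main St</address1></address>
    </addresses>
  </entity>
  <entity id="2"><name>Example Two</name></entity>
</entities>"""

records = list(iter_entities(io.BytesIO(SAMPLE)))
```

Calling `list()` here is only for the demonstration; with a large file, iterate over the generator so that only one entity is held in memory at a time.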
    

    Output: Shows only the first Record and only 4 of its fields.
    Note: There is a pitfall with inconsistent tag names: address and address1

    record:{'id': '1124353', 'titles': {'title': 'Foot... (omitted for brevity)
    id:1124353
    name:DAVID, Beckham
    titles:[('title', 'Football player')]
    addresses:
        address:{'city': 'London', 'address': None, 'post... (omitted for brevity)
        address:{'city': 'London', 'address1': '35-37 Par... (omitted for brevity)
    

    Tested with Python: 3.5 - lxml.etree: 3.7.1
