I have written a small Python script to parse XML data based on Liza Daly's blog. However, my code does not parse all the nodes. So for example w
Comment: As it now only outputs results
Outputting results is only for demonstration, tracing and debugging.
To write a record and its addresses into a SQL database, for example using sqlite3, do:
    c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)
    addresses = []
    for addr in record['addresses']:
        addr[1].update({'id': record['id']})
        addresses.append(addr[1])
    c.executemany("INSERT INTO adresses(id, address, city) VALUES(:id, :address, :city)", addresses)
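As a self-contained illustration of the insert statements above, the following sketch runs them against an in-memory sqlite3 database. The table layout, the sample record, and the helper name `store_record` are assumptions for demonstration only; the table name `adresses` is kept exactly as spelled in the statement above.

```python
import sqlite3


def store_record(cursor, record):
    """Insert one entity row plus one row per address (hypothetical helper)."""
    # Named placeholders pick 'id' and 'name' from the dict; other keys are ignored.
    cursor.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)
    addresses = []
    for _tag, fields in record['addresses']:
        # Build a new dict so the parsed record is left unchanged,
        # and attach the entity id as the link between the two tables.
        addresses.append(dict(fields, id=record['id']))
    cursor.executemany(
        "INSERT INTO adresses(id, address, city) VALUES(:id, :address, :city)",
        addresses)


# Assumed schema and sample data, for demonstration only.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE entity(id TEXT, name TEXT)")
c.execute("CREATE TABLE adresses(id TEXT, address TEXT, city TEXT)")

record = {
    'id': '1',
    'name': 'Example Name',
    'addresses': [
        ('address', {'address': 'Street 1', 'city': 'London'}),
        ('address', {'address': None, 'city': 'Paris'}),
    ],
}
store_record(c, record)
conn.commit()

rows = c.execute(
    "SELECT id, address, city FROM adresses ORDER BY city").fetchall()
```

Note that every address dict must contain both the `address` and the `city` key for the named placeholders to bind; a dict that only carries `address1` would need to be normalized first.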
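To flatten for pandas
Precondition outside the loop: import pandas as pd and df = pd.DataFrame()
    from copy import copy

    addresses = copy(record['addresses'])
    del record['addresses']
    df_records = []
    for addr in addresses:
        # Copy the record for every address; appending the same dict object
        # repeatedly would make every row show the last address.
        row = copy(record)
        row.update(addr[1])
        df_records.append(row)
    # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions.
    df = pd.concat([df, pd.DataFrame(df_records)], ignore_index=True)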
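One thing to watch when flattening is that every row needs its own dict, otherwise all rows end up pointing at the same object. A minimal sketch without pandas, using a made-up record and a hypothetical helper named `flatten_record`, shows one way to build one flat dict per address:

```python
def flatten_record(record):
    """Return one flat dict per address (hypothetical helper, not a pandas API)."""
    # Everything except the nested address list is shared by all rows.
    base = {key: value for key, value in record.items() if key != 'addresses'}
    rows = []
    # 'addresses' may be missing or None when the element had no children.
    for _tag, fields in record.get('addresses') or []:
        row = dict(base)  # fresh dict per row, so rows do not alias each other
        row.update(fields)
        rows.append(row)
    # An entity without addresses still produces a single row.
    return rows or [base]


# Sample data, for demonstration only.
record = {
    'id': '1',
    'name': 'Example Name',
    'addresses': [
        ('address', {'address': 'Street 1', 'city': 'London'}),
        ('address', {'address': 'Street 2', 'city': 'Paris'}),
    ],
}
rows = flatten_record(record)
```

The resulting list of dicts can be handed to `pd.DataFrame(rows)` directly, and the original `record` is left untouched.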
Question: Use etree.iterparse to include all nodes in XML file
The following class Entity does:
- Parse the XML file using lxml.etree.iterparse.
- The <entity>...</entity> Element Trees are deleted after processing.
- Build from every <entity>...</entity> tree a dict {tag: value, ...}.
- Use generator objects to yield the dict.
- Sequence elements such as <addresses>/<address> become a list of tuples [(address, {tag: text}), ...].

ToDo:
- To flatten into many records, loop over record['addresses'].
- To equalize different tag names: address and address1.
- To flatten sequence tags, e.g. <titles>, <probs> and <dobs>.
    from lxml import etree


    class Entity:
        def __init__(self, fh):
            """
            Initialize 'iterparse' to only generate 'end' events on tag '<entity>'

            :param fh: File Handle from the XML File to parse
            """
            self.context = etree.iterparse(fh, events=("end",), tag=['entity'])

        def _parse(self):
            """
            Parse the XML File for all '<entity>...</entity>' Elements
            Clear/Delete the Element Tree after processing

            :return: Yield the current '<entity>...</entity>' Element Tree
            """
            for event, elem in self.context:
                yield elem

                # Free the processed element and its already handled siblings
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]

        def sequence(self, elements):
            """
            Expand a Sequence Element, e.g. <titles> to a Tuple ('title', text).
            If found a nested Sequence Element, e.g. <address>,
            to a Tuple ('address', {tag: text})

            :param elements: The Sequence Element
            :return: List of Tuple [(tag1, value), (tag2, value), ... ,(tagn, value)]
            """
            _elements = []
            for elem in elements:
                if len(elem):
                    _elements.append((elem.tag, dict(self.sequence(elem))))
                else:
                    _elements.append((elem.tag, elem.text))

            return _elements

        def __iter__(self):
            """
            Iterate all '<entity>...</entity>' Element Trees yielded from self._parse()

            :return: Dict var 'entity' {tag1: value, tag2: value, ... ,tagn: value}
            """
            for xml_entity in self._parse():
                entity = {'id': xml_entity.attrib['id']}

                for elem in xml_entity:
                    # if elem is Sequence
                    if len(elem):
                        # Append tuple(tag, value)
                        entity[elem.tag] = self.sequence(elem)
                    else:
                        entity[elem.tag] = elem.text

                yield entity
    if __name__ == "__main__":
        with open('.\\FILE.XML', 'rb') as in_xml:
            for record in Entity(in_xml):
                print("record:{}".format(record))

                for key, value in record.items():
                    if isinstance(value, list):
                        #print_list(key, value)
                        print("{}:{}".format(key, value))
                    else:
                        print("{}:{}".format(key, value))
Output: Shows only the first Record and only 4 fields.
Note: There is a pitfall with the differing tag names address and address1.
    record:{'id': '1124353', 'titles': {'title': 'Foot... (omitted for brevity)
    id:1124353
    name:DAVID, Beckham
    titles:[('title', 'Football player')]
    addresses:
    address:{'city': 'London', 'address': None, 'post... (omitted for brevity)
    address:{'city': 'London', 'address1': '35-37 Par... (omitted for brevity)
Tested with Python: 3.5 - lxml.etree: 3.7.1
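The remaining ToDo items, equalizing the tag names address and address1 and flattening simple sequence tags, could be sketched as follows. The alias table `TAG_ALIASES`, the helper names `normalize_tags` and `flatten_sequence`, and the choice to join repeated values with `'; '` are all assumptions for illustration; the sample record only mirrors the shape of the records yielded by Entity.

```python
# Assumed alias table: map variant tag names onto one canonical name.
TAG_ALIASES = {'address1': 'address'}


def normalize_tags(fields):
    """Rename aliased keys; an existing non-empty canonical value is kept."""
    result = {}
    for tag, text in fields.items():
        key = TAG_ALIASES.get(tag, tag)
        # Only fill the slot if nothing (or only None) is stored there yet.
        if result.get(key) is None:
            result[key] = text
    return result


def flatten_sequence(pairs, sep='; '):
    """Join the texts of a simple sequence such as [('title', 'a'), ('title', 'b')]."""
    return sep.join(text for _tag, text in pairs if text)


# Sample data, for demonstration only.
record = {
    'id': '1',
    'titles': [('title', 'First title'), ('title', 'Second title')],
    'addresses': [
        ('address', {'city': 'London', 'address': None}),
        ('address', {'city': 'London', 'address1': 'Street 2'}),
    ],
}

record['titles'] = flatten_sequence(record['titles'])
record['addresses'] = [(tag, normalize_tags(fields))
                       for tag, fields in record['addresses']]
```

After this step every address dict uses the key `address`, so the same column names apply to all rows whether they go to SQL or to pandas.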