Question
My data set is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<depts xmlns="http://SOMELINK"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
date="2021-01-15">
<dept dept_id="00001"
col_two="00001value"
col_three="00001false"
name = "some_name">
<owners>
<currentowner col_four="00001value"
col_five="00001value"
col_six="00001false"
name = "some_name">
<addr col_seven="00001value"
col_eight="00001value"
col_nine="00001false"/>
</currentowner>
<currentowner col_four="00001bvalue"
col_five="00001bvalue"
col_six="00001bfalse"
name = "some_name">
<addr col_seven="00001bvalue"
col_eight="00001bvalue"
col_nine="00001bfalse"/>
</currentowner>
</owners>
</dept>
<dept dept_id="00002"
col_two="00002value"
col_three="00002value"
name = "some_name">
<owners>
<currentowner col_four="00002value"
col_five="00002value"
col_six="00002false"
name = "some_name">
<addr col_seven="00002value"
col_eight="00002value"
col_nine="00002false"/>
</currentowner>
</owners>
</dept>
</depts>
Currently I have two loops: one iterates through the child data, the other through the grandchild data.
import pandas
import xml.etree.ElementTree as element_tree
from xml.etree.ElementTree import parse
tree = element_tree.parse('<HERE_GOES_XML>')
root = tree.getroot()
name_space = {'ns0': 'http://SOMELINK'}
#root
date_from = root.attrib['date']
print(date_from)
#child
for pharma in root.findall('.//ns0:dept', name_space):
    for key, value in pharma.items():
        print(key +': ' + value)
#grandchild, this must be merged into the loop above so the script iterates through an entire dept node before moving to the next
for owner in root.findall('.//ns0:dept/ns0:owners/ns0:currentowner', name_space):
    owner_dict = {}
    for key, value in owner.items():
        print(key +': ' + value)
Current result is:
2021-01-15
dept_id: 00001
col_two: 00001value
col_three: 00001false
dept_id: 00002
col_two: 00002value
col_three: 00002value
col_four: 00001value
col_five: 00001value
col_six: 00001false
col_four: 00002value
col_five: 00002value
col_six: 00002false
I am aiming at a nested loop that first iterates over an entire dept child together with its grandchildren and only then moves on to the next one. The expected result is the set below, to be transformed into a pandas DataFrame later (I will try to work on this next). Some columns have the same name in child and grandchild, so either a prefix is required or the loop should go through specific children only.
dept.dept_id: 00001
dept.col_two: 00001value
dept.col_three: 00001false
dept.name: some_name
currentowner.col_four: 00001value
currentowner.col_five: 00001value
currentowner.col_six: 00001false
currentowner.name: some_name
currentowner.col_four: 00001bvalue
currentowner.col_five: 00001bvalue
currentowner.col_six: 00001bfalse
currentowner.name: some_name
addr.col_seven: 00001value
addr.col_eight: 00001value
addr.col_nine: 00001false
dept.dept_id: 00002
dept.col_two: 00002value
dept.col_three: 00002value
dept.name: some_name
currentowner.col_four: 00002value
currentowner.col_five: 00002value
currentowner.col_six: 00002false
currentowner.name: some_name
addr.col_seven: 00002value
addr.col_eight: 00002value
addr.col_nine: 00002false
[UPDATE] - I came across zip, which should do the trick.
dept_list = []
for item in root.iterfind('.//ns0:dept', name_space):
    #print(item.attrib)
    dept_list.append(item.attrib)
#print(dept_list)
owner_list = []
for item in root.iterfind('.//ns0:dept/ns0:owners/ns0:currentowner', name_space):
    #print(item.attrib)
    owner_list.append(item.attrib)
#print(owner_list)
zipped = zip(dept_list, owner_list)
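One caveat is worth checking before relying on zip. It pairs items purely by position and stops at the shorter list, so the two lists only line up when every dept has exactly one currentowner. In the sample data dept 00001 has two owners, so the second pair would attach an owner of 00001 to dept 00002. The following is a minimal sketch comparing zip with a nested loop; the inline XML is a cut-down copy of the data, and the names zipped_pairs and nested_pairs are only illustrative.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the sample data (illustrative only).
XML_TEXT = """<depts xmlns="http://SOMELINK">
  <dept dept_id="00001">
    <owners>
      <currentowner col_four="00001value"/>
      <currentowner col_four="00001bvalue"/>
    </owners>
  </dept>
  <dept dept_id="00002">
    <owners>
      <currentowner col_four="00002value"/>
    </owners>
  </dept>
</depts>"""

root = ET.fromstring(XML_TEXT)
name_space = {'ns0': 'http://SOMELINK'}

dept_list = [d.attrib for d in root.iterfind('.//ns0:dept', name_space)]
owner_list = [
    o.attrib
    for o in root.iterfind('.//ns0:dept/ns0:owners/ns0:currentowner', name_space)
]

# zip pairs by position: 2 depts against 3 owners gives 2 pairs.
zipped_pairs = [(d['dept_id'], o['col_four']) for d, o in zip(dept_list, owner_list)]

# Nested loop: each owner is searched for relative to its own dept.
nested_pairs = [
    (dept.get('dept_id'), owner.get('col_four'))
    for dept in root.iterfind('.//ns0:dept', name_space)
    for owner in dept.iterfind('ns0:owners/ns0:currentowner', name_space)
]

print(zipped_pairs)
print(nested_pairs)
```

The nested form keeps the parent/child association because the inner search starts from the dept element rather than from the root.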
Answer 1:
The looping can be done in a list comprehension that builds one dict per dept by navigating the DOM. The following code goes straight to a data frame.
xml = """<depts xmlns="http://SOMELINK"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
date="2021-01-15">
<dept dept_id="00001"
col_two="00001value"
col_three="00001false">
<owners>
<currentowner col_four="00001value"
col_five="00001value"
col_six="00001false">
<addr col_seven="00001value"
col_eight="00001value"
col_nine="00001false"/>
</currentowner>
</owners>
</dept>
<dept dept_id="00002"
col_two="00002value"
col_three="00002value">
<owners>
<currentowner col_four="00002value"
col_five="00002value"
col_six="00002false">
<addr col_seven="00002value"
col_eight="00002value"
col_nine="00002false"/>
</currentowner>
</owners>
</dept>
</depts>"""
import xml.etree.ElementTree as ET
import pandas as pd
root = ET.fromstring(xml)
root.attrib
ns = {'ns0': 'http://SOMELINK'}
pd.DataFrame([{**d.attrib,
               **d.find("ns0:owners/ns0:currentowner", ns).attrib,
               **d.find("ns0:owners/ns0:currentowner/ns0:addr", ns).attrib}
              for d in root.findall("ns0:dept", ns)
             ])
Safer version
If any dept had no currentowner or no currentowner/addr, calling .attrib on the result of find() would fail, because find() returns None when nothing matches. This version walks the DOM and treats these elements as optional. The dict keys are now built from the element's tag name plus the attribute name. Structure the comprehensions according to your data design: you need to consider 1 to 1, 1 to optional, and 1 to many relationships. This really goes back to the papers that Codd wrote in 1970.
import xml.etree.ElementTree as ET
import pandas as pd
root = ET.fromstring(xml)
ns = {'ns0': 'http://SOMELINK'}
pd.DataFrame([{**{f"{d.tag.split('}')[1]}.{k}":v for k,v in d.items()},
               **{f"{co.tag.split('}')[1]}.{k}":v for k,v in co.items()},
               **{f"{addr.tag.split('}')[1]}.{k}":v
                  for addr in co.findall("ns0:addr", ns)
                  for k,v in addr.items()}}
              for d in root.findall("ns0:dept", ns)
              for co in d.findall("ns0:owners/ns0:currentowner", ns)
             ])
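In the comprehension above, a dept without any currentowner produces no row at all, because the inner loop has nothing to iterate over. If such depts should still appear (the "1 to optional" case), one option is to substitute a placeholder for the missing owner. The following is a minimal sketch under stated assumptions: the helper prefixed is hypothetical, dept 00003 is invented for illustration, and the inline XML is a cut-down variant of the sample. It builds plain dicts only; the resulting list can be passed to pd.DataFrame as in the code above.

```python
import xml.etree.ElementTree as ET

# Illustrative data: dept 00003 is hypothetical and has no owners.
xml_text = """<depts xmlns="http://SOMELINK">
  <dept dept_id="00001">
    <owners>
      <currentowner col_four="00001value"><addr col_seven="00001value"/></currentowner>
      <currentowner col_four="00001bvalue"/>
    </owners>
  </dept>
  <dept dept_id="00003"/>
</depts>"""

ns = {'ns0': 'http://SOMELINK'}


def prefixed(elem):
    """Return elem's attributes keyed as '<local tag>.<attribute>'."""
    if elem is None:  # optional element that is absent
        return {}
    local = elem.tag.split('}')[-1]  # drop the '{namespace}' part
    return {f"{local}.{k}": v for k, v in elem.items()}


root = ET.fromstring(xml_text)
rows = []
for d in root.findall("ns0:dept", ns):
    # Fall back to a single None so a dept without owners still yields one row.
    owners = d.findall("ns0:owners/ns0:currentowner", ns) or [None]
    for co in owners:
        addr = co.find("ns0:addr", ns) if co is not None else None
        rows.append({**prefixed(d), **prefixed(co), **prefixed(addr)})

for row in rows:
    print(row)
```

Rows for depts without owners simply lack the currentowner and addr keys, which pandas would fill with NaN when building the frame.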
Answer 2:
You can perform a depth-first search:
from xml.etree import ElementTree
root = ElementTree.parse('data.xml').getroot()
ns = {'ns0': 'http://SOMELINK'}
date_from = root.get('date')
print(f'{date_from=}')
for dept in root.findall('./ns0:dept', ns):
    # attributes of the dept element itself
    for key, value in dept.items():
        print(f'{key}: {value}')
    # './/*' matches every descendant of dept, in document order
    for node in dept.findall('.//*'):
        for key, value in node.items():
            print(f'{key}: {value}')
    print()
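The output of this loop does not show which element each attribute came from, and the question notes that name appears on both dept and currentowner. The following is a minimal sketch that prefixes each key with the element's local tag name. The helpers local_name and walk are hypothetical, and the inline XML is a cut-down copy of the sample. Element.iter() visits the element itself and then its descendants in document order, so each addr is reported right after its own currentowner.

```python
from xml.etree import ElementTree

# Cut-down copy of the sample data (illustrative only).
XML_TEXT = """<depts xmlns="http://SOMELINK" date="2021-01-15">
  <dept dept_id="00001" name="some_name">
    <owners>
      <currentowner col_four="00001value" name="some_name">
        <addr col_seven="00001value"/>
      </currentowner>
    </owners>
  </dept>
  <dept dept_id="00002" name="some_name">
    <owners>
      <currentowner col_four="00002value" name="some_name"/>
    </owners>
  </dept>
</depts>"""


def local_name(elem):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return elem.tag.split('}')[-1]


def walk(dept):
    """Yield 'tag.attribute: value' lines for dept and all its descendants."""
    for node in dept.iter():  # document order, includes dept itself
        for key, value in node.items():
            yield f'{local_name(node)}.{key}: {value}'


root = ElementTree.fromstring(XML_TEXT)
ns = {'ns0': 'http://SOMELINK'}

lines_per_dept = [list(walk(dept)) for dept in root.findall('ns0:dept', ns)]
for lines in lines_per_dept:
    print('\n'.join(lines))
    print()
```

Elements without attributes, such as owners, contribute no lines.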
Source: https://stackoverflow.com/questions/65755193/loop-through-xml-in-python