Extracting XML into data frame with parent attribute as column title


I have thousands of XML files that I will be processing, and they have a similar format, but different parent names and different numbers of parents. Through books, google,


1 answer

伪装坚强ぢ

2020-12-17 04:15

I recommend parsing the XML into a long-format DataFrame first, much as you are already doing (see below for my implementation), and then reshaping it to your requirements.

Then you're looking for a pivot:

In [11]: df
Out[11]:
  child  Time  grandchild
0  blah  1200         100
1  blah  1300          30
2   abc  1200           2
3   abc  1300           4
4   abc  1400           2

In [12]: df.pivot(index='Time', columns='child', values='grandchild')
Out[12]:
child  abc  blah
Time
1200     2   100
1300     4    30
1400     2   NaN
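
As a minimal, self-contained sketch of the pivot step, the frame above can be rebuilt by hand and reshaped with keyword arguments. The variable names `wide` and `wide_agg` are only illustrative, and the `pivot_table` line is an optional extra that assumes you might have repeated (Time, child) pairs, which plain `pivot` rejects.

```python
import pandas as pd

# Same long-format rows as the frame shown above.
df = pd.DataFrame(
    {
        "child": ["blah", "blah", "abc", "abc", "abc"],
        "Time": [1200, 1300, 1200, 1300, 1400],
        "grandchild": [100, 30, 2, 4, 2],
    }
)

# One row per Time, one column per child name.
wide = df.pivot(index="Time", columns="child", values="grandchild")

# pivot raises on duplicate (Time, child) pairs; pivot_table aggregates
# them instead (here it keeps the first value seen for each pair).
wide_agg = df.pivot_table(
    index="Time", columns="child", values="grandchild", aggfunc="first"
)

print(wide)
```

The missing (1400, blah) combination shows up as NaN, just as in the output above.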

I recommend first parsing the file and pulling the values you want out into a list of tuples:

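from lxml import etree
# parse() returns an ElementTree; getroot() gives the root element
root = etree.parse(file_name).getroot()

# the parents are the children of the root's first child
parents = list(root[0])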

In [21]: elems = [(p.attrib['name'], int(c.attrib['Time']), int(gc.text))
                      for p in parents
                      for c in p
                      for gc in c]

In [22]: elems
Out[22]:
[('blah', 1200, 100),
 ('blah', 1300, 30),
 ('blah', 1400, 70),
 ('abc', 1200, 2),
 ('abc', 1300, 4),
 ('abc', 1400, 2)]
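
To make the extraction step easy to try without a file on disk, here is a rough sketch that runs the same comprehension over an inline document. The XML layout (a root, one wrapper element, then parents with a `name` attribute, children with a `Time` attribute, and numeric grandchild text) is an assumption inferred from the code above, since the question's sample file is not shown. It also swaps in the standard library's `xml.etree.ElementTree` for lxml so that it has no third-party dependency.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: root > wrapper > parent[name] > child[Time] > grandchild text.
XML = """
<root>
  <wrapper>
    <parent name="blah">
      <child Time="1200"><grandchild>100</grandchild></child>
      <child Time="1300"><grandchild>30</grandchild></child>
    </parent>
    <parent name="abc">
      <child Time="1200"><grandchild>2</grandchild></child>
    </parent>
  </wrapper>
</root>
"""

root = ET.fromstring(XML.strip())
parents = list(root[0])  # children of the root's first child

# One tuple per grandchild: (parent name, child Time, grandchild value).
elems = [
    (p.attrib["name"], int(c.attrib["Time"]), int(gc.text))
    for p in parents
    for c in p
    for gc in c
]
print(elems)
```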

For multiple files you could extend this into a longer list comprehension. It shouldn't be too slow unless you have a huge number of XML files (here files is the list of XML files):

elems = [(p.attrib['name'], int(c.attrib['Time']), int(gc.text))
            for f in files
            for p in etree.parse(f).getroot()[0]
            for c in p
            for gc in c]
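
With thousands of files, one malformed document would abort the whole comprehension. A possible variation, sketched below, moves the per-file logic into a helper and records the files that fail to parse. The helper names `iter_rows` and `collect`, the tiny demo files, and the element layout are all hypothetical, and the standard library's `xml.etree.ElementTree` stands in for lxml here.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def iter_rows(path):
    """Yield (parent name, Time, value) tuples from one file (assumed layout)."""
    root = ET.parse(path).getroot()
    for p in root[0]:
        for c in p:
            for gc in c:
                yield p.attrib["name"], int(c.attrib["Time"]), int(gc.text)


def collect(files):
    """Gather rows from every file, noting the ones that are not well-formed."""
    rows, failed = [], []
    for f in files:
        try:
            # Build the full list first so a bad file contributes no rows.
            rows.extend(list(iter_rows(f)))
        except ET.ParseError:
            failed.append(f)
    return rows, failed


# Demo with one good and one truncated file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.xml")
    bad = os.path.join(tmp, "bad.xml")
    with open(good, "w") as fh:
        fh.write(
            '<root><w><parent name="abc">'
            '<child Time="1200"><gc>2</gc></child>'
            "</parent></w></root>"
        )
    with open(bad, "w") as fh:
        fh.write("<root><w>")  # deliberately not well-formed
    rows, failed = collect([good, bad])

print(rows)
print([os.path.basename(f) for f in failed])
```

The resulting `rows` list can be passed to `pd.DataFrame` in the same way as `elems`.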

Put them in a DataFrame:

In [23]: pd.DataFrame(elems, columns=['child', 'Time', 'grandchild'])
Out[23]:
  child  Time grandchild
0  blah  1200        100
1  blah  1300         30
2  blah  1400         70
3   abc  1200          2
4   abc  1300          4
5   abc  1400          2

Then do the pivot. :)
