Extracting XML into data frame with parent attribute as column title

隐瞒了意图╮ 2020-12-17 03:39

I have thousands of XML files that I will be processing, and they have a similar format, but different parent names and different numbers of parents. Through books, google,

1 Answer
  • 2020-12-17 04:15

    I recommend parsing everything into a DataFrame first, much as you are already doing (my implementation is below), and then reshaping it to fit your requirements.

    Then you're looking for a pivot:

    In [11]: df
    Out[11]:
      child  Time  grandchild
    0  blah  1200         100
    1  blah  1300          30
    2   abc  1200           2
    3   abc  1300           4
    4   abc  1400           2
    
    In [12]: df.pivot(index='Time', columns='child', values='grandchild')
    Out[12]:
    child  abc  blah
    Time
    1200     2   100
    1300     4    30
    1400     2   NaN
    

    I recommend first parsing the file and pulling the values you want into a list of tuples:

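    from lxml import etree
    # parse() returns an ElementTree; getroot() gives its root element
    root = etree.parse(file_name).getroot()
    
    # the parent elements are the children of the root's first child
    parents = list(root[0])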
    
    In [21]: elems = [(p.attrib['name'], int(c.attrib['Time']), int(gc.text))
                          for p in parents
                          for c in p
                          for gc in c]
    
    In [22]: elems
    Out[22]:
    [('blah', 1200, 100),
     ('blah', 1300, 30),
     ('blah', 1400, 70),
     ('abc', 1200, 2),
     ('abc', 1300, 4),
     ('abc', 1400, 2)]
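
    To try the extraction without any files on disk, here is a minimal sketch that runs the same nested comprehension over an inline document. It uses the standard library's xml.etree.ElementTree in place of lxml, and the element names and nesting in SAMPLE_XML are assumptions, since the question's actual XML is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the tag names and nesting are assumed,
# only the 'name' and 'Time' attributes come from the answer's code.
SAMPLE_XML = """
<root>
  <data>
    <parent name="blah">
      <child Time="1200"><grandchild>100</grandchild></child>
      <child Time="1300"><grandchild>30</grandchild></child>
    </parent>
    <parent name="abc">
      <child Time="1200"><grandchild>2</grandchild></child>
    </parent>
  </data>
</root>
"""


def extract_rows(xml_text):
    """Return (parent name, Time, grandchild value) tuples from one document."""
    root = ET.fromstring(xml_text)
    parents = list(root[0])  # parents sit under the root's first child
    return [(p.attrib["name"], int(c.attrib["Time"]), int(gc.text))
            for p in parents
            for c in p
            for gc in c]


rows = extract_rows(SAMPLE_XML)
print(rows)
```

    Each tuple becomes one row of the long-format DataFrame, so the parent's name attribute is carried along as an ordinary value until the pivot turns it into a column.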
    

    For multiple files you can extend this into one longer list comprehension, which shouldn't be too slow unless you have a huge number of XML files (here files is the list of XML file paths):

    elems = [(p.attrib['name'], int(c.attrib['Time']), int(gc.text))
                for f in files
                for p in etree.parse(f).getroot()[0]
                for c in p
                for gc in c]
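
    With thousands of files, one malformed file would abort a single comprehension partway through. A hedged sketch of a more defensive loop follows, again using the standard library's xml.etree.ElementTree rather than lxml. The helper names rows_from_file and collect_rows are hypothetical, and the code assumes the same layout the comprehension above relies on (parents under the root's first child).

```python
import xml.etree.ElementTree as ET


def rows_from_file(path):
    """Yield (parent name, Time, grandchild value) tuples from one XML file."""
    root = ET.parse(path).getroot()
    for parent in root[0]:  # assumed layout: parents under the first child
        for child in parent:
            for grandchild in child:
                yield (parent.attrib["name"],
                       int(child.attrib["Time"]),
                       int(grandchild.text))


def collect_rows(paths):
    """Gather rows from every readable file; record the ones that fail."""
    rows, failed = [], []
    for path in paths:
        try:
            # Build the full list first so a bad file adds no partial rows.
            file_rows = list(rows_from_file(path))
        except (ET.ParseError, KeyError, ValueError,
                TypeError, IndexError, OSError) as exc:
            failed.append((path, exc))
        else:
            rows.extend(file_rows)
    return rows, failed
```

    Calling collect_rows(files) returns the combined rows plus a list of (path, exception) pairs, so problem files can be inspected afterwards instead of stopping the run.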
    

    Put them in a DataFrame:

    In [23]: pd.DataFrame(elems, columns=['child', 'Time', 'grandchild'])
    Out[23]:
      child  Time grandchild
    0  blah  1200        100
    1  blah  1300         30
    2  blah  1400         70
    3   abc  1200          2
    4   abc  1300          4
    5   abc  1400          2
    

    Then do the pivot. :)
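
    Putting the last two steps together, a minimal sketch, assuming pandas is installed and using the five-row sample from the first example in this answer. The variable name wide is only illustrative.

```python
import pandas as pd

# Long-format rows: (parent name, Time, grandchild value)
elems = [("blah", 1200, 100),
         ("blah", 1300, 30),
         ("abc", 1200, 2),
         ("abc", 1300, 4),
         ("abc", 1400, 2)]

df = pd.DataFrame(elems, columns=["child", "Time", "grandchild"])

# One row per Time, one column per parent name; combinations that are
# missing from the data (blah at 1400) come out as NaN.
wide = df.pivot(index="Time", columns="child", values="grandchild")
print(wide)
```

    Note that pivot raises a ValueError if the same (Time, child) pair appears more than once, which can happen when several files share a parent name. In that case df.pivot_table with an explicit aggfunc is one alternative.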
