Python XML Parsing without root

庸人自扰 2021-02-14 14:33

I want to parse a fairly large XML-like file that doesn't have a root element. The format of the file is:

    <tag1>
     <tag2>
     </tag2>
    </tag1>

    <tag1>
     <tag3/>
    </tag1>

3 Answers
  • 2021-02-14 15:00

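    Instead of editing the file, you can wrap its contents in a synthetic root element at parse time:

    import xml.etree.ElementTree as ET
    
    # Wrap the file's contents in a synthetic <root> so the parser sees a single document.
    with open("xml-file.xml") as f:
        xml_object = ET.fromstringlist(["<root>", f.read(), "</root>"])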
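
As a self-contained illustration of the wrapping trick, the sketch below parses an in-memory string instead of a file and returns only the real top-level elements, so the synthetic root does not leak into later code. The helper name `parse_fragments` and the sample text are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def parse_fragments(text):
    """Parse rootless XML text and return its top-level elements as a list."""
    # "<root>" is a synthetic wrapper; any name works as long as both tags match.
    root = ET.fromstringlist(["<root>", text, "</root>"])
    return list(root)


# Hypothetical sample: two top-level elements with no enclosing root.
sample = "<tag1><tag2/></tag1>\n<tag1><tag3/></tag1>"
for elem in parse_fragments(sample):
    print(elem.tag, [child.tag for child in elem])
```

This still holds the whole input in memory, so it suits files that fit comfortably in RAM.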
    
  • 2021-02-14 15:04

    ElementTree.fromstringlist accepts an iterable (that yields strings).

    Using it with itertools.chain:

    import itertools
    import xml.etree.ElementTree as ET
    
    with open('xml-like-file.xml') as f:
        # Single-item lists stop chain() from splitting the tags into characters.
        it = itertools.chain(['<root>'], f, ['</root>'])
        root = ET.fromstringlist(it)
    
    # Do something with `root`
    root.find('.//tag3')
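
Because `fromstringlist` still builds the complete tree in memory, a very large file may be easier to handle one top-level element at a time. The following is a minimal sketch of that idea using the standard library's `xml.etree.ElementTree.XMLPullParser` (Python 3.4+). The function name `iter_top_level`, the `chunk_size` parameter, and the sample text are assumptions for illustration, and the stream is assumed to be opened in text mode.

```python
import io
import itertools
import xml.etree.ElementTree as ET


def iter_top_level(stream, chunk_size=65536):
    """Yield each top-level element of a rootless XML text stream in turn."""
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0    # nesting depth, counting the synthetic root as level 1
    root = None  # the synthetic wrapper element
    # read() returns "" at end of file for text-mode streams.
    chunks = iter(lambda: stream.read(chunk_size), "")
    for data in itertools.chain(["<root>"], chunks, ["</root>"]):
        parser.feed(data)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
                if root is None:
                    root = elem
            else:
                depth -= 1
                if depth == 1:
                    # A direct child of the synthetic root has just closed.
                    yield elem
                    # Detach finished children so the tree does not keep growing.
                    root.clear()
    parser.close()


# Hypothetical in-memory stand-in for the large file.
sample = io.StringIO("<tag1>\n <tag2>\n </tag2>\n</tag1>\n\n<tag1>\n <tag3/>\n</tag1>")
for elem in iter_top_level(sample):
    print(elem.tag, [child.tag for child in elem])
```

Each yielded element is complete when the caller receives it, and the synthetic root is emptied before parsing continues, which keeps memory use roughly proportional to the largest top-level element rather than to the whole file.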
    
  • 2021-02-14 15:17

    lxml.html can parse fragments:

    from lxml import html
    s = """<tag1>
     <tag2>
     </tag2>
    </tag1>
    
    <tag1>
     <tag3/>
    </tag1>"""
    doc = html.fromstring(s)
    for thing in doc:
        print(thing)
        for other in thing:
            print(other)
    """
    >>> 
    <Element tag1 at 0x3411a80>
    <Element tag2 at 0x3428990>
    <Element tag1 at 0x3428930>
    <Element tag3 at 0x3411a80>
    >>>
    """
    

    Courtesy of this SO answer.

    And if there is more than one level of nesting:

    def flatten(nested):
        """Recursively flatten nested elements.
    
        Yields individual elements in document order.
        """
        for thing in nested:
            yield thing
            yield from flatten(thing)

    doc = html.fromstring(s)
    for thing in flatten(doc):
        print(thing)
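
The recursive generator above visits descendants in document order, which is the same walk that `Element.iter()` performs, except that `iter()` also yields the starting element. The sketch below checks this with the standard library's ElementTree so that it runs without lxml; lxml elements expose an `iter()` method as well. The sample document is an assumption for illustration.

```python
import xml.etree.ElementTree as ET


def flatten(nested):
    """Yield every descendant of `nested` in document order."""
    for thing in nested:
        yield thing
        yield from flatten(thing)


# Hypothetical sample with two levels of nesting under a wrapper element.
root = ET.fromstring("<root><tag1><tag2/></tag1><tag1><tag3/></tag1></root>")

# iter() includes the starting element itself, so skip the first item.
via_iter = [elem.tag for elem in root.iter()][1:]
via_flatten = [elem.tag for elem in flatten(root)]
print(via_iter == via_flatten)
```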
    

    Similarly, lxml.etree.HTML will parse this. It adds html and body tags:

    from lxml import etree

    d = etree.HTML(s)
    for thing in d.iter():
        print(thing)
    
    """ 
    <Element html at 0x3233198>
    <Element body at 0x322fcb0>
    <Element tag1 at 0x3233260>
    <Element tag2 at 0x32332b0>
    <Element tag1 at 0x322fcb0>
    <Element tag3 at 0x3233148>
    """
    