How do I use xml namespaces with find/findall in lxml?

星月不相逢 2020-12-05 02:50

I'm trying to parse content in an OpenOffice ODS spreadsheet. The ODS format is essentially just a zip file containing a number of documents. The content of the spreadsheet is sto

4 answers
  • 2020-12-05 03:10

    The first thing to notice is that namespaces are defined at the element level, not at the document level.

    Most often, though, all namespaces are declared on the document's root element (office:document-content here), which saves us from parsing the whole document to collect xmlns declarations scoped to inner elements.

    An element's nsmap then includes:

    • a default namespace, under the None prefix (not always present)
    • all of its ancestors' namespaces, unless overridden.
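
As a small illustration of the two points above, here is a sketch using a made-up document. The urn:example:* URIs and the element names are placeholders chosen for this example and are not part of the ODS format. It shows that the root's nsmap only knows the prefixes declared on the root, while an inner element's nsmap also carries everything inherited from its ancestors.

```python
from lxml import etree

# Hypothetical document: a default namespace and prefix "a" on the root,
# prefix "b" declared only on an inner element.
xml = (
    b'<root xmlns="urn:example:default" xmlns:a="urn:example:a">'
    b'<a:child xmlns:b="urn:example:b"><b:leaf/></a:child>'
    b'</root>'
)
root = etree.fromstring(xml)
leaf = root[0][0]  # the <b:leaf/> element

# nsmap is a mapping of prefix -> namespace URI; copy it into plain dicts
root_nsmap = dict(root.nsmap)
leaf_nsmap = dict(leaf.nsmap)

print(root_nsmap)  # default namespace (None key) and "a"
print(leaf_nsmap)  # the same two entries plus "b"
```

If a document declares prefixes on inner elements like this, reading only root.nsmap will miss them.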

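    If, as ChrisR mentioned, the default namespace (the None prefix) is not supported in the prefix map, you can use a dict comprehension to filter it out in a more compact expression.

    The syntax differs slightly between xpath and ElementPath: xpath() takes the prefix map as the namespaces keyword argument, while find() and findall() take it as the second argument.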


    So here's the code you could use to get all the rows of your first table (tested with lxml 3.4.2):

    import zipfile
    from lxml import etree
    
    # Open and parse the document
    zf = zipfile.ZipFile('spreadsheet.ods')
    tree = etree.parse(zf.open('content.xml'))
    
    # Get the root element
    root = tree.getroot()
    
    # get its namespace map, excluding the default namespace (None key);
    # items() works on both Python 2 and Python 3, unlike iteritems()
    nsmap = {k: v for k, v in root.nsmap.items() if k}
    
    # use defined prefixes to access elements
    table = tree.find('.//table:table', nsmap)
    rows = table.findall('table:table-row', nsmap)
    
    # or, if xpath is needed:
    table = tree.xpath('//table:table', namespaces=nsmap)[0]
    rows = table.xpath('table:table-row', namespaces=nsmap)
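
To try the approach without a real spreadsheet at hand, the following sketch builds a tiny stand-in archive in memory. The XML content is heavily simplified and only illustrative (a real content.xml declares many more namespaces), and the sheet name "Sheet1" is an assumption made for the example.

```python
import io
import zipfile
from lxml import etree

OFFICE_NS = 'urn:oasis:names:tc:opendocument:xmlns:office:1.0'
TABLE_NS = 'urn:oasis:names:tc:opendocument:xmlns:table:1.0'

# Simplified, illustrative stand-in for an ODS content.xml
content = (
    '<office:document-content xmlns:office="%s" xmlns:table="%s">'
    '<office:body><office:spreadsheet>'
    '<table:table table:name="Sheet1">'
    '<table:table-row/><table:table-row/>'
    '</table:table>'
    '</office:spreadsheet></office:body>'
    '</office:document-content>' % (OFFICE_NS, TABLE_NS)
)

# Write the stand-in archive into a memory buffer instead of a file on disk
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('content.xml', content)
buf.seek(0)

# Open and parse it the same way as a real .ods file
with zipfile.ZipFile(buf) as zf:
    with zf.open('content.xml') as fh:
        tree = etree.parse(fh)

root = tree.getroot()
nsmap = {k: v for k, v in root.nsmap.items() if k}

# ElementPath: prefix map as second argument
table = tree.find('.//table:table', nsmap)
rows = table.findall('table:table-row', nsmap)

# XPath: prefix map as the namespaces keyword argument
rows_xpath = tree.xpath('//table:table/table:table-row', namespaces=nsmap)

print(len(rows), len(rows_xpath))
```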
    
  • 2020-12-05 03:10

    Etree won't find namespaced elements if the XML contains no xmlns declarations for their prefixes. For instance:

    import lxml.etree as etree
    
    xml_doc = '<ns:root><ns:child></ns:child></ns:root>'
    
    tree = etree.fromstring(xml_doc)
    
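    # finds nothing (depending on the parser settings, lxml may even reject
    # this input with an XMLSyntaxError, since the prefix "ns" is undeclared):
    tree.find('.//ns:root', {'ns': 'foo'})
    tree.find('.//{foo}root', {'ns': 'foo'})
    tree.find('.//ns:root')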
    

    Sometimes that is the data you are given. So, what can you do when there is no namespace declaration?

    My solution: add one.

    import lxml.etree as etree
    
    xml_doc = '<ns:root><ns:child></ns:child></ns:root>'
    xml_doc_with_ns = '<ROOT xmlns:ns="foo">%s</ROOT>' % xml_doc
    
    tree = etree.fromstring(xml_doc_with_ns)
    
    # finds what you're looking for:
    tree.find('.//{foo}root')
    
  • 2020-12-05 03:21

    Here's a way to get all the namespaces in the XML document (supposing there's no prefix conflict).

    I use this when parsing XML documents where I do not know the namespace URLs in advance, only the prefixes.

    from lxml import etree

    # XML_string holds the document source
    doc = etree.XML(XML_string)

    # Collect every namespace declared anywhere in the document.
    nsmap = {}
    for prefix, uri in doc.xpath('//namespace::*'):
        if prefix:  # skip the None prefix, neither needed nor supported
            nsmap[prefix] = uri
    doc.xpath('//prefix:element', namespaces=nsmap)
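
For a self-contained run of the same idea, here is a sketch with a made-up document in which one prefix is declared only on an inner element. The urn:example:* URIs and element names are placeholders. The point is that the namespace axis picks up the inner declaration, which the root element's own nsmap does not contain.

```python
from lxml import etree

# Hypothetical document: "a" is declared on the root, "b" only further down
xml = (
    b'<root xmlns:a="urn:example:a">'
    b'<a:item><b:detail xmlns:b="urn:example:b"/></a:item>'
    b'</root>'
)
doc = etree.XML(xml)

# Each result of the namespace axis is a (prefix, uri) tuple
nsmap = {}
for prefix, uri in doc.xpath('//namespace::*'):
    if prefix:  # skip the None (default) prefix
        nsmap[prefix] = uri

# The collected map resolves the prefix that the root did not declare
details = doc.xpath('//b:detail', namespaces=nsmap)
print(details[0].tag)
```

If two elements bind the same prefix to different URIs, the later one silently wins in this dict, which is why the approach assumes there is no prefix conflict.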
    
  • 2020-12-05 03:32

    If root.nsmap contains the table namespace prefix, then you could use:

    root.xpath('.//table:table', namespaces=root.nsmap)
    

    findall(path) natively accepts the {namespace}name syntax rather than namespace:name. A prefixed path therefore has to be converted to the {namespace}name form using a namespace dictionary before it is passed to findall(), unless the prefix map is supplied to findall() as its second argument.
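
The preprocessing step could be sketched roughly as follows. The helper name to_clark is hypothetical (it is not part of lxml), and the regular expression is deliberately simple: it assumes plain prefix:name steps and ASCII names, and does not try to handle predicates containing colons.

```python
import re

# Matches a "prefix:localname" step; a simplification of real XML name rules
_PREFIXED = re.compile(r'([A-Za-z_][\w.-]*):([A-Za-z_][\w.-]*)')

def to_clark(path, nsmap):
    """Rewrite prefix:name steps in path to {uri}name using nsmap.

    Hypothetical helper for illustration; raises KeyError for a prefix
    that is missing from nsmap.
    """
    def replace(match):
        prefix, local = match.group(1), match.group(2)
        return '{%s}%s' % (nsmap[prefix], local)
    return _PREFIXED.sub(replace, path)

nsmap = {'table': 'urn:oasis:names:tc:opendocument:xmlns:table:1.0'}
clark_path = to_clark('.//table:table/table:table-row', nsmap)
print(clark_path)
```

The resulting string can then be passed to findall() without any prefix map.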
