Good Python XML parser to work with namespace-heavy documents

悲哀的现实 2020-12-31 18:32

Python's ElementTree seems unusable with namespaces. What are my alternatives? BeautifulSoup is pretty rubbish with namespaces too. I don't want to strip them out.

Ex

3 answers
  •  伪装坚强ぢ
    2020-12-31 18:53

    lxml is namespace-aware.

    >>> from lxml import etree
    >>> et = etree.XML("""""")
    >>> etree.tostring(et, encoding=str) # encoding=str only needed in Python 3, to avoid getting bytes
    ''
    >>> et.xpath("f:bar", namespaces={"b":"bar", "f": "foo"})
    []
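
A self-contained sketch of the same idea, assuming a small hypothetical document whose default namespace is `foo` and whose `stuff` prefix is bound to `bar` (the document and the variable names are illustrative, not taken from the question). It shows that the prefixes in the `namespaces` mapping are local to the query and only the URIs have to match the document:

```python
from lxml import etree

# Hypothetical document: the default namespace is "foo", and the prefix
# "stuff" is bound to the namespace "bar".
doc = etree.XML(
    '<root xmlns="foo" xmlns:stuff="bar">'
    '<bar><stuff:baz>hello</stuff:baz></bar>'
    '</root>'
)

# Prefixes used in XPath are local to the query; only the URIs must match
# the document. Here "f" stands for "foo" and "b" for "bar".
ns = {"f": "foo", "b": "bar"}

bars = doc.xpath("f:bar", namespaces=ns)
baz_texts = doc.xpath("f:bar/b:baz/text()", namespaces=ns)

# An unprefixed name in XPath means "no namespace", so this matches nothing
# even though <bar> sits in the document's default namespace.
unqualified = doc.xpath("bar")

# lxml stores tags in Clark notation: {namespace-uri}localname
print(bars[0].tag)
print(baz_texts)
print(unqualified)
```

The unprefixed query returning an empty list is the usual stumbling block: XPath 1.0 has no notion of a default namespace, so every namespaced element needs a prefix in the query.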
    

    Edit: On your example:

    from lxml import etree
    
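    # The b prefix (a bytes literal) is needed in Python 3, because lxml
    # rejects str input that carries an XML encoding declaration:
    # "Unicode strings with encoding declaration are not supported."
    # In Python 2 the prefix can be removed.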
    et = etree.XML(b"""...""")
    
    ns = {
        'lom': 'http://ltsc.ieee.org/xsd/LOM',
        'zs': 'http://www.loc.gov/zing/srw/',
        'dc': 'http://purl.org/dc/elements/1.1/',
        'voc': 'http://www.schooletc.co.uk/vocabularies/',
        'srw_dc': 'info:srw/schema/1/dc-schema'
    }
    
    # Per the docs, .xpath always returns a list when the query selects elements.
    # .find returns a single element, but supports only a subset of XPath.
    record = et.xpath("zs:records/zs:record", namespaces=ns)[0]
    # This example contains a single record; otherwise, apply the
    # steps below to every element the query returns.
    
    # ".//" searches below this record; a bare "//" would search the whole document
    name = record.xpath(".//voc:name", namespaces=ns)[0].text
    print("name:", name)
    
    lom_entry = record.xpath("zs:recordData/srw_dc:dc/"
                             "lom:metaMetadata/lom:identifier/"
                             "lom:entry",
                             namespaces=ns)[0].text
    
    print('lom_entry:', lom_entry)
    
    lom_ids = [taxon_id.text for taxon_id in
               record.xpath("zs:recordData/srw_dc:dc/"
                            "lom:classification/lom:taxonPath/"
                            "lom:taxon/lom:id",
                            namespaces=ns)]
    
    print("lom_ids:", lom_ids)
    

    Output:

    name: Frank Malina
    lom_entry: 2.6
    lom_ids: ['PYTHON', 'XML', 'XML-NAMESPACES']
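
When a response holds several records, the same lookups can be run per record. The sketch below assumes a made-up two-record document (`SAMPLE` and its values are hypothetical; only the namespace URIs come from the example above) and uses `findall`/`findtext`, which accept the same `namespaces` mapping but only the limited path syntax mentioned earlier. It falls back to the standard library's ElementTree if lxml is not installed, since these particular calls behave the same in both:

```python
try:
    from lxml import etree
except ImportError:  # stdlib fallback; findall/findtext take the same arguments
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for a search response; the values are made up.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<zs:searchRetrieveResponse
    xmlns:zs="http://www.loc.gov/zing/srw/"
    xmlns:voc="http://www.schooletc.co.uk/vocabularies/">
  <zs:records>
    <zs:record>
      <zs:recordData><voc:name>Alice</voc:name></zs:recordData>
    </zs:record>
    <zs:record>
      <zs:recordData><voc:name>Bob</voc:name></zs:recordData>
    </zs:record>
  </zs:records>
</zs:searchRetrieveResponse>"""

ns = {
    "zs": "http://www.loc.gov/zing/srw/",
    "voc": "http://www.schooletc.co.uk/vocabularies/",
}

root = etree.fromstring(SAMPLE)

names = []
for record in root.findall("zs:records/zs:record", namespaces=ns):
    # The relative ".//" path keeps each lookup inside its own record.
    names.append(record.findtext(".//voc:name", namespaces=ns))

print(names)
```

Collecting the per-record values in a loop avoids the `[0]` indexing, which raises `IndexError` as soon as a query matches nothing.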
    
