Extracting text from XML node with minidom

误落风尘 2021-01-19 05:58

I've looked through several posts, but I haven't found an answer that solves my problem.

Sample XML =




        
3 Answers
  • 2021-01-19 06:35

    A solution with lxml right from the docs:

    from lxml import etree
    from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`
    
    xml = etree.parse(StringIO('''<TextWithNodes>
    <Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/></TextWithNodes>'''))
    
    xml.xpath("//text()")
    # ['\n', 'TEXT1', 'TEXT2 ', 'TEXT3']
    

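    You can also look up a specific node. Its own .text is None here, though; the text that follows it is stored as the node's tail:

    xml.find(".//Node[@id='19']").tail  # 'TEXT2 '
    

    The issue here is that the text in the XML doesn't belong to any Node element. Each Node is empty, so the text sits between the nodes rather than inside them.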
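For readers without lxml installed, a rough standard-library equivalent is sketched below. It assumes the same sample XML as above; `itertext()` stands in for the `//text()` XPath query, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Same sample document as in the answer above.
xml_string = (
    '<TextWithNodes>\n'
    '<Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/>'
    '</TextWithNodes>'
)
root = ET.fromstring(xml_string)

# itertext() walks the tree in document order and yields every non-empty
# text and tail string, much like the XPath query //text().
texts = list(root.itertext())

# ElementTree supports simple attribute predicates in find().
node = root.find(".//Node[@id='19']")
node_text = node.text   # None: the Node element itself is empty
node_tail = node.tail   # the text that follows the Node
```

Here `texts` should hold the newline plus the three text segments, while `node_text` is None and `node_tail` is the segment after the node with id 19.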

  • 2021-01-19 06:40

    Using xml.etree.ElementTree (which is similar to lxml, which @DiegoNavrro used in his answer, except that ElementTree is part of the standard library and supports only a limited subset of XPath), you can give the following a go:

    import xml.etree.ElementTree as etree
    
    xml_string = """<TextWithNodes>
    <Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/>
    </TextWithNodes>
    """
    
    xml_etree = etree.fromstring(xml_string)
    
    text = [element.tail for element in xml_etree]
    # `text` will be ['TEXT1', 'TEXT2 ', 'TEXT3', '\n']
    

    Note that this assumes the XML <Node id="0"/>TEXT1... is correct. Because the text follows the end of a tag, it becomes that element's tail text. It is not the element's nodeValue, which is why the code in your question returns None for each node.

    If you wanted to parse some XML like <Node id="0">TEXT1</Node> you would have to replace the line [element.tail for element in xml_etree] with [element.text for element in xml_etree].
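To make the text/tail distinction concrete, here is a minimal sketch that pairs each Node's id with the text following it. The helper name `text_after_nodes` is hypothetical, and the sample string is the question's XML written on one line so that no whitespace-only tails appear.

```python
import xml.etree.ElementTree as ET

xml_string = (
    '<TextWithNodes>'
    '<Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/>'
    '</TextWithNodes>'
)
root = ET.fromstring(xml_string)


def text_after_nodes(parent):
    """Map each child's id attribute to the text that follows it (its tail).

    A child with nothing after it has tail None, which is mapped to ''.
    """
    return {child.get("id"): (child.tail or "") for child in parent}


segments = text_after_nodes(root)

# Every Node is empty, so .text is None for all of them.
inner_texts = [child.text for child in root]
```

With this input, `segments` maps "0", "19" and "20" to their following text and "212" to an empty string, while `inner_texts` contains only None values.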

  • 2021-01-19 06:55

    You should use the ElementTree API instead of minidom for this task (as explained in the other answers here), but if you need to use minidom, here is a solution.

    What you are looking for was added in DOM Level 3 as the textContent attribute. minidom is essentially a DOM Level 1 implementation (with a few Level 2 namespace features), so it does not provide it.

    However you can emulate textContent pretty closely with this function:

    def textContent(node):
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
            return node.nodeValue
        else:
            return ''.join(textContent(n) for n in node.childNodes)
    

    Which you can then use like so:

    from xml.dom import minidom

    x = minidom.parseString("""<TextWithNodes>
    <Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/></TextWithNodes>""")
    
    # Take the parent element; the text nodes are its direct children.
    twn = x.getElementsByTagName('TextWithNodes')[0]
    
    assert textContent(twn) == '\nTEXT1TEXT2 TEXT3'
    

    Notice how I got the text content of the parent node TextWithNodes. This is because your Node elements are siblings of those text nodes, not parents of them.
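The sibling relationship can also be used directly. The sketch below, which assumes the same sample XML, reads the text node that immediately follows each Node element via `nextSibling`; the helper name `following_text` is hypothetical.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<TextWithNodes>\n'
    '<Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/>'
    '</TextWithNodes>'
)


def following_text(node):
    """Return the text node right after `node`, or '' if there is none."""
    sibling = node.nextSibling
    if sibling is not None and sibling.nodeType == sibling.TEXT_NODE:
        return sibling.data
    return ""


# Pair each Node's id with the text that follows it in the parent.
pairs = [
    (n.getAttribute("id"), following_text(n))
    for n in doc.getElementsByTagName("Node")
]
```

The last Node has no following sibling, so its entry in `pairs` is an empty string.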
