Question
I have some text tags in my XML file (a PDF converted to XML using pdftohtml from poppler-utils) that look like this:
<text top="525" left="170" width="603" height="16" font="1">..part of old large book</text>
<text top="546" left="128" width="645" height="16" font="1">with many many pages and some <i>italics text among 'plain' text</i> and more and more text</text>
<text top="566" left="128" width="642" height="16" font="1">etc...</text>
and I can get the text enclosed in a text tag with this sample code:
from xml.dom import minidom

xmldoc = minidom.parse('../test/text.xml')
itemlist = xmldoc.getElementsByTagName('text')
node_index = 0  # assumed: index of the <text> element to inspect
some_tag = itemlist[node_index]
# works when the element contains only plain text
output_text = some_tag.firstChild.nodeValue
# if all of the text is inside <i>, I can get it with
output_text = some_tag.firstChild.firstChild.nodeValue
# but not if <i></i> wraps only part of the string
but I cannot get "nodeValue" if the element contains another tag (<i>, <b>, ...), and I cannot get the nested object either.
What is the best way to get all the text as a plain string, like JavaScript's innerHTML, or to recurse into child tags even when they wrap only some of the words rather than the entire nodeValue?
thanks
Answer 1:
Question: How to get inner content as string using minidom
This is a recursive solution, for instance:
from xml.dom import minidom

def getText(nodelist):
    # Iterate over all nodes and collect the data of each TEXT_NODE
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
        else:
            # Recurse into child elements such as <i> or <b>
            rc.append(getText(node.childNodes))
    return ''.join(rc)

xmldoc = minidom.parse('../test/text.xml')
nodelist = xmldoc.getElementsByTagName('text')
# Iterate over the <text ...>...</text> node list
for node in nodelist:
    print(getText(node.childNodes))
Output:
..part of old large book
with many many pages and some italics text among 'plain' text and more and more text
etc...
Tested with Python: 3.4.2
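The question also asks for something like JavaScript's innerHTML. As a rough sketch of both variants, the block below parses one of the sample lines from an inline string so it runs without the original file. The `<page>` root element and the helper names `inner_text` and `inner_xml` are assumptions made for illustration; they are not part of minidom or of the original answer.

```python
from xml.dom import minidom

# One sample line from the question, wrapped in a hypothetical <page> root
# so that the string parses as a complete XML document.
SAMPLE = (
    '<page>'
    '<text top="546" left="128" width="645" height="16" font="1">'
    "with many many pages and some <i>italics text among 'plain' text</i>"
    ' and more and more text</text>'
    '</page>'
)


def inner_text(node):
    """Concatenate the data of every text or CDATA node below `node`."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.data)
        else:
            # Elements such as <i> or <b>: descend into their children
            parts.append(inner_text(child))
    return ''.join(parts)


def inner_xml(node):
    """Serialize the children of `node`, keeping nested tags (innerHTML-like)."""
    return ''.join(child.toxml() for child in node.childNodes)


doc = minidom.parseString(SAMPLE)
text_node = doc.getElementsByTagName('text')[0]

plain = inner_text(text_node)   # text only, nested tags dropped
markup = inner_xml(text_node)   # text with the <i>...</i> markup preserved

print(plain)
print(markup)
```

Here `plain` holds the sentence with the `<i>` tags dropped, while `markup` keeps them, which is the closer analogue of innerHTML. Note that `toxml()` re-escapes special characters such as `&` and `<`, so `markup` is serialized XML rather than the raw source bytes.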
Source: https://stackoverflow.com/questions/45603446/how-to-get-inner-content-as-string-using-minidom-from-xml-dom