Question
I have an XML document from which I want to extract text based on tags.
The part that I want to extract text from looks something like this:
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT=""/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
When I do
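import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
# Collect the distinct tag names, then pick the one containing "BlockText"
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = text.text  # only the text before the first child element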
I'm only able to grab the part that comes before the empty tag <TIP CONTENT=""/>.
I tried to delete this tag before getting the rest of the text.
I did:
emptyTag = list(filter(lambda i: "TIP" in i, tags))
for e in root.iter(emptyTag):
    root.remove(e)
But this is not working.
Neither <BlockText> nor <TIP> is a direct child of root.
Thank you.
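Answer 1:
The text after <TIP CONTENT=""/> belongs to that element's tail, not to the text of the BlockText tag.
elem.text is the text between an element's opening tag and its first child (or its closing tag, if it has no children).
elem.tail is the text between an element's closing tag and the next tag. Usually this is just whitespace, but in this case it holds actual text.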
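To make the text/tail distinction concrete, here is a minimal sketch using xml.etree.ElementTree on a cut-down version of the snippet from the question. The sample string and the variable names are illustrative assumptions, and attr2 is quoted here because well-formed XML requires quoted attribute values.

```python
import xml.etree.ElementTree as ET

# Cut-down sample (assumed); attr2 is quoted so the XML is well-formed.
xml_source = (
    '<BlockText attr2="657" lang="en">'
    'It has survived not only<TIP CONTENT=""/> five centuries.'
    '</BlockText>'
)

block = ET.fromstring(xml_source)
tip = block.find("TIP")

block_text = block.text  # text between <BlockText ...> and its first child
tip_text = tip.text      # <TIP/> is empty, so it has no text of its own
tip_tail = tip.tail      # text between <TIP/> and the next tag

print(repr(block_text))
print(repr(tip_text))
print(repr(tip_tail))
```

The second half of the sentence lives on the TIP element, which is why reading only the text of BlockText stops at the empty tag.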
Answer 2:
OK, this is what ended up working for me:
emptyTags = list(filter(lambda i: "TIP" in i, tags))
if emptyTags:
    emptyTag = emptyTags[0]
    for element in root.iter(emptyTag):
        # The tail holds the text that follows the empty tag
        print(element.tail)
But I still couldn't get the text as a whole block, in its original order. I could get all the BlockText tags and all the TIP tags, but not together.
Update:
I used:
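import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    # itertext() yields every text and tail inside the element, in document order
    texte = ''.join(text.itertext())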
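This works because itertext() walks the element and its descendants in document order, yielding each piece of text and each tail in turn. Below is a minimal, self-contained sketch of the same idea. The sample XML (including the Doc and Section wrappers) and the helper name full_text are assumptions made for illustration, not part of the original file.

```python
import xml.etree.ElementTree as ET

# Assumed sample: BlockText is nested, so it is not a direct child of the root.
xml_source = (
    '<Doc><Section>'
    '<BlockText lang="en">It has survived not only'
    '<TIP CONTENT=""/> five centuries.</BlockText>'
    '</Section></Doc>'
)

def full_text(element):
    """Join every text fragment inside element, in document order."""
    return "".join(element.itertext())

root = ET.fromstring(xml_source)
# root.iter() searches the whole subtree, so nesting depth does not matter.
texts = [full_text(block) for block in root.iter("BlockText")]
print(texts)
```

Since the empty TIP element contributes no text of its own, the joined string reads as if the tag had never been there, and no element needs to be removed from the tree.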
Answer 3:
Another solution, for reference only:
from simplified_scrapy import SimplifiedDoc
html = '''
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT=""/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
'''
doc = SimplifiedDoc(html)
print(doc.select('BlockText'))
print(doc.select('BlockText>text()'))
print(doc.selects('BlockText>text()'))
Result:
{'tag': 'BlockText', 'attr1': 'blah', 'attr2': '657', 'ID': 'Bhf76', 'lang': 'en', 'html': '\nSimply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="\xad" />\n five centuries, electronic typesetting, remaining essentially release.\n'}
Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.
['Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.']
Source: https://stackoverflow.com/questions/60321983/python-xml-etree-elementtree-remove-empty-tag-in-the-middle-of-text