How do I get the full XML or HTML content of an element using ElementTree?

That is, all text and subtags, without the tag of an element itself?

Having

<p>blah <b>bleh</b> blih</p>

I want

blah <b>bleh</b> blih

7 Answers

小鲜肉

2020-12-30 08:52

ElementTree works perfectly; you just have to assemble the answer yourself. Something like this:

"".join( [ t.text or "" ] + [ xml.tostring(e, encoding="unicode") for e in t ] )

Thanks to JV and PEZ for pointing out the errors.

Edit.

>>> import xml.etree.ElementTree as xml
>>> s= '<p>blah <b>bleh</b> blih</p>\n'
>>> t=xml.fromstring(s)
>>> "".join( [ t.text or "" ] + [ xml.tostring(e, encoding="unicode") for e in t ] )
'blah <b>bleh</b> blih'
>>>

The tail does not need to be added separately: tostring already serializes each child together with its tail text.
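
For reuse, the join expression above can be wrapped in a small helper. This is only a minimal sketch, assuming Python 3; the name `inner_xml` is hypothetical and not part of ElementTree.

```python
import xml.etree.ElementTree as ET

def inner_xml(element):
    """Return the text and child markup of `element`, without its own tag.

    Hypothetical helper for illustration; not an ElementTree API.
    """
    parts = [element.text or ""]  # text is None when the element starts with a child
    for child in element:
        # tostring() serializes the child together with its tail text
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

root = ET.fromstring("<p>blah <b>bleh</b> blih</p>")
print(inner_xml(root))  # blah <b>bleh</b> blih
```

Iterating over the element directly replaces `getchildren()`, which was removed in Python 3.9.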


后悔当初

2020-12-30 08:54

These are good answers, which answer the OP's question, particularly if the question is confined to HTML. But documents are inherently messy, and the depth of element nesting is usually impossible to predict.

To simulate DOM's getTextContent() you would have to use a (very) simple recursive mechanism.

To get just the bare text:

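def get_deep_text( element ):
    # Text inside the element only; the element's own tail lies outside it.
    text = element.text or ''
    for subelement in element:
        text += get_deep_text( subelement )
        # A child's tail is still inside the parent, so it is added here.
        text += subelement.tail or ''
    return text
print( get_deep_text( element_of_interest ))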
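
If only the text inside an element is wanted, the standard library's `Element.itertext()` walks text and tail strings in document order without hand-written recursion. A minimal sketch, with sample markup made up for illustration; it stops at the element's closing tag, so the tail of the starting element itself is not included.

```python
import xml.etree.ElementTree as ET

# Sample markup invented for this illustration.
root = ET.fromstring("<p>blah <b>bleh <i>bluh</i></b> blih</p>")

# itertext() yields the text and tail strings of the element's subtree,
# in document order, skipping missing (None) values.
deep_text = "".join(root.itertext())
print(deep_text)  # blah bleh bluh blih
```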

To get all the details about the boundaries between raw text:

import itertools

def get_deep_text_w_boundaries( element, depth = 0, counter = None ):
    # A shared counter numbers the elements in document order. Element objects
    # from the C-accelerated ElementTree do not accept ad-hoc attributes,
    # so the count is not stored on the root element.
    if counter is None:
        counter = itertools.count( 1 )
    element_no = next( counter )
    indent = depth * '  '
    text1 = '%s(el %d - attribs: %s)\n' % ( indent, element_no, element.attrib, )
    text1 += '%s(el %d - text: |%s|)' % ( indent, element_no, element.text or '', )
    print( text1 )
    for subelement in element:
        get_deep_text_w_boundaries( subelement, depth + 1, counter )
    text2 = '%s(el %d - tail: |%s|)' % ( indent, element_no, element.tail or '', )
    print( text2 )
get_deep_text_w_boundaries( root_el_of_interest )

Example output from single para in LibreOffice Writer doc (.fodt file):

(el 1 - attribs: {'{urn:oasis:names:tc:opendocument:xmlns:text:1.0}style-name': 'Standard'})
(el 1 - text: |Ci-après individuellement la "|)
  (el 2 - attribs: {'{urn:oasis:names:tc:opendocument:xmlns:text:1.0}style-name': 'T5'})
  (el 2 - text: |Partie|)
  (el 2 - tail: |" et ensemble les "|)
  (el 3 - attribs: {'{urn:oasis:names:tc:opendocument:xmlns:text:1.0}style-name': 'T5'})
  (el 3 - text: |Parties|)
  (el 3 - tail: |", |)
(el 1 - tail: |
   |)

One of the points about messiness is that there is no hard and fast rule about when a text style indicates a word boundary and when it doesn't. Superscript immediately following a word (with no white space) means a separate word in all use cases I can imagine. On the other hand, you might sometimes find a document where the first letter of a word is bolded for some reason, or is given a different style to render it as upper case rather than simply using the normal upper-case character.

And of course, the less English-centric this discussion gets, the greater the subtleties and complexities!


借酒劲吻你

2020-12-30 08:58
This answer is a slightly modified version of Pupeno's reply: I added the encoding argument to "tostring". This issue cost me many hours, so I hope this small correction will help others.
```
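from xml.etree import ElementTree

def element_to_string(element):
    s = element.text or ""
    for sub_element in element:
        # encoding='unicode' makes tostring return str instead of bytes
        s += ElementTree.tostring(sub_element, encoding='unicode')
    s += element.tail or ""  # tail is None when no text follows the element
    return s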
```
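
The encoding argument matters because, on Python 3, `ElementTree.tostring` returns `bytes` by default, and `bytes` cannot be concatenated to a `str`. Passing `encoding="unicode"` makes it return `str`. A small sketch with made-up sample markup:

```python
from xml.etree import ElementTree

# Sample markup invented for this illustration; [0] selects the <b> child.
child = ElementTree.fromstring("<p>blah <b>bleh</b> blih</p>")[0]

as_bytes = ElementTree.tostring(child)                     # default: bytes
as_text = ElementTree.tostring(child, encoding="unicode")  # str

print(type(as_bytes).__name__, type(as_text).__name__)  # bytes str
try:
    "" + as_bytes  # mixing str and bytes fails on Python 3
except TypeError as exc:
    print("TypeError:", exc)
print(as_text)  # <b>bleh</b> blih
```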

慢半拍i

2020-12-30 09:00

This is the solution I ended up using:

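import xml.etree.ElementTree as etree  # assumption: the original import was not shown

def element_to_string(element):
    s = element.text or ""
    for sub_element in element:
        # On Python 3, tostring returns bytes unless encoding="unicode" is given
        s += etree.tostring(sub_element, encoding="unicode")
    s += element.tail or ""  # tail is None when no text follows the element
    return s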


时光取名叫无心

2020-12-30 09:00
I doubt ElementTree is the thing to use for this. But assuming you have strong reasons for using it, you could try stripping the root tag from the serialized fragment:
```
 tag = re.escape(element.tag)  # needs: import re; from xml.etree import ElementTree
 re.sub(r'(^<%s\b.*?>|</%s\b.*?>$)' % (tag, tag), '', ElementTree.tostring(element, encoding='unicode'))
```
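
To see where the regular expression above works and where it does not, here is a rough sketch on made-up sample markup. The helper name `strip_root_tag` is hypothetical. Because `tostring` includes an element's tail text, the `$` anchor only finds the closing tag when nothing follows it.

```python
import re
from xml.etree import ElementTree

def strip_root_tag(element):
    """Serialize `element` and strip its own opening and closing tag.

    Hypothetical helper for illustration only.
    """
    tag = re.escape(element.tag)
    pattern = r'(^<%s\b.*?>|</%s\b.*?>$)' % (tag, tag)
    return re.sub(pattern, '', ElementTree.tostring(element, encoding='unicode'))

root = ElementTree.fromstring('<p class="x">blah <b>bleh</b> blih</p>')
print(strip_root_tag(root))     # blah <b>bleh</b> blih

# The <b> child has tail text after its closing tag, so `$` misses the tag:
print(strip_root_tag(root[0]))  # bleh</b> blih
```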
醉梦人生

2020-12-30 09:08
No idea if an external library might be an option, but anyway: assuming there is exactly one <p> with this text on the page, a jQuery solution would be:
```
alert($('p').html()); // shows: blah <b>bleh</b> blih
```