Python ElementTree won't convert non-breaking spaces when using UTF-8 for output

执笔经年 2021-02-20 14:45

I'm trying to parse, manipulate, and output HTML using Python's ElementTree:

import sys
from cStringIO  import StringIO
from xml.etree  import ElementTree as E         


        
5 answers
  •  青春惊慌失措
    2021-02-20 15:38

    The byte 0xA0 is a Latin-1 encoded character, not a unicode one, and the value of p.text in the loop is a str rather than a unicode object. To encode it as UTF-8, Python must first convert it implicitly into a unicode string (that is, decode it). Because it was not told which encoding to use, it assumes ASCII. 0xA0 is not a valid ASCII character, although it is a valid Latin-1 character.
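
    A small sketch of that failure, written for Python 3, where the decode step has to be spelled out instead of happening implicitly as it does in Python 2. The variable names are illustrative only and do not come from the question.

```python
# Python 3 sketch: the single byte 0xA0 is valid Latin-1 but not valid ASCII.
raw = b"\xa0"

# This is the decode that Python 2 attempts implicitly before it can
# encode a str as UTF-8.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding as Latin-1 succeeds and yields U+00A0 (NO-BREAK SPACE).
text = raw.decode("latin-1")

# The decoded text can then be encoded as UTF-8, which takes two bytes.
utf8 = text.encode("utf-8")

print(ascii_ok, hex(ord(text)), utf8)
```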

    The reason you have Latin-1 characters instead of unicode characters is that entitydefs maps entity names to Latin-1 encoded strings. You need the unicode code point instead, which you can get from htmlentitydefs.name2codepoint.
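
    To illustrate the code point mapping, here is a minimal Python 3 sketch. It assumes the Python 3 module name html.entities (the renamed htmlentitydefs) and uses chr() where Python 2 would use unichr(). The entity_table name is made up for the example.

```python
# Python 3 sketch: htmlentitydefs was renamed to html.entities.
from html.entities import name2codepoint

# name2codepoint maps an entity name to an integer code point.
nbsp_codepoint = name2codepoint["nbsp"]

# chr() (unichr() in Python 2) turns the code point into a one-character string.
nbsp = chr(nbsp_codepoint)

# Build a name -> character table of the kind a parser's entity dict expects.
entity_table = {name: chr(cp) for name, cp in name2codepoint.items()}

print(nbsp_codepoint, repr(nbsp), repr(entity_table["lt"]))
```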

    The version below should fix it for you:

    import sys
    from cStringIO  import StringIO
    from xml.etree  import ElementTree as ET
    from htmlentitydefs import name2codepoint
    
    source = StringIO("""<html>
    <body>
    <p>Less than &lt;</p>
    <p>Non-breaking space &nbsp;</p>
    </body>
    </html>""")

    parser = ET.XMLParser()
    parser.parser.UseForeignDTD(True)
    # Map entity names to unicode characters, not Latin-1 byte strings.
    parser.entity.update((x, unichr(i)) for x, i in name2codepoint.iteritems())
    etree = ET.ElementTree()

    tree = etree.parse(source, parser=parser)
    for p in tree.findall('.//p'):
        print ET.tostring(p, encoding='UTF-8')

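    As a rough check of what the script above should emit for the second paragraph: once the entity has been resolved to a unicode character, serializing as UTF-8 writes U+00A0 as the two bytes 0xC2 0xA0. The sketch below is Python 3 and builds the element by hand instead of parsing it, an assumption made only to keep the example self-contained.

```python
# Python 3 sketch: serialize an element whose text holds U+00A0.
from xml.etree import ElementTree as ET

# Stand-in for a parsed <p> element whose &nbsp; was resolved to unicode.
p = ET.Element("p")
p.text = "Non-breaking space \u00a0"

# With a byte encoding such as UTF-8, tostring() returns bytes.
out = ET.tostring(p, encoding="UTF-8")

print(out)
```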