Euro sign issue when reading an RTF file with Python

北海茫月 2021-01-22 19:09

I need to generate an RTF document using Python and pyRTF. Everything works: accented letters are no problem, and even the euro sign is accepted without errors, but instead

2 answers
  • 2021-01-22 19:54

    The good news is that you're not doing anything wrong. The bad news is that the RTF is being read as ISO 8859-1 regardless.

    >>> print u'€'.encode('iso-8859-15').decode('iso-8859-1')
    ¤
    

    You'll need to use a Unicode escape if you want it to be read properly.

    >>> print hex(ord(u'€'))
    0x20ac
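
As a rough Python 3 sketch of what such an escape looks like: RTF writes a character as `\uN?`, where N is the code point as a signed 16-bit decimal number and `?` is the fallback shown by readers that do not understand the escape. The helper name `rtf_escape_char` is made up for illustration and is not part of pyRTF.

```python
def rtf_escape_char(ch):
    """Return the RTF \\uN? escape for one Basic Multilingual Plane character."""
    codepoint = ord(ch)
    if codepoint > 0xFFFF:
        raise ValueError("needs a surrogate pair, not handled in this sketch")
    # RTF expects a signed 16-bit decimal, so 32768 and above wrap negative.
    if codepoint >= 0x8000:
        codepoint -= 0x10000
    return "\\u%d?" % codepoint


print(hex(ord("€")))         # 0x20ac
print(rtf_escape_char("€"))  # \u8364?
```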
    
  • 2021-01-22 20:02

    The RTF standard uses UTF-16, but shaped to fit the RTF command sequence format; this is documented at http://en.wikipedia.org/wiki/Rich_Text_Format#Character_encoding. Unfortunately, pyRTF doesn't do any encoding for you. Handling this was on the project's TODO list, but the library was abandoned before anyone got to it.

    This is based on code I used in a project recently. I've now released it as rtfunicode on PyPI, with support for Python 2 and 3. The Python 2 version:

    import codecs
    import re
    
    _charescape = re.compile(u'([\x00-\x1f\\\\{}\x80-\uffff])')
    def _replace(match):
        codepoint = ord(match.group(1))
        # Convert codepoint into a signed integer, insert into escape sequence
        return '\\u%s?' % (codepoint if codepoint < 32768 else codepoint - 65536)    
    
    
    def rtfunicode_encode(text, errors):
        # Encode to RTF \uDDDDD? signed 16-bit integers and replacement char
        return _charescape.sub(_replace, text).encode('ascii')
    
    
    class Codec(codecs.Codec):
        def encode(self, input, errors='strict'):
            return rtfunicode_encode(input, errors), len(input)
    
    
    class IncrementalEncoder(codecs.IncrementalEncoder):
        def encode(self, input, final=False):
            return rtfunicode_encode(input, self.errors)
    
    
    class StreamWriter(Codec, codecs.StreamWriter):
        pass
    
    
    def rtfunicode(name):
        if name == 'rtfunicode':
            return codecs.CodecInfo(
                name='rtfunicode',
                encode=Codec().encode,
                decode=Codec().decode,
                incrementalencoder=IncrementalEncoder,
                streamwriter=StreamWriter,
            )
    
    codecs.register(rtfunicode)
    

    Instead of encoding to "iso-8859-15", you can then encode to 'rtfunicode':

    >>> u'\u20AC'.encode('rtfunicode') # EURO currency symbol
    '\\u8364?'
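
For Python 3, where `str.encode` must produce `bytes`, the same idea might look roughly like the sketch below. It is an adaptation of the code above, not the source of the published rtfunicode package, and the codec name `rtfunicode_sketch` is chosen here only to avoid clashing with the real package.

```python
import codecs
import re

# Control characters, backslash, braces and everything from U+0080 to U+FFFF.
_charescape = re.compile("([\x00-\x1f\\\\{}\x80-\uffff])")


def _replace(match):
    codepoint = ord(match.group(1))
    # RTF wants a signed 16-bit decimal number.
    return "\\u%d?" % (codepoint if codepoint < 32768 else codepoint - 65536)


def rtfunicode_encode(text, errors="strict"):
    # Characters beyond U+FFFF are not matched by the pattern, so the ASCII
    # step raises UnicodeEncodeError for them under errors="strict".
    return _charescape.sub(_replace, text).encode("ascii", errors)


class Codec(codecs.Codec):
    def encode(self, input, errors="strict"):
        return rtfunicode_encode(input, errors), len(input)


def _search(name):
    # Codec search function: return CodecInfo for our name, None otherwise.
    if name == "rtfunicode_sketch":
        return codecs.CodecInfo(
            name="rtfunicode_sketch",
            encode=Codec().encode,
            decode=Codec().decode,  # inherited stub, raises NotImplementedError
        )
    return None


codecs.register(_search)

print("10 €".encode("rtfunicode_sketch"))  # b'10 \\u8364?'
```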
    

    Encode any text you insert into your RTF document this way.
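
To illustrate where the encoded text ends up, here is a minimal Python 3 sketch that wraps escaped text in a bare-bones RTF document without using pyRTF. The helper names `escape_rtf` and `make_rtf` and the tiny hand-written header are illustrative assumptions; a document produced by pyRTF carries considerably more markup.

```python
import re

# Same character set as the codec: controls, backslash, braces, non-ASCII BMP.
_needs_escape = re.compile("[\x00-\x1f\\\\{}\x80-\uffff]")


def escape_rtf(text):
    """Escape text for RTF as \\uN? sequences (Basic Multilingual Plane only)."""
    def repl(match):
        codepoint = ord(match.group(0))
        return "\\u%d?" % (codepoint if codepoint < 32768 else codepoint - 65536)
    return _needs_escape.sub(repl, text)


def make_rtf(body_text):
    # Minimal header: RTF version 1, ANSI character set, a single font.
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\f0 "
    return header + escape_rtf(body_text) + "}"


document = make_rtf("Total: 25 €")
print(document)
```

Because every non-ASCII character has been replaced by an escape, the resulting string is pure ASCII and can be written to disk in that encoding.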

    Note that the code above only supports UCS-2 Unicode (\uxxxx, 2 bytes), not UCS-4 (\Uxxxxxxxx, 4 bytes). rtfunicode 1.1 supports the latter by simply encoding the UTF-16 surrogate pair as two \uDDDDD? signed integers.
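
The surrogate-pair approach can be sketched in Python 3 as follows. This is not the code from rtfunicode 1.1, only an illustration of the idea: encode each character that needs escaping to UTF-16 and write every 16-bit unit as a signed decimal. The function name `rtf_escape_utf16` is hypothetical.

```python
import struct


def rtf_escape_utf16(text):
    """Escape text for RTF, emitting two \\uN? escapes for non-BMP characters."""
    out = []
    for ch in text:
        # Printable ASCII passes through, except RTF's own special characters.
        if " " <= ch < "\x80" and ch not in "\\{}":
            out.append(ch)
            continue
        data = ch.encode("utf-16-be")
        # "h" is a signed 16-bit integer: one unit for a BMP character,
        # two units (a surrogate pair) for anything above U+FFFF.
        units = struct.unpack(">%dh" % (len(data) // 2), data)
        out.extend("\\u%d?" % unit for unit in units)
    return "".join(out)


print(rtf_escape_utf16("€"))           # \u8364?
print(rtf_escape_utf16("\U0001F600"))  # \u-10179?\u-8704?
```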
