Euro sign issue when reading an RTF file with Python

Submitted by 心已入冬 on 2019-12-31 03:24:06

Question


I need to generate a document in RTF using Python and pyRTF. Everything works: I have no problem with accented letters, and even the euro sign is accepted without errors, but instead of €, I get this sign: ¤. I encode the strings this way:

x.encode("iso-8859-15")

I googled a lot but was not able to solve this issue. What do I have to do to get the euro sign?


Answer 1:


The RTF standard uses UTF-16, but shaped to fit the RTF command sequence format; this is documented at http://en.wikipedia.org/wiki/Rich_Text_Format#Character_encoding. Unfortunately, pyRTF doesn't do any encoding for you; handling this has been on the project's TODO list, but the library was abandoned before anyone got to it.

This is based on code I used in a project recently. I've since released it as rtfunicode on PyPI, with support for Python 2 and 3. The Python 2 version:

import codecs
import re

_charescape = re.compile(u'([\x00-\x1f\\\\{}\x80-\uffff])')
def _replace(match):
    codepoint = ord(match.group(1))
    # Convert codepoint into a signed integer, insert into escape sequence
    return '\\u%s?' % (codepoint if codepoint < 32768 else codepoint - 65536)    


def rtfunicode_encode(text, errors):
    # Encode to RTF \uDDDDD? signed 16-bit integers and replacement char
    return _charescape.sub(_replace, text).encode('ascii')


class Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        return rtfunicode_encode(input, errors), len(input)


class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        return rtfunicode_encode(input, self.errors)


class StreamWriter(Codec, codecs.StreamWriter):
    pass


def rtfunicode(name):
    if name == 'rtfunicode':
        return codecs.CodecInfo(
            name='rtfunicode',
            encode=Codec().encode,
            decode=Codec().decode,
            incrementalencoder=IncrementalEncoder,
            streamwriter=StreamWriter,
        )

codecs.register(rtfunicode)

Instead of encoding to "iso-8859-15", you can then encode to 'rtfunicode':

>>> u'\u20AC'.encode('rtfunicode') # EURO currency symbol
'\\u8364?'

Encode any text you insert into your RTF document this way.

Note that the code above only supports UCS-2 Unicode (\uxxxx, 2 bytes), not UCS-4 (\Uxxxxxxxx, 4 bytes). rtfunicode 1.1 supports the latter by simply encoding each UTF-16 surrogate pair as two \uDDDDD? signed integers.
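The surrogate-pair approach can be sketched in Python 3 as follows. This is only an illustration under stated assumptions, not the rtfunicode implementation: the function name rtf_escape is hypothetical, and it escapes the same character classes as the regular expression above (control characters, backslash, braces, and anything from U+0080 up).

```python
import struct


def rtf_escape(text):
    """Return text with RTF-unsafe characters replaced by \\uN? escapes.

    Hypothetical helper for illustration; not part of pyRTF or rtfunicode.
    """
    out = []
    for ch in text:
        if ch in '\\{}' or ord(ch) < 0x20 or ord(ch) > 0x7f:
            # Encoding to UTF-16 turns a character outside the BMP
            # into a surrogate pair, i.e. two 16-bit code units.
            data = ch.encode('utf-16-le')
            # '<h' reads each code unit as a signed 16-bit integer,
            # which is the form the RTF \u control word expects.
            for (unit,) in struct.iter_unpack('<h', data):
                out.append('\\u%d?' % unit)
        else:
            out.append(ch)
    return ''.join(out)


print(rtf_escape('\u20ac'))      # EURO SIGN, one code unit
print(rtf_escape('\U0001F600'))  # outside the BMP, two code units
```

The trailing ? in each escape is the fallback character shown by readers that do not understand \u.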




Answer 2:


The good news is that you're not doing anything wrong. The bad news is that the RTF is being read as ISO 8859-1 regardless. In ISO 8859-15 the euro sign is byte 0xA4, and ISO 8859-1 maps that same byte to ¤.

>>> print u'€'.encode('iso-8859-15').decode('iso-8859-1')
¤

You'll need to use a Unicode escape if you want it to be read properly.

>>> print hex(ord(u'€'))
0x20ac
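As a rough sketch of what such an escape looks like (written for Python 3, with variable names chosen here for illustration), the code point is written as a signed 16-bit decimal after \u, followed by a fallback character for readers that do not understand the escape:

```python
codepoint = ord('\u20ac')  # 0x20AC, EURO SIGN

# RTF expects a signed 16-bit decimal, so values of 32768 and above wrap negative
signed = codepoint if codepoint < 32768 else codepoint - 65536

# '?' is the fallback character shown by readers that ignore \u
escape = '\\u%d?' % signed
print(escape)  # \u8364?
```

Inserting that ASCII sequence into the RTF text in place of the encoded euro byte avoids any dependence on the code page the reader assumes.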


Source: https://stackoverflow.com/questions/10852810/euro-sign-issue-when-reading-an-rtf-file-with-python
