I need to generate an RTF document using Python and pyRTF. Everything works: I have no problem with accented letters, and it even accepts the euro sign without errors, but instead
The good news is that you're not doing anything wrong. The bad news is that the RTF is being read as ISO 8859-1 regardless.
>>> print u'€'.encode('iso-8859-15').decode('iso-8859-1')
¤
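For readers on Python 3, the same round trip can be sketched as below. This is only an illustration of the mismatch, assuming the euro sign was written as ISO 8859-15 and then read back as ISO 8859-1; the variable names are made up for the example.

```python
euro = "\u20ac"

# ISO 8859-15 stores the euro sign as the single byte 0xA4.
raw = euro.encode("iso-8859-15")

# ISO 8859-1 maps that same byte to U+00A4, the generic currency sign.
misread = raw.decode("iso-8859-1")

print(raw, hex(ord(misread)))
```

The byte itself never changes; only the interpretation of it does, which is why no error is raised anywhere.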
You'll need to use a Unicode escape if you want it to be read properly.
>>> print hex(ord(u'€'))
0x20ac
The RTF standard uses UTF-16, but shaped to fit the RTF command sequence format; this is documented at http://en.wikipedia.org/wiki/Rich_Text_Format#Character_encoding. Unfortunately, pyRTF doesn't do any encoding for you; handling this has been on the project's TODO list, but they evidently never got to it before abandoning the library.
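To make the "shaped to fit" part concrete, here is a minimal Python 3 sketch of how one 16-bit code unit becomes an RTF \uN? escape. The helper name rtf_escape_codepoint is hypothetical, and the trailing ? is the fallback character for readers that don't understand \u.

```python
def rtf_escape_codepoint(codepoint):
    """Return the RTF \\uN? escape for one 16-bit code unit (hypothetical helper)."""
    if not 0 <= codepoint <= 0xFFFF:
        raise ValueError("expected a single 16-bit code unit")
    # RTF control word parameters are signed 16-bit integers,
    # so values of 32768 and above wrap around to negative numbers.
    signed = codepoint if codepoint < 0x8000 else codepoint - 0x10000
    return "\\u%d?" % signed


print(rtf_escape_codepoint(0x20AC))  # \u8364?
print(rtf_escape_codepoint(0xFFFD))  # \u-3?
```

The euro sign sits below 32768, so its escape is simply the decimal form of 0x20ac.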
This is based on code I used in a project recently. I've since released it as rtfunicode on PyPI, with support for Python 2 and 3; the Python 2 version:
import codecs
import re

# Match control characters, backslash, braces and anything outside ASCII
_charescape = re.compile(u'([\x00-\x1f\\\\{}\x80-\uffff])')


def _replace(match):
    codepoint = ord(match.group(1))
    # Convert codepoint into a signed integer, insert into escape sequence
    return '\\u%s?' % (codepoint if codepoint < 32768 else codepoint - 65536)


def rtfunicode_encode(text, errors):
    # Encode to RTF \uDDDDD? signed 16-bit integers and replacement char
    return _charescape.sub(_replace, text).encode('ascii')


class Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        return rtfunicode_encode(input, errors), len(input)


class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        return rtfunicode_encode(input, self.errors)


class StreamWriter(Codec, codecs.StreamWriter):
    pass


def rtfunicode(name):
    # Codec search function: return None for any other codec name
    if name == 'rtfunicode':
        return codecs.CodecInfo(
            name='rtfunicode',
            encode=Codec().encode,
            # Decoding is not implemented; this is the inherited stub
            decode=Codec().decode,
            incrementalencoder=IncrementalEncoder,
            streamwriter=StreamWriter,
        )


codecs.register(rtfunicode)
Instead of encoding to 'iso-8859-15', you can then encode to 'rtfunicode':
>>> u'\u20AC'.encode('rtfunicode') # EURO currency symbol
'\\u8364?'
Encode any text you insert into your RTF document this way.
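As a rough illustration of where the escaped text ends up, here is a self-contained Python 3 sketch that escapes a string and drops it into a hand-written RTF skeleton. The rtf_escape and minimal_rtf helpers are hypothetical stand-ins (they are neither the pyRTF API nor the rtfunicode package), and the header is just a bare-bones example.

```python
import re

# Same character set as the codec: control chars, backslash, braces, non-ASCII.
_charescape = re.compile(r"[\x00-\x1f\\{}\x80-\uffff]")


def rtf_escape(text):
    """Escape text as RTF \\uN? sequences (hypothetical stand-in for the codec)."""
    def _replace(match):
        codepoint = ord(match.group(0))
        # Signed 16-bit value, followed by '?' as the fallback character.
        return "\\u%d?" % (codepoint if codepoint < 32768 else codepoint - 65536)
    return _charescape.sub(_replace, text)


def minimal_rtf(body_text):
    """Wrap escaped text in a bare-bones RTF skeleton (illustrative only)."""
    header = "{\\rtf1\\ansi\\deff0 {\\fonttbl{\\f0 Arial;}}\\f0 "
    return header + rtf_escape(body_text) + "}"


document = minimal_rtf("Price: 10 \u20ac")
print(document)  # {\rtf1\ansi\deff0 {\fonttbl{\f0 Arial;}}\f0 Price: 10 \u8364?}
```

The result is pure ASCII, so it can be written to disk without caring which code page the RTF reader assumes.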
Note that the code above only supports UCS-2 Unicode (\uxxxx, 2 bytes), not UCS-4 (\Uxxxxxxxx, 4 bytes); rtfunicode 1.1 supports these by simply encoding the UTF-16 surrogate pair as two \uDDDDD? signed integers.
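The surrogate pair approach can be sketched in Python 3 as below. This is one possible way to do it, not necessarily how rtfunicode implements it; the function name rtf_escape_via_utf16 is made up for the example.

```python
import struct


def rtf_escape_via_utf16(char):
    """Escape one character as one or two \\uN? sequences (illustrative sketch)."""
    utf16 = char.encode("utf-16-le")
    # 'h' unpacks each 2-byte UTF-16 code unit as a signed 16-bit integer,
    # which is exactly the form the RTF \u control word expects.
    units = struct.unpack("<%dh" % (len(utf16) // 2), utf16)
    return "".join("\\u%d?" % unit for unit in units)


print(rtf_escape_via_utf16("\u20ac"))      # \u8364?
print(rtf_escape_via_utf16("\U0001F600"))  # \u-10179?\u-8704?
```

U+1F600 encodes to the surrogates 0xD83D and 0xDE00, both above 32767, so both escapes come out negative.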