I need to generate a document in RTF using Python and pyRTF, everything is ok: I have no problem with accented letters, it accepts even the euro sign without errors, but instead
RTF represents non-ASCII characters as UTF-16 code units, shaped to fit the RTF command sequence format: each unit is written as a \uN? control word, where N is a signed 16-bit decimal integer and ? is a replacement character. This is documented at http://en.wikipedia.org/wiki/Rich_Text_Format#Character_encoding. Unfortunately, pyRTF doesn't do any encoding for you; handling this was on the project's TODO list, but the library was abandoned before anyone got to it.
The codec below is based on code I used in a project recently. I've since released it as rtfunicode on PyPI, with support for Python 2 and 3; this is the Python 2 version:
import codecs
import re

# Characters that need escaping: control characters, backslash, braces,
# and everything outside ASCII up to U+FFFF.
_charescape = re.compile(u'([\x00-\x1f\\\\{}\x80-\uffff])')

def _replace(match):
    codepoint = ord(match.group(1))
    # Convert codepoint into a signed integer, insert into escape sequence
    return '\\u%s?' % (codepoint if codepoint < 32768 else codepoint - 65536)

def rtfunicode_encode(text, errors):
    # Encode to RTF \uDDDDD? signed 16-bit integers and replacement char
    return _charescape.sub(_replace, text).encode('ascii')

class Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        return rtfunicode_encode(input, errors), len(input)

class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        return rtfunicode_encode(input, self.errors)

class StreamWriter(Codec, codecs.StreamWriter):
    pass

def rtfunicode(name):
    # Codec search function: return the codec only for our own name
    if name == 'rtfunicode':
        return codecs.CodecInfo(
            name='rtfunicode',
            encode=Codec().encode,
            decode=Codec().decode,
            incrementalencoder=IncrementalEncoder,
            streamwriter=StreamWriter,
        )

codecs.register(rtfunicode)
Rather than encoding to "iso-8859-15", you can then encode to 'rtfunicode':
>>> u'\u20AC'.encode('rtfunicode') # EURO currency symbol
'\\u8364?'
Encode any text you insert into your RTF document this way.
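To illustrate where the escaped text ends up, here is a rough Python 3 sketch that applies the same regular expression as a plain function and drops the result into a hand-written, bare-bones RTF document. The helper names rtf_escape and minimal_rtf and the document skeleton are assumptions made up for this example; they are not part of pyRTF or rtfunicode.

```python
import re

# Same character class as the codec: control characters, backslash,
# braces, and everything from U+0080 to U+FFFF.
_charescape = re.compile('([\x00-\x1f\\\\{}\x80-\uffff])')


def _replace(match):
    codepoint = ord(match.group(1))
    # RTF wants a signed 16-bit decimal, followed by a replacement character.
    return '\\u%d?' % (codepoint if codepoint < 32768 else codepoint - 65536)


def rtf_escape(text):
    """Hypothetical helper: escape text for RTF (BMP characters only)."""
    return _charescape.sub(_replace, text)


def minimal_rtf(body):
    """Hypothetical helper: wrap escaped text in a bare-bones RTF document."""
    # Assumed skeleton: header, one paragraph, closing brace.
    return '{\\rtf1\\ansi ' + rtf_escape(body) + '\\par}'


print(minimal_rtf('Price: 10 \u20ac'))
# {\rtf1\ansi Price: 10 \u8364?\par}
```

Because braces and backslashes in the body are escaped too, user-supplied text cannot break out of the surrounding RTF group.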
Note that the code above only supports UCS-2 Unicode (\uxxxx, 2 bytes), not UCS-4 (\Uxxxxxxxx, 4 bytes); rtfunicode 1.1 supports the latter by simply encoding the UTF-16 surrogate pair as two \uDDDDD? signed integers.
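The surrogate-pair approach can be sketched in Python 3 as follows. This is only an illustration of the idea, not the actual rtfunicode 1.1 implementation; the function name rtf_escape_char is made up for the example. It encodes a character to UTF-16 and reads every 16-bit unit back as a signed integer.

```python
import struct


def rtf_escape_char(char):
    """Hypothetical helper: escape one character, including non-BMP ones."""
    # UTF-16 (little-endian, no BOM) yields one 16-bit unit for BMP
    # characters and a surrogate pair (two units) for anything above U+FFFF.
    data = char.encode('utf-16-le')
    # 'h' unpacks each unit as a signed 16-bit integer, as RTF expects.
    units = struct.unpack('<%dh' % (len(data) // 2), data)
    return ''.join('\\u%d?' % unit for unit in units)


print(rtf_escape_char('\u20ac'))      # BMP character: \u8364?
print(rtf_escape_char('\U0001F600'))  # surrogate pair: \u-10179?\u-8704?
```

U+1F600 becomes the surrogates 0xD83D and 0xDE00; both are above 32767, so they come out as the negative values -10179 and -8704.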