Outputting unicode text to an RTF file in python

Question

I am trying to output Unicode text to an RTF file from a Python script. For background, Wikipedia says:

For a Unicode escape the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode UTF-16 code unit number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter bāʼ ب, specifying that older programs which do not have Unicode support should render it as a question mark instead.

There is also this question on outputting RTF from Java and this one on doing so in C#.

However, what I can't figure out is how to output the unicode code point as a "16-bit signed decimal integer with the Unicode UTF-16 code unit number" from Python. I've tried this:

for char in unicode_string:
    print '\\' + 'u' + str(ord(char)) + '?',

but the output renders as gibberish when opened in a word processor. The problem appears to be that the number is not the UTF-16 code unit number. I am not sure how to get that, though: one can encode a string in UTF-16, but how does one get the code unit number?

Incidentally, PyRTF does not support Unicode (it's listed as a "todo"). pyrtf-NG is supposed to, but that project does not appear to be maintained and has little documentation, so I am wary of using it in a quasi-production system.

Edit: My mistake. There were two bugs in the code above. As Wobble pointed out below, the string has to be a unicode string, not an already-encoded one, and the code above produces a result with spaces between the characters. The correct code is this:

convertstring=""
for char in unicode(<my_encoded_string>,'utf-8'):
    convertstring = convertstring + '\\' + 'u' + str(ord(char)) + '?'

This works fine, at least with OpenOffice. I am leaving this here as a reference for others (one mistake further corrected after discussion below).

Answer 1:

Based on the information in your latest edit, I think this function will work properly, though see the improved version below.

def rtf_encode(unistr):
    return ''.join([c if ord(c) < 128 else u'\\u' + unicode(ord(c)) + u'?' for c in unistr])

>>> test_unicode = u'\xa92012'
>>> print test_unicode
©2012
>>> test_utf8 = test_unicode.encode('utf-8')
>>> print test_utf8
©2012
>>> print rtf_encode(test_utf8.decode('utf-8'))
\u169?2012

Here's another version that's broken down a little to be easier to understand. I also made it consistent in returning an ASCII string rather than keeping Unicode and flubbing it at the join. It also incorporates a fix based on the comments.

def rtf_encode_char(unichar):
    code = ord(unichar)
    if code < 128:
        return str(unichar)
    return '\\u' + str(code if code <= 32767 else code-65536) + '?'

def rtf_encode(unistr):
    return ''.join(rtf_encode_char(c) for c in unistr)
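
For anyone on Python 3, here is a minimal sketch of the same idea that takes the "UTF-16 code unit" wording literally. It assumes the input is a str with no lone surrogates, and it leans on struct to read each code unit back as a signed 16-bit value instead of subtracting 65536 by hand. Treat it as an illustration, not a tested replacement for the functions above.

```python
import struct

def rtf_encode(text):
    """Escape non-ASCII characters in text as RTF \\uN? control words."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            # Plain ASCII passes through unchanged.
            out.append(ch)
            continue
        # Encode to UTF-16 (little-endian, no BOM), then unpack every
        # 16-bit code unit as a signed short, which is what \u expects.
        raw = ch.encode('utf-16-le')
        units = struct.unpack('<%dh' % (len(raw) // 2), raw)
        out.extend('\\u%d?' % unit for unit in units)
    return ''.join(out)

print(rtf_encode('\xa92012'))  # \u169?2012
```

A code point above U+FFFF comes out as two \u control words, one per surrogate, each followed by its own "?" fallback character.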

Answer 2:

Mark Ransom's answer isn't quite correct: it won't encode code points over U+7FFF correctly, nor will it escape characters below 0x20, as recommended by the RTF standard.

I've created a simple module that encodes python unicode to RTF control codes called rtfunicode, and wrote about the subject on my blog.

In summary, my method uses a regular expression to map the right codepoints to RTF control codes suitable for inclusion in either PyRTF or pyrtf-ng.
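
To give a rough idea of what a regex-based approach can look like, here is a minimal Python 3 sketch. It is not the code from rtfunicode; the names _NEEDS_ESCAPE, _escape and rtf_escape are made up for illustration, and the choice of characters to escape (controls below 0x20, the RTF specials backslash and braces, and everything outside ASCII) is an assumption.

```python
import re

# Hypothetical character class: C0 controls, the RTF specials \ { },
# and every code point from U+0080 upwards.
_NEEDS_ESCAPE = re.compile(r'[\x00-\x1f\\{}\x80-\U0010ffff]')

def _escape(match):
    code = ord(match.group())
    if code > 0xFFFF:
        # Outside the BMP: split into a UTF-16 surrogate pair.
        code -= 0x10000
        units = (0xD800 + (code >> 10), 0xDC00 + (code & 0x3FF))
    else:
        units = (code,)
    # \u takes a signed 16-bit decimal, so wrap values above 32767.
    return ''.join('\\u%d?' % (u - 0x10000 if u > 0x7FFF else u)
                   for u in units)

def rtf_escape(text):
    """Replace every character matched by _NEEDS_ESCAPE with \\uN?."""
    return _NEEDS_ESCAPE.sub(_escape, text)
```

Doing the substitution with a single compiled pattern means runs of ordinary ASCII text are skipped by the regex engine, and only the characters that need escaping reach the Python callback.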

Source: https://stackoverflow.com/questions/9908647/outputting-unicode-text-to-an-rtf-file-in-python

Tags

python

rtf