I\'m trying to output unicode string into RTF format. (using c# and winforms)
From wikipedia:
If a Unicode escape is required, the control wor
You will have to convert the string to a byte[]
array (using Encoding.Unicode.GetBytes(string)
), then loop through that array and prepend a \
and u
character to all Unicode characters you find. When you then convert the array back to a string, you'd have to leave the Unicode characters as numbers.
For example, if your array looks like this:
byte[] unicodeData = new byte[] { 0x15, 0x76 };
it would become:
// 5c = \, 75 = u
byte[] unicodeData = new byte[] { 0x5c, 0x75, 0x15, 0x76 };
Fixed code from accepted answer - added special character escaping, as described in this link
static string GetRtfUnicodeEscapedString(string s)
{
var sb = new StringBuilder();
foreach (var c in s)
{
if(c == '\\' || c == '{' || c == '}')
sb.Append(@"\" + c);
else if (c <= 0x7f)
sb.Append(c);
else
sb.Append("\\u" + Convert.ToUInt32(c) + "?");
}
return sb.ToString();
}
Provided that all the characters that you're catering for exist in the Basic Multilingual Plane (it's unlikely that you'll need anything more), then a simple UTF-16 encoding should suffice.
Wikipedia:
All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use.
The following sample program illustrates doing something along the lines of what you want:
static void Main(string[] args)
{
// ë
char[] ca = Encoding.Unicode.GetChars(new byte[] { 0xeb, 0x00 });
var sw = new StreamWriter(@"c:/helloworld.rtf");
sw.WriteLine(@"{\rtf
{\fonttbl {\f0 Times New Roman;}}
\f0\fs60 H" + GetRtfUnicodeEscapedString(new String(ca)) + @"llo, World!
}");
sw.Close();
}
static string GetRtfUnicodeEscapedString(string s)
{
var sb = new StringBuilder();
foreach (var c in s)
{
if (c <= 0x7f)
sb.Append(c);
else
sb.Append("\\u" + Convert.ToUInt32(c) + "?");
}
return sb.ToString();
}
The important bit is the Convert.ToUInt32(c)
which essentially returns the code point value for the character in question. The RTF escape for unicode requires a decimal unicode value. The System.Text.Encoding.Unicode
encoding corresponds to UTF-16 as per the MSDN documentation.
Based on the specification, here are some code in java which is tested and works:
public static String escape(String s){
if (s == null) return s;
int len = s.length();
StringBuilder sb = new StringBuilder(len);
for (int i = 0; i < len; i++){
char c = s.charAt(i);
if (c >= 0x20 && c < 0x80){
if (c == '\\' || c == '{' || c == '}'){
sb.append('\\');
}
sb.append(c);
}
else if (c < 0x20 || (c >= 0x80 && c <= 0xFF)){
sb.append("\'");
sb.append(Integer.toHexString(c));
}else{
sb.append("\\u");
sb.append((short)c);
sb.append("??");//two bytes ignored
}
}
return sb.toString();
}
The important thing is, you need to append 2 characters (close to the unicode character or just use ? instead) after the escaped uncode. because the unicode occupy 2 bytes.
Also the spec says your should use negative value if the code point greater than 32767, but in my test, it's fine if you don't use negative value.
Here is the spec:
\uN This keyword represents a single Unicode character which has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number. This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.
As with all RTF keywords, a keyword-terminating space may be present (before the ANSI characters) which is not counted in the characters to skip. While this is not likely to occur (or recommended), a \bin keyword, its argument, and the binary data that follows are considered one character for skipping purposes. If an RTF scope delimiter character (that is, an opening or closing brace) is encountered while scanning skippable data, the skippable data is considered to be ended before the delimiter. This makes it possible for a reader to perform some rudimentary error recovery. To include an RTF delimiter in skippable data, it must be represented using the appropriate control symbol (that is, escaped with a backslash,) as in plain text. Any RTF control word or symbol is considered a single character for the purposes of counting skippable characters.
An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.
RTF control words generally accept signed 16-bit numbers as arguments. For this reason, Unicode values greater than 32767 must be expressed as negative number