Convert Latin characters from Shift JIS to Latin characters in Unicode

Question

I'm working on parsing files with Shift-JIS encoded strings within the binary data. My current code is this:

public static string DecodeShiftJISString(this byte[] data, int index, int length)
{
    // Code page 932 is Windows Shift-JIS. Decode only the requested slice
    // of the buffer rather than the whole array.
    return Encoding.GetEncoding(932).GetString(data, index, length);
}

It works fine and I am able to get usable strings from this method. However, when I display strings containing Latin characters in my WinForms application, the characters are wider than normal.

[Screenshot: Latin characters in Shift-JIS string]

I'm not sure if this is an issue with my encoding logic, or the way I'm supposed to display the strings (I just pass them directly into my controls). Any help would be appreciated!

Answer 1:

These aren't normal ASCII characters; they're ‘fullwidth variants’, in the range from U+FF01 FULLWIDTH EXCLAMATION MARK upwards. They exist for lining up formatting when typesetting a mixture of Latin and CJK characters.
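To confirm that a decoded string really contains these characters, a check along the following lines may help. This is only a sketch: the class and the helper name ContainsFullwidthAscii are hypothetical, and it assumes you only care about the fullwidth ASCII variants, U+FF01 to U+FF5E, which mirror ASCII U+0021 to U+007E at a fixed offset of 0xFEE0.

```csharp
using System;
using System.Linq;

public static class FullwidthCheck
{
    // Hypothetical helper: returns true if any character is a
    // fullwidth ASCII variant (U+FF01..U+FF5E).
    public static bool ContainsFullwidthAscii(string text)
    {
        return text.Any(c => c >= '\uFF01' && c <= '\uFF5E');
    }

    public static void Main()
    {
        // "\uFF21\uFF22\uFF23" is the fullwidth form of "ABC".
        Console.WriteLine(ContainsFullwidthAscii("\uFF21\uFF22\uFF23")); // True
        Console.WriteLine(ContainsFullwidthAscii("ABC"));                // False
    }
}
```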

Unicode would prefer weird characters like these, which are just semantically identical stylistic variants of existing characters, not to exist. But it has to include them so that text can round-trip to legacy encodings like Shift-JIS. For this reason they are called compatibility characters.

You can convert compatibility characters to their basic variants by using Unicode normalisation with a ‘K’ (compatibility) form such as NFKC. In Win32 you can do this using NormalizeString(); in .NET, string.Normalize(NormalizationForm.FormKC) does the same job.
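A minimal sketch of that normalisation step in C#, using the framework's string.Normalize with NormalizationForm.FormKC. The class and the helper name ToBasicForm are hypothetical. Note that NFKC folds every compatibility character, not just fullwidth Latin: halfwidth katakana, for instance, become ordinary katakana, so this assumes that broader folding is acceptable for your data.

```csharp
using System;
using System.Text;

public static class FullwidthDemo
{
    // Hypothetical helper: folds compatibility characters (including
    // fullwidth Latin letters and punctuation) to their basic forms.
    public static string ToBasicForm(string text)
    {
        return text.Normalize(NormalizationForm.FormKC);
    }

    public static void Main()
    {
        // Fullwidth "ABC!" as it might come out of the Shift-JIS decoder.
        string fullwidth = "\uFF21\uFF22\uFF23\uFF01";
        Console.WriteLine(ToBasicForm(fullwidth)); // ABC!
    }
}
```

Applying this to the result of DecodeShiftJISString before assigning it to a control should give Latin characters of normal width.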

Source: https://stackoverflow.com/questions/33454529/convert-latin-characters-from-shift-jis-to-latin-characters-in-unicode

Tags

.net

unicode

encoding