Is the Unicode Basic Multilingual Plane enough for CJK speakers?

Submitted by 旧街凉风 on 2019-12-05 09:50:18

For people who might be looking for an actual answer to the actual question: the application that prompted this question is now in production allowing only characters from the BMP (actually a limited subset).
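A BMP-only restriction like the one described above could be enforced with a check along these lines. This is a minimal sketch, not the code used in that application; the function names `non_bmp_chars` and `is_bmp_only` are made up for illustration.

```python
BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane


def non_bmp_chars(text):
    """Return the code points in text that lie outside the BMP, as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) > BMP_MAX]


def is_bmp_only(text):
    """True if every code point in text is in the range U+0000..U+FFFF."""
    return not non_bmp_chars(text)


if __name__ == "__main__":
    # Hangul syllables are in the BMP; U+20BB7 (a variant of 吉 used in names) is not.
    for sample in ["한국어", "\U00020BB7野家"]:
        print(repr(sample), is_bmp_only(sample), non_bmp_chars(sample))
```

Reporting the offending code points, rather than silently stripping them, at least tells you when a user actually needed something outside the BMP.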

Multiple international customers are using it in Korean in production, and Japanese is going live soon. Chinese is in planning (I have my doubts that the BMP will be sufficient for that, but we'll see I guess).

It's fine - no reported issues related to unsupported characters.

But that's just anecdotal evidence, really. Just because my customers were fine with it - that doesn't mean yours will be. For context, customers of the app are international companies, hundreds of employees using the application to process hundreds of thousands of their customers.

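The majority of commonly used CJK codepoints are defined in the BMP, including the main CJK Unified Ideographs block (U+4E00 to U+9FFF) and Extension A. However, the ideograph extension blocks from Extension B onward are not; they live in the supplementary planes. So if you do not need to support those rarer ideographs, then the BMP is fine, otherwise it is not.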
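To see where a particular character lives, you can compute its plane directly from the code point: the plane number is the code point shifted right by 16 bits, and plane 0 is the BMP. A small sketch (the helper name `plane_of` is invented here):

```python
def plane_of(ch):
    """Return the Unicode plane number (0 = BMP) of a single character."""
    return ord(ch) >> 16


if __name__ == "__main__":
    # 中 is U+4E2D, 한 is U+D55C, and U+20BB7 is an Extension B ideograph.
    for ch in ["中", "한", "\U00020BB7"]:
        print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")
```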

However, I would consider any implementation that does not recognize and process UTF-16 surrogates, even if it does not handle the Unicode codepoints they represent, to be broken.
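For reference, recognizing surrogates is not much work. In UTF-16, a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF) encodes one code point at or above U+10000. The sketch below shows the arithmetic in Python; the function names are illustrative, and a real application would normally rely on its platform's string library rather than hand-rolling this.

```python
def utf16_units(text):
    """Encode text as a list of 16-bit UTF-16 code units."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def decode_utf16_units(units):
    """Turn UTF-16 code units into code points, combining surrogate pairs.

    Raises ValueError on an unpaired surrogate.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:  # high surrogate: must be followed by a low one
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unpaired high surrogate at index {i}")
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:  # low surrogate with no preceding high one
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:  # ordinary BMP code unit
            code_points.append(unit)
            i += 1
    return code_points


if __name__ == "__main__":
    units = utf16_units("a\U00020BB7")
    print([hex(u) for u in units])
    print([hex(cp) for cp in decode_utf16_units(units)])
```

Even an application that rejects non-BMP input should walk the code units this way, so that a surrogate pair is treated as one unsupported character instead of two garbage ones.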

Unfortunately, CJK support in Unicode is broken. The BMP is not enough to properly support CJK, but worse than that, even if you do implement full support for all Unicode planes it is still broken.

The basic problem is that they tried to merge characters from all three languages that look kinda similar but are not really the same. The result is that they only look right if you select the correct font to display them. For example, a particular character will only look right to a Chinese person if you render it with a Chinese font, and only look right to a Japanese person if you render it with a Japanese font.

There is no universal font. There is no way to determine which language a character is supposed to be from, so you have to somehow guess which font to use. You can try to examine the system language or some other hack like that. You can't support two languages in the same document unless you have additional metadata. If you get raw Unicode strings without any indication of what language they are in, you are screwed.
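If you do have language metadata (for example a language tag stored alongside each string), font selection can be driven from it. The following is only a sketch under stated assumptions: the tag-to-font table, the fallback font, and the function name `font_for` are all illustrative choices, and the Noto Sans CJK families are used merely as an example of per-language font variants.

```python
# Illustrative mapping from language tags to font families (an assumption,
# not a recommendation; substitute whatever fonts your platform ships).
FONT_BY_LANG = {
    "zh-Hans": "Noto Sans CJK SC",
    "zh-Hant": "Noto Sans CJK TC",
    "ja": "Noto Sans CJK JP",
    "ko": "Noto Sans CJK KR",
}


def font_for(lang_tag, default="Noto Sans CJK JP"):
    """Pick a font for a language tag such as 'ja-JP' or 'zh-Hans-CN'.

    Tries the full tag first, then drops trailing subtags one at a time.
    The default is arbitrary; without metadata there is no right answer.
    """
    parts = lang_tag.split("-")
    while parts:
        key = "-".join(parts)
        if key in FONT_BY_LANG:
            return FONT_BY_LANG[key]
        parts.pop()
    return default


if __name__ == "__main__":
    # A mixed-language document has to carry a tag per run of text.
    runs = [("ja-JP", "直"), ("zh-Hans-CN", "直")]
    for lang, text in runs:
        print(text, "->", font_for(lang))
```

The same code point appears in both runs; only the attached tag lets the renderer choose the glyph shape each reader expects.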

It's a total disaster. You need to talk to your clients to figure out their needs and how they indicate to their systems what font to use for broken Unicode characters.

Edit: Also need to mention, some characters required for people's names are missing from Unicode. Later revisions are better, but of course you also need updated fonts to take advantage of them.

Unless you are a font developer or developing an operating system, you should not care about that; let the OS layer deal with it.

Just implement proper Unicode support in your application and allow the operating system to deal with how the characters are typed and displayed.

If you are using custom fonts in your application, you may be in trouble.

In the end, to answer your question: NO. Unicode is more than just the BMP, and you need to support all of Unicode.
