Question:
In your experience which Unicode characters, codepoints, ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones which require 4 bytes in UTF-8 or surrogates in UTF-16.
I would've expected the answer to be Chinese and Japanese characters used in names but not included in the most widespread CJK multibyte character sets, but on the project I do most work on, the English Wiktionary, we have found that the Gothic alphabet is far more common so far.
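To make the "4 bytes in UTF-8 or surrogates in UTF-16" point concrete, here is a minimal Python sketch that inspects a single character. The helper name `describe` is hypothetical, and U+10330 GOTHIC LETTER AHSA is picked only as a sample from the Gothic block mentioned above.

```python
def describe(ch: str) -> dict:
    """Summarise how one character is encoded (illustrative helper, not from the question)."""
    cp = ord(ch)
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
    # Split the UTF-16 bytes into 16-bit code units.
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    return {
        "code_point": f"U+{cp:04X}",
        "is_bmp": cp <= 0xFFFF,
        "utf8_bytes": len(utf8),
        "utf16_units": [f"{u:04X}" for u in units],
    }


if __name__ == "__main__":
    print(describe("A"))           # a BMP character
    print(describe("\U00010330"))  # a Gothic letter, outside the BMP
```

For a code point above U+FFFF, the result should report four UTF-8 bytes and two UTF-16 code units (a high surrogate followed by a low surrogate).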
UPDATE
I've written a couple of software tools to scan entire Wikipedias for non-BMP characters and found, to my surprise, that even in the Japanese Wikipedia the Gothic alphabet is the most common. This is also true in the Chinese Wikipedia, but it also had many Chinese characters used up to 50 or 70 times each, including "?", "?", and "?".
Answer 1:
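The scanning tools themselves are not shown in the question. As a rough sketch of the idea only, and assuming the text is already available as a decoded Python string, counting non-BMP characters could look like this. The function names `count_non_bmp` and `report` are hypothetical.

```python
from collections import Counter
import unicodedata


def count_non_bmp(text: str) -> Counter:
    """Count every character whose code point lies above U+FFFF."""
    return Counter(ch for ch in text if ord(ch) > 0xFFFF)


def report(counts: Counter) -> list[str]:
    """Format the counts in descending frequency, one line per code point."""
    lines = []
    for ch, n in counts.most_common():
        # Unassigned or private-use code points have no name in the database.
        name = unicodedata.name(ch, "<no name>")
        lines.append(f"{n:>8}  U+{ord(ch):04X}  {name}")
    return lines


if __name__ == "__main__":
    sample = "a\U00010330\U00010330\U0001F602b"  # made-up sample text
    for line in report(count_non_bmp(sample)):
        print(line)
```

For a full Wikipedia dump, the same counter could be updated line by line while streaming the file, so the whole dump never has to be held in memory.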
Emoji are now the most common non-BMP characters by far. 😂, otherwise known as U+1F602 FACE WITH TEARS OF JOY, is the most common one on Twitter's public stream. It occurs more frequently than the tilde!
Answer 2:
Excellent question!
The answer is the mathematical letters. This past December I did a scan of the entire PubMed Open Access corpus, and came up with these figures for astral characters in it.
The first number in the figures below is how many copies of each given code point I found in the entire corpus. First, though, to give you a notion of the relative frequencies, here are the top ten trans-ASCII code points in that corpus:
And here now are the trans-BMP code points, in order of descending frequency:
I really wish I knew what they were using U+100002 to do. :(
If those aren't showing up in your browser, you should install George Douros’s Symbola font. It also has all the fun Unicode 6.0.0 code points in it.
Answer 3:
For me, the Mathematical Alphanumeric Symbols that are used for math typesetting with OpenType fonts such as Cambria Math.