surrogate-pairs

Is there encoding in Unicode where every “character” is just one code point?

落花浮王杯 提交于 2019-12-06 00:32:02
Trying to rephrase: Can you map every combining character combination into one code point? I'm new to Unicode, but it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct? Is this true for Basic Multilingual Plane also? If you mean one char == one number (ie: where every char is represented by the same number of bytes/words/what-have-you): in UCS-4, each character is represented by a 4-byte number. That's way more than big enough for every character to be represented by a single value, but

Unicode surrogate pairs

為{幸葍}努か 提交于 2019-12-04 21:25:27
Say I have a surrogate pair. For example: \u306f\u30fc Is there a function I can use to print the character to the screen? If you want to do it manually: echo chr(0x30) . chr(0x6f) . chr(0x30) . chr(0xfc); If you have the string, you could always do: $callback = function($match) { return chr(hexdec($match[1])) . chr(hexdec($match[2])); } preg_replace_callback('#\\\\u([0-9a-f]{2})([0-9a-f]{2})#', $callback, $string); Or, if php < 5.3 $callback = create_function('$match', 'return chr(hexdec($match[1])) . chr(hexdec($match[2]));' ); preg_replace_callback('#\\\\u([0-9a-f]{2})([0-9a-f]{2})#',

Difference between composite characters and surrogate pairs

烂漫一生 提交于 2019-12-04 14:20:39
问题 In Unicode what is the difference between composite characters and surrogate pairs? To me they sound like similar things - two characters to represent one character. What differentiates these two concepts? 回答1: Surrogate pairs are a weird wart in Unicode. Unicode itself is nothing other than an abstract assignment of meaning to numbers. That's what an encoding is. Capital-letter-A, Greek-alternate-terminal-sigma, Klingon-closing-bracket-2, etc. currently, numbers up to about 2 21 are

How do I create a string with a surrogate pair inside of it?

♀尐吖头ヾ 提交于 2019-12-03 16:10:57
问题 I saw this post on Jon Skeet's blog where he talks about string reversing. I wanted to try the example he showed myself, but it seems to work... which leads me to believe that I have no idea how to create a string that contains a surrogate pair which will actually cause the string reversal to fail. How does one actually go about creating a string with a surrogate pair in it so that I can see the failure myself? 回答1: The term "surrogate pair" refers to a means of encoding Unicode characters

Difference between composite characters and surrogate pairs

末鹿安然 提交于 2019-12-03 08:14:27
In Unicode what is the difference between composite characters and surrogate pairs? To me they sound like similar things - two characters to represent one character. What differentiates these two concepts? Kerrek SB Surrogate pairs are a weird wart in Unicode. Unicode itself is nothing other than an abstract assignment of meaning to numbers. That's what an encoding is. Capital-letter-A, Greek-alternate-terminal-sigma, Klingon-closing-bracket-2, etc. currently, numbers up to about 2 21 are available, though not all are in use. In the context of Unicode, each number is know as a code point .

Python 2.7: Strange Unicode behavior

﹥>﹥吖頭↗ 提交于 2019-12-02 05:51:49
问题 I am experiencing the following behavior in Python 2.7: >>> a1 = u'\U0001f04f' #1 >>> a2 = u'\ud83c\udc4f' #2 >>> a1 == a2 #3 False >>> a1.encode('utf8') == a2.encode('utf8') #4 True >>> a1.encode('utf8').decode('utf8') == a2.encode('utf8').decode('utf8') #5 True >>> u'\ud83c\udc4f'.encode('utf8') #6 '\xf0\x9f\x81\x8f' >>> u'\ud83c'.encode('utf8') #7 '\xed\xa0\xbc' >>> u'\udc4f'.encode('utf8') #8 '\xed\xb1\x8f' >>> '\xd8\x3c\xdc\x4f'.decode('utf_16_be') #9 u'\U0001f04f' What is the

Output UTF-16? A little stuck

廉价感情. 提交于 2019-12-02 02:36:32
问题 I have some UTF-16 encoded characters in their surrogate pair form. I want to output those surrogate pairs as characters on the screen. Does anyone know how this is possible? 回答1: iconv('UTF-16', 'UTF-8', yourString) 回答2: Your question is a little unclear. If you have ASCII text with embedded UTF-16 escape sequences, you can convert everything to UTF-8 in this way: function unescape_utf16($string) { /* go for possible surrogate pairs first */ $string = preg_replace_callback( '/\\\\u(D[89ab][0

Checking for illegal surrogates in Python 3 strings

不想你离开。 提交于 2019-12-01 22:27:49
Specifically in Python 3.3 and above, is it sufficient to check for orphan surrogates by using the simple match: re.search(r'[\uD800-\uDFFF]', s) Based on the assumption that all legal surrogates would have been represented as astral code points and thus would not match, leaving out the illegal surrogates, or is there caveats and edge cases one needs to be aware of? Yes, that's correct. Code units 0xD800–0xDFFF don't represent valid characters in wide Unicode strings, and in Python 3.3+ (following PEP 393) all Unicode strings are effectively wide. 来源: https://stackoverflow.com/questions

How to reverse a string that contains surrogate pairs

泪湿孤枕 提交于 2019-12-01 07:31:18
I have written this method to reverse a string public string Reverse(string s) { if(string.IsNullOrEmpty(s)) return s; TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s); var elements = new List<char>(); while (enumerator.MoveNext()) { var cs = enumerator.GetTextElement().ToCharArray(); if (cs.Length > 1) { elements.AddRange(cs.Reverse()); } else { elements.AddRange(cs); } } elements.Reverse(); return string.Concat(elements); } Now, I don't want to start a discussion about how this code could be made more efficient or how there are one liners that I could use instead. I

Python: getting correct string length when it contains surrogate pairs

浪尽此生 提交于 2019-11-30 17:43:31
Consider the following exchange on IPython: In [1]: s = u'華袞與緼𦅷同歸' In [2]: len(s) Out[2]: 8 The correct output should have been 7 , but because the fifth of these seven Chinese characters has a high Unicode code-point, it is represented in UTF-8 by a "surrogate pair", rather than just one simple codepoint, and as a result Python thinks it is two characters rather than one. Even if I use unicodedata , which returns the surrogate pair correctly as a single codepoint ( \U00026177 ), when passed to len() the wrong length is still returned: In [3]: import unicodedata In [4]: unicodedata.normalize(