I've been playing around with JS and can't figure out how JS decides which elements to add to the created array when using Array.from()
. For example, the followin
It's all about the code points behind the characters. Some characters that look like a single glyph are made of more than one code point (each stored here as one 16-bit UTF-16 code unit), and Array.from returns each code point as a separate element. Check the character lists:
http://www.fileformat.info/info/charset/UTF-8/list.htm
http://www.fileformat.info/info/charset/UTF-16/list.htm
function displayHexUnicode(s) {
  // Print each UTF-16 code unit of s as four hex digits.
  console.log(s.split("").map(c => c.charCodeAt(0).toString(16).padStart(4, "0")).join(""));
}
displayHexUnicode('षि');
// forEach returns undefined, so it is not wrapped in console.log.
Array.from('षि').forEach(x => displayHexUnicode(x));
UTF-16 (the encoding used for strings in JS) uses 16-bit code units. Every Unicode code point that fits in 16 bits is represented as one code unit; everything else is represented as two code units, known as a surrogate pair. The string iterator iterates over code points, not code units.
UTF-16 on Wikipedia
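To make the code unit versus code point distinction concrete, here is a small sketch. The helper name describe is made up for this illustration; it only compares .length (which counts UTF-16 code units) with the number of elements Array.from produces (which follows the string iterator).

```javascript
// Hypothetical helper: compare code units (.length) with code points (string iterator).
function describe(s) {
  return {
    codeUnits: s.length,              // number of 16-bit UTF-16 code units
    codePoints: Array.from(s).length, // number of elements the string iterator yields
  };
}

console.log(describe("a"));            // { codeUnits: 1, codePoints: 1 }
console.log(describe("\u{1F600}"));    // { codeUnits: 2, codePoints: 1 } (surrogate pair)
console.log(describe("\u0937\u093F")); // { codeUnits: 2, codePoints: 2 } (the string 'षि')
```

The last case is the one from the question: two separate code points, each fitting in a single code unit, so no surrogate pair is involved.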
Array.from first tries to invoke the iterator of the argument if it has one, and strings do have iterators, so it invokes String.prototype[Symbol.iterator]. So let's look up how the prototype method works. It's described in the specification here:
- Let O be ? RequireObjectCoercible(this value).
- Let S be ? ToString(O).
- Return CreateStringIterator(S).
Looking up CreateStringIterator eventually takes you to 21.1.5.2.1 %StringIteratorPrototype%.next ( ), which does:
- Let cp be ! CodePointAt(s, position).
- Let resultString be the String value containing cp.[[CodeUnitCount]] consecutive code units from s beginning with the code unit at index position.
- Set O.[[StringNextIndex]] to position + cp.[[CodeUnitCount]].
- Return CreateIterResultObject(resultString, false).
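The same behaviour can be observed by driving the string iterator by hand. This is only a sketch of what happens for an iterable argument, not the engine's actual implementation; collectWithIterator is a name invented for this example.

```javascript
// Hypothetical helper: call the string iterator's next() manually and collect the values.
function collectWithIterator(s) {
  const iterator = s[Symbol.iterator](); // String.prototype[Symbol.iterator]
  const out = [];
  for (let step = iterator.next(); !step.done; step = iterator.next()) {
    out.push(step.value); // each value is one code point's worth of code units
  }
  return out;
}

const sample = "a\u{1F600}\u0937\u093F";
console.log(collectWithIterator(sample).length); // 4
console.log(Array.from(sample).length);          // 4
console.log(sample.length);                      // 5 (the emoji takes two code units)
```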
The CodeUnitCount is what you're interested in. This number comes from CodePointAt:
- Let first be the code unit at index position within string.
- Let cp be the code point whose numeric value is that of first.
- If first is not a leading surrogate or trailing surrogate, then
  a. Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: false }.
- If first is a trailing surrogate or position + 1 = size, then
  a. Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: true }.
- Let second be the code unit at index position + 1 within string.
- If second is not a trailing surrogate, then
  a. Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: true }.
- Set cp to ! UTF16DecodeSurrogatePair(first, second).
- Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 2, [[IsUnpairedSurrogate]]: false }.
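The CodePointAt steps translate almost line for line into JavaScript. The following is a rough sketch, assuming position is a valid index into the string; the names codePointAtSketch, isLeadingSurrogate and isTrailingSurrogate are invented here, and the pair-decoding arithmetic is the standard UTF-16 formula rather than anything quoted from the spec text above.

```javascript
// Hypothetical helpers for the two surrogate ranges.
const isLeadingSurrogate = (u) => u >= 0xd800 && u <= 0xdbff;
const isTrailingSurrogate = (u) => u >= 0xdc00 && u <= 0xdfff;

// Sketch of the CodePointAt abstract operation (assumes 0 <= position < string.length).
function codePointAtSketch(string, position) {
  const size = string.length;
  const first = string.charCodeAt(position);
  let cp = first;
  if (!isLeadingSurrogate(first) && !isTrailingSurrogate(first)) {
    return { codePoint: cp, codeUnitCount: 1, isUnpairedSurrogate: false };
  }
  if (isTrailingSurrogate(first) || position + 1 === size) {
    return { codePoint: cp, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  const second = string.charCodeAt(position + 1);
  if (!isTrailingSurrogate(second)) {
    return { codePoint: cp, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  // Standard UTF-16 surrogate pair decoding.
  cp = (first - 0xd800) * 0x400 + (second - 0xdc00) + 0x10000;
  return { codePoint: cp, codeUnitCount: 2, isUnpairedSurrogate: false };
}

console.log(codePointAtSketch("\u{1F600}", 0).codeUnitCount);    // 2
console.log(codePointAtSketch("\u0937\u093F", 0).codeUnitCount); // 1
```

Only the first call reports two code units, because only there does a leading surrogate sit directly in front of a trailing surrogate.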
So, when iterating over a string with Array.from, CodePointAt returns a CodeUnitCount of 2 only when the code unit in question is the start of a surrogate pair. Code units that are interpreted as surrogates are described here:
Such operations apply special treatment to every code unit with a numeric value in the inclusive range 0xD800 to 0xDBFF (defined by the Unicode Standard as a leading surrogate, or more formally as a high-surrogate code unit) and every code unit with a numeric value in the inclusive range 0xDC00 to 0xDFFF (defined as a trailing surrogate, or more formally as a low-surrogate code unit) using the following rules..:
षि is not a surrogate pair:
console.log('षि'.charCodeAt()); // First character code: 2359, or 0x937
console.log('षि'.charCodeAt(1)); // Second character code: 2367, or 0x93F
But