I've been playing around with JS and can't figure out how JS decides which elements to add to the created array when using Array.from(). For example, the followin
Array.from first tries to invoke the iterator of the argument if it has one. Strings do have iterators, so it invokes String.prototype[Symbol.iterator]. Let's look up how that prototype method works. It's described in the specification here:
- Let O be ? RequireObjectCoercible(this value).
- Let S be ? ToString(O).
- Return CreateStringIterator(S).
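To see that Array.from really goes through this iterator, you can call it by hand. This is only an illustrative sketch; the variable names and the sample string 'ab' are made up for the example.

```javascript
// Call the string iterator directly; Array.from consumes the same iterator.
const iterator = 'ab'[Symbol.iterator]();
const first = iterator.next();  // { value: 'a', done: false }
const second = iterator.next(); // { value: 'b', done: false }
const third = iterator.next();  // { value: undefined, done: true }

// Array.from and spread syntax both rely on String.prototype[Symbol.iterator].
const viaArrayFrom = Array.from('ab');
const viaSpread = [...'ab'];

console.log(first, second, third);
console.log(viaArrayFrom, viaSpread);
```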
Looking up CreateStringIterator eventually takes you to 21.1.5.2.1 %StringIteratorPrototype%.next ( ), which does:
- Let cp be ! CodePointAt(s, position).
- Let resultString be the String value containing cp.[[CodeUnitCount]] consecutive code units from s beginning with the code unit at index position.
- Set O.[[StringNextIndex]] to position + cp.[[CodeUnitCount]].
- Return CreateIterResultObject(resultString, false).
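The steps above can be approximated in userland. The generator below is a rough sketch, not the engine's implementation: iterateByCodePoint is a hypothetical name, and it leans on the public codePointAt method to decide how many code units to consume.

```javascript
// Hypothetical helper that mimics the string iterator using public APIs only.
function* iterateByCodePoint(s) {
  let position = 0;
  while (position < s.length) {
    const cp = s.codePointAt(position);
    // Code points above 0xFFFF are stored as two UTF-16 code units.
    // Unpaired surrogates come back as values <= 0xFFFF, so they count as 1.
    const codeUnitCount = cp > 0xFFFF ? 2 : 1;
    // resultString: codeUnitCount consecutive code units starting at position.
    yield s.slice(position, position + codeUnitCount);
    position += codeUnitCount;
  }
}

// Sample input (an assumption for illustration): 'a', U+1F600, 'b'.
const sample = 'a\u{1F600}b';
const manual = [...iterateByCodePoint(sample)];
const builtIn = Array.from(sample);

console.log(manual.length, builtIn.length);
```

If the sketch is faithful, manual and builtIn should contain the same three strings even though sample.length is 4.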
The CodeUnitCount is what you're interested in. This number comes from CodePointAt:
- Let first be the code unit at index position within string.
- Let cp be the code point whose numeric value is that of first.
- If first is not a leading surrogate or trailing surrogate, then
  a. Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: false }.
- If first is a trailing surrogate or position + 1 = size, then
  a. Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: true }.
- Let second be the code unit at index position + 1 within string.
- If second is not a trailing surrogate, then
  a. Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: true }.
- Set cp to ! UTF16DecodeSurrogatePair(first, second).
- Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 2, [[IsUnpairedSurrogate]]: false }.
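As a reading aid, here is how those CodePointAt steps might look if transcribed into plain JavaScript. The function and helper names are hypothetical, the returned object uses ordinary property names instead of spec record fields, and size is taken to be the string's length.

```javascript
// Surrogate ranges: leading 0xD800-0xDBFF, trailing 0xDC00-0xDFFF.
const isLeadingSurrogate = (unit) => unit >= 0xD800 && unit <= 0xDBFF;
const isTrailingSurrogate = (unit) => unit >= 0xDC00 && unit <= 0xDFFF;

// Hypothetical transcription of the CodePointAt abstract operation.
function codePointAtSketch(string, position) {
  const size = string.length;
  const first = string.charCodeAt(position);
  let cp = first;
  if (!isLeadingSurrogate(first) && !isTrailingSurrogate(first)) {
    return { codePoint: cp, codeUnitCount: 1, isUnpairedSurrogate: false };
  }
  if (isTrailingSurrogate(first) || position + 1 === size) {
    return { codePoint: cp, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  const second = string.charCodeAt(position + 1);
  if (!isTrailingSurrogate(second)) {
    return { codePoint: cp, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  // Equivalent of UTF16DecodeSurrogatePair(first, second).
  cp = (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
  return { codePoint: cp, codeUnitCount: 2, isUnpairedSurrogate: false };
}

console.log(codePointAtSketch('a', 0));
console.log(codePointAtSketch('\uD83D\uDE00', 0));
```

Only the final branch produces a count of 2, which is why a single array element can span two code units.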
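So, when iterating over a string with Array.from, the iterator gets a CodeUnitCount of 2 (and so puts two code units into one array element) only when the code unit at the current position is a leading surrogate immediately followed by a trailing surrogate, that is, a surrogate pair. The code units that are treated as surrogates are described here: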
Such operations apply special treatment to every code unit with a numeric value in the inclusive range 0xD800 to 0xDBFF (defined by the Unicode Standard as a leading surrogate, or more formally as a high-surrogate code unit) and every code unit with a numeric value in the inclusive range 0xDC00 to 0xDFFF (defined as a trailing surrogate, or more formally as a low-surrogate code unit) using the following rules..:
षि is not a surrogate pair:
console.log('षि'.charCodeAt()); // First character code: 2359, or 0x937
console.log('षि'.charCodeAt(1)); // Second character code: 2367, or 0x93F
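To tie those code unit values back to Array.from, the sketch below compares the two-code-unit string above with a string that is a genuine surrogate pair. The emoji U+1F600 is my own choice of example, not something from the question.

```javascript
// U+0937 followed by U+093F: two code units, neither in a surrogate range.
const devanagari = '\u0937\u093F';
// U+1F600: one code point stored as the surrogate pair 0xD83D 0xDE00.
const emoji = '\u{1F600}';

const devanagariParts = Array.from(devanagari);
const emojiParts = Array.from(emoji);

console.log(devanagari.length, devanagariParts.length); // 2 2
console.log(emoji.length, emojiParts.length);           // 2 1
```

Both strings have a length of 2, but only the surrogate pair is kept together as a single array element.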
But