Order of string replacement function invocations when used with RTL languages

Question

When calling String.replace with a replacement function, we can retrieve the offsets of the matched substrings.

var a = [];
"hello world".replace(/l/g, function (m, i) { a.push(i); });
// a = [2, 3, 9]

In the example above, we're getting a list of offsets for the matching l characters.

Can I count on implementations to always invoke the replacement function in ascending order of occurrence, even when the string is in a language that is written from right to left?

That is: Can I be sure that the result above will always be [2,3,9] and not [3,9,2] or any other permutation of those offsets?

This is a follow-up on this question that Tomalak answered with:

Absolutely, yes. Matches are handled from left to right in the source string because left-to-right is how regular expression engines work their way through a string.

However, regarding the case with RTL languages he also said:

That's a good question [...] RTL text definitely changes how JavaScript regular expressions behave.

I've tested with the following RTL snippet in Chrome:

var a = [];
"بلوچی مکرانی".replace(/ی/g, function (m, i) { a.push(i); });
// a = [4, 11]

I don't speak that language, but looking at the rendered string I see the ی character as the first (leftmost) character of the string and as the first character after the white space. However, since the text is written right to left, those positions are actually the last character before the white space and the last character in the string, which translates into [4, 11].
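To make the logical (storage) order visible regardless of how the text is rendered, the string can be dumped index by index. This is only a sketch: the helper name dumpCodeUnits is made up for illustration, and the \u escapes are my assumed reading of the test string above, spelled out so that editor rendering cannot reorder anything.

```javascript
// Hypothetical helper: list every UTF-16 code unit of a string with its index.
function dumpCodeUnits(s) {
  var out = [];
  for (var i = 0; i < s.length; i++) {
    var hex = ("0000" + s.charCodeAt(i).toString(16).toUpperCase()).slice(-4);
    out.push(i + ":U+" + hex);
  }
  return out;
}

// Assumed spelling of the test string, written in logical (typing) order.
// U+06CC is ARABIC LETTER FARSI YEH, the character being searched for.
var units = dumpCodeUnits(
  "\u0628\u0644\u0648\u0686\u06CC \u0645\u06A9\u0631\u0627\u0646\u06CC"
);
// units[4] and units[11] are the two U+06CC entries.
```

Under that assumption, the character at index 0 is the one typed first (and drawn rightmost), so indices count along the stored sequence and not along the screen.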

So, this seems to work just as expected in Chrome. The question is: Can I trust that the result will be the same on all compliant JavaScript implementations?

Answer 1:

I have searched the ECMA-262 5.1 Edition (June 2011) for the keywords "format control", "right to left" and "RTL". None of them is mentioned, except where the specification says that format-control characters are allowed in string literals and regular expression literals.

From section 7.1

It is useful to allow format-control characters in source text to facilitate editing and display. All format control characters may be used within comments, and within string literals and regular expression literals.

Annex E

7.1: Unicode format control characters are no longer stripped from ECMAScript source text before processing. In Edition 5, if such a character appears in a StringLiteral or RegularExpressionLiteral the character will be incorporated into the literal where in Edition 3 the character would not be incorporated into the literal.

With this, I conclude that JavaScript doesn't operate any differently on Right-to-Left characters. It only knows about the UTF-16 code units stored in the string, and works based on the logical order.
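A small sketch that is consistent with this conclusion (it illustrates the behaviour, it does not prove it for every engine). The helper matchOffsets and the sample strings are invented for this example. A RIGHT-TO-LEFT MARK (U+200F), which is a format-control character, is prepended to show that it is counted as an ordinary code unit and simply shifts every offset by one.

```javascript
// Hypothetical helper: collect the offsets passed to the replacement callback,
// in the order in which the callback is invoked.
function matchOffsets(str, re) {
  var offsets = [];
  str.replace(re, function (m, i) {
    offsets.push(i);
    return m; // leave the text unchanged
  });
  return offsets;
}

// Made-up sample in logical order: BEH, FARSI YEH, space, MEEM, FARSI YEH.
var plain = "\u0628\u06CC \u0645\u06CC";
// Same text with a RIGHT-TO-LEFT MARK (format-control character) in front.
var marked = "\u200F" + plain;

var plainOffsets = matchOffsets(plain, /\u06CC/g);   // [1, 4]
var markedOffsets = matchOffsets(marked, /\u06CC/g); // [2, 5]
```

In both cases the offsets come back in ascending order of code-unit index, and the directional mark changes only the numbers, not the order.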

Source: https://stackoverflow.com/questions/27905376/order-of-string-replacement-function-invocations-when-used-with-rtl-languages

Tags

javascript

regex

right-to-left