In what JS engines, specifically, are toLowerCase & toUpperCase locale-sensitive?

Question

In the code of some libraries (e.g. AngularJS, the link leads to the specific lines in the code), I can see that custom case-conversion functions are used instead of the standard ones. It's justified by an assumption that in browsers with Turkish locale, the standard functions don't work as expected:

console.log("SCRIPT".toLowerCase()); // "scrıpt"
console.log("script".toUpperCase()); // "SCRİPT"

But is that really the case, or was it ever? Do browsers really behave this way? If so, which ones? What about node.js? Other JS engines?

The existence of the toLocaleLowerCase and toLocaleUpperCase methods implies that toLowerCase and toUpperCase are locale-invariant, doesn't it?

For what browsers, specifically, does the Angular team retain this check in the code: if ('i' !== 'I'.toLowerCase())...?

If your browser (device) uses the Turkish or Azerbaijani locale, please run this snippet and let me know if you discover that the issue indeed exists.

if ('i' !== 'I'.toLowerCase()) {
  document.write('Ooops! toLowerCase is locale-sensitive in your browser. ' +
    'Please write your user-agent in the comments to this question: ' +
    navigator.userAgent); 
} else {
  document.write('toLowerCase isn\'t locale-sensitive in your browser. ' +
    'Everything works as expected!');
}

<html lang="tr">

Answer 1:

Any JS implementation that follows the ECMA-262 5.1 standard has to implement String.prototype.toLocaleLowerCase and String.prototype.toLocaleUpperCase.

Per the standard, toLocaleLowerCase is supposed to convert a string to lowercase according to the locale-specific mapping.

toLowerCase, on the other hand, converts a string to lowercase as defined by the Unicode case mappings.

For most languages, toLocaleLowerCase and toLowerCase give the same result. But for certain languages, such as Turkish, the case mapping doesn't follow the default Unicode mapping, so toLowerCase and toLocaleLowerCase give different results.
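The difference can be sketched as follows. This assumes an engine with ECMA-402 support (current browsers, or a Node.js build with ICU), where the locale can be passed explicitly as an argument; that argument is not part of ECMA-262 5.1 itself.

```javascript
// Locale-independent mappings: the default Unicode case pairs.
const lowerDefault = "I".toLowerCase();            // "i" (U+0069)
const upperDefault = "i".toUpperCase();            // "I" (U+0049)

// Locale-specific mappings, with the Turkish locale passed explicitly
// (assumption: the engine supports the ECMA-402 locale argument).
const lowerTurkish = "I".toLocaleLowerCase("tr");  // "ı" (U+0131, dotless i)
const upperTurkish = "i".toLocaleUpperCase("tr");  // "İ" (U+0130, dotted I)

console.log(lowerDefault, upperDefault, lowerTurkish, upperTurkish);
```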

The library or framework you use (jQuery, Angular, Node, or anything else) makes no difference whatsoever. What matters is the JS implementation that runs your JS libraries.

For all practical purposes, it's accurate to conclude that Node, Angular, and any other JS libraries and frameworks behave exactly the same when dealing with strings (as long as they run on a JS engine that implements ECMA-262 edition 3 or later). That said, I'm sure many frameworks extend the String object to add more functionality, but the basic properties and functions defined by ECMA-262 5.1 always exist and WILL behave exactly the same.

To learn more: http://www.ecma-international.org/ecma-262/5.1/#sec-15.5.4.17

As far as browsers are concerned, all modern browsers implement the ECMA-262 5.1 standard in their JS engines. I'm not sure about Node, but from my limited exposure to it, I think it also uses a JS engine implemented per the ECMA-262 5.1 standard.

Answer 2:

Note: I couldn't test this!

As per ECMAScript specification:

String.prototype.toLowerCase ( )

[...]

For the purposes of this operation, the 16-bit code units of the Strings are treated as code points in the Unicode Basic Multilingual Plane. Surrogate code points are directly transferred from S to L without any mapping.

The result must be derived according to the case mappings in the Unicode character database (this explicitly includes not only the UnicodeData.txt file, but also the SpecialCasings.txt file that accompanies it in Unicode 2.1.8 and later).

[...]

String.prototype.toLocaleLowerCase ( )

This function works exactly the same as toLowerCase except that its result is intended to yield the correct result for the host environment’s current locale, rather than a locale-independent result. There will only be a difference in the few cases (such as Turkish) where the rules for that language conflict with the regular Unicode case mappings.

[...]
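The requirement that toLowerCase and toUpperCase also honor the special-casing data can be observed through mappings that change the string length. A small sketch, assuming a current engine; the two characters chosen here are well-known unconditional special-casing entries, picked for illustration:

```javascript
// U+0130 (capital I with dot above) lowercases to two code units:
// U+0069 (i) followed by U+0307 (combining dot above).
const dottedLower = "\u0130".toLowerCase();
console.log(dottedLower.length);              // 2
console.log(dottedLower === "\u0069\u0307");  // true

// U+00DF (sharp s) uppercases to the two-letter sequence "SS".
const sharpUpper = "\u00DF".toUpperCase();
console.log(sharpUpper);                      // "SS"
```

Neither result depends on the locale, because both come from the unconditional part of the data.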

And as per Unicode Character Database Special Casing:

[...]

Format

The entries in this file are in the following machine-readable format:

<code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
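To make the format concrete, here is a minimal sketch of a parser for one such line. The function name parseSpecialCasingLine and the shape of the returned object are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper: parse one data line in the format
// <code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
function parseSpecialCasingLine(line) {
  // Split off the trailing "# comment", if any.
  const hashIndex = line.indexOf("#");
  const data = hashIndex === -1 ? line : line.slice(0, hashIndex);
  const comment = hashIndex === -1 ? "" : line.slice(hashIndex + 1).trim();

  // Fields are separated by ";"; the last separator leaves an empty tail.
  const fields = data.split(";").map((f) => f.trim());
  if (fields.length < 5) return null; // blank or comment-only line

  // Turn a space-separated list of hex code points into a string.
  const toText = (hexList) =>
    hexList === ""
      ? ""
      : String.fromCodePoint(...hexList.split(/\s+/).map((h) => parseInt(h, 16)));

  const [code, lower, title, upper] = fields;
  // Six fields means the fifth one holds the condition list.
  const conditions =
    fields.length > 5 && fields[4] !== "" ? fields[4].split(/\s+/) : [];

  return {
    code: toText(code),
    lower: toText(lower),
    title: toText(title),
    upper: toText(upper),
    conditions,
    comment,
  };
}

const entry = parseSpecialCasingLine(
  "0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I"
);
console.log(entry.code, entry.lower, entry.conditions);
```

An empty mapping field, as in the rules that remove a combining mark, comes back as an empty string.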

Unconditional mappings

[...]

Preserve canonical equivalence for I with dot. Turkic is handled below.

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE

[...]

Language-Sensitive Mappings

These are characters whose full case mappings depend on language and perhaps also context (which characters come before or after). For more information see the header of this file and the Unicode Standard.

Lithuanian

Lithuanian retains the dot in a lowercase i when followed by accents.

Remove DOT ABOVE after "i" with upper or titlecase

0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE

Introduce an explicit dot above when lowercasing capital I's and J's whenever there are more accents above. (of the accents used in Lithuanian: grave, acute, tilde above, and ogonek)

0049; 0069 0307; 0049; 0049; lt More_Above; # LATIN CAPITAL LETTER I

004A; 006A 0307; 004A; 004A; lt More_Above; # LATIN CAPITAL LETTER J

012E; 012F 0307; 012E; 012E; lt More_Above; # LATIN CAPITAL LETTER I WITH OGONEK

00CC; 0069 0307 0300; 00CC; 00CC; lt; # LATIN CAPITAL LETTER I WITH GRAVE

00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE

0128; 0069 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE

Turkish and Azeri

I and i-dotless; I-dot and i are case pairs in Turkish and Azeri. The following rules handle those cases.

0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE

0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i. This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE

0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

When lowercasing, unless an I is before a dot_above, it turns into a dotless i.

0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I

0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I

When uppercasing, i turns into a dotted capital I

0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I

0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

Note: the following case is already in the UnicodeData.txt file.

0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I

EOF
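The Turkish rules quoted above can be exercised without changing any OS setting by passing the locale explicitly. This is a sketch that assumes an engine whose toLocaleLowerCase and toLocaleUpperCase accept a locale argument (ECMA-402) and implement these conditional rules:

```javascript
const tr = "tr";

// Not_Before_Dot: a plain capital I becomes dotless i (U+0131).
const plainI = "I".toLocaleLowerCase(tr);

// U+0130 maps to a single plain i in Turkish (no combining dot is added).
const dottedI = "\u0130".toLocaleLowerCase(tr);

// After_I: in the sequence I + U+0307, the I becomes i and the
// combining dot is removed, matching the precomposed case above.
const decomposed = "I\u0307".toLocaleLowerCase(tr);

// Uppercasing i in Turkish yields the dotted capital (U+0130).
const upperI = "i".toLocaleUpperCase(tr);

// Already in UnicodeData.txt: dotless i uppercases to plain I everywhere.
const upperDotless = "\u0131".toUpperCase();

console.log([plainI, dottedI, decomposed, upperI, upperDotless]);
```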

Also, as per JavaScript for Absolute Beginners (by Terry McNavage):

> "I".toLowerCase() // "i"
> "i".toUpperCase() // "I"
> "I".toLocaleLowerCase() // "<dotless-i>"
> "i".toLocaleUpperCase() // "<dotted-I>"
Note: toLocaleLowerCase() and toLocaleUpperCase() convert case based on your OS settings. You'd have to change those settings to Turkish for the previous sample to work. Or just take my word for it!

And as per bobince's comment on the question "Convert JavaScript String to be all lower case?":

Accept-Language and navigator.language are two completely separate settings. Accept-Language reflects the user's chosen preferences for what languages they want to receive in web pages (and this setting is unfortunately inaccessible to JS). navigator.language merely reflects which localisation of the web browser was installed, and should generally not be used for anything. Both of these values are unrelated to the system locale, which is the bit that decides what toLocaleLowerCase() will do; that's an OS-level setting out of scope of the browser's prefs.

So, setting lang="tr-TR" on the html element won't give a real test case, since it's an OS setting that's required to reproduce the special casing example.
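To see which locale an engine would actually use, one option is to ask the Intl API for its resolved default. This is a sketch assuming ECMA-402 support; how the default is derived (OS settings, browser settings, environment variables) is implementation-dependent.

```javascript
// Default locale the engine resolved for Intl operations.
const defaultLocale = new Intl.DateTimeFormat().resolvedOptions().locale;
console.log("engine default locale:", defaultLocale);

// With no argument, the result depends on that default locale...
const noArgument = "I".toLocaleLowerCase();
console.log("no-argument result code point:", noArgument.codePointAt(0).toString(16));

// ...whereas an explicit tag does not depend on the environment.
const explicitTurkish = "I".toLocaleLowerCase("tr");
console.log("explicit tr result code point:", explicitTurkish.codePointAt(0).toString(16));

// navigator is not available in every JS environment, so guard the access.
if (typeof navigator !== "undefined") {
  console.log("navigator.language:", navigator.language);
}
```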

I think that only lowercasing dotted-I or uppercasing dotless-i would be locale specific when using toLowerCase() or toUpperCase().

As per those credible/official sources, I think you're right: 'i' !== 'I'.toLowerCase() would always evaluate to false.

But, as I said, I couldn't test it here.
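One way to check this on a particular engine is to compare toLowerCase against a hand-written, ASCII-only conversion. A minimal sketch; asciiLowerCase is a hypothetical helper written for this comparison, not taken from any library:

```javascript
// Hypothetical reference: lowercase A-Z only, using code unit arithmetic,
// so no Unicode or locale data is involved.
function asciiLowerCase(s) {
  return s.replace(/[A-Z]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) | 32)
  );
}

// Collect every ASCII capital for which the built-in method disagrees.
const capitals = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const mismatches = [...capitals].filter(
  (ch) => ch.toLowerCase() !== asciiLowerCase(ch)
);

console.log(
  mismatches.length === 0
    ? "toLowerCase matches the ASCII mapping for A-Z on this engine"
    : "toLowerCase differs for: " + mismatches.join("")
);
```

An empty mismatch list on a machine configured for Turkish would support the conclusion that the built-in method ignores the locale.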

Source: https://stackoverflow.com/questions/28792027/in-what-js-engines-specifically-are-tolowercase-touppercase-locale-sensitive

Tags

javascript