How to avoid the browser's Unicode normalization when submitting a form with Unicode text

鱼传尺愫 2021-02-07 10:12

When rendering the following Unicode text in HTML, it turns out that the browser (Google Chrome) does some form of Unicode normalization when posting the data back to the server.

3 answers
  •  再見小時候
    2021-02-07 10:43

    This seems to be a feature/bug in WebKit browsers (Chrome, Safari): they normalize form data to NFC, which means, among other things, reordering consecutive combining marks into a “canonical” order. This was new to me, and it is bad news in cases like this. The worst thing is that different browsers behave differently.

    Using a simplified version of your test case http://blog.hibernatingrhinos.com/12449/would-it-be-possible-to-have-a-web-browser-based-editor-for-an-hebrew-text (using a server-side script that just echoes the raw data), I noticed that Chrome and Safari reorder the diacritic marks in U+05E9 U+05C1 U+05B5 (SHIN, SHIN DOT, TSERE), whereas IE, Firefox, and Opera do not.

    I also ran a simple test with the Latin letter e followed by a combining diaeresis, U+0308. WebKit browsers convert it to the single character ë, as per NFC rules, whereas other browsers keep the character pair intact.
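
    To see what NFC does to these two test strings without involving a browser, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `codepoints` is mine, and this only reproduces the normalization itself, not WebKit's form-submission code.

    ```python
    import unicodedata

    def codepoints(s: str) -> str:
        """Render a string as space-separated U+XXXX code points."""
        return " ".join(f"U+{ord(ch):04X}" for ch in s)

    latin = "e\u0308"               # e + COMBINING DIAERESIS
    hebrew = "\u05e9\u05c1\u05b5"   # SHIN, SHIN DOT, TSERE

    latin_nfc = unicodedata.normalize("NFC", latin)
    hebrew_nfc = unicodedata.normalize("NFC", hebrew)

    print(codepoints(latin), "->", codepoints(latin_nfc))
    print(codepoints(hebrew), "->", codepoints(hebrew_nfc))

    # Canonical combining classes: marks with a higher class sort later.
    print([unicodedata.combining(ch) for ch in hebrew])
    ```

    The Latin pair composes to U+00EB. The Hebrew marks swap places because SHIN DOT has combining class 24 and TSERE has class 15, so canonical ordering puts TSERE first.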

    This seems to have been an intentional feature ever since 2006; https://bugs.webkit.org/show_bug.cgi?id=8769 proudly announces this as part of a bug fix! This might explain the status of the W3C policy document; its current version is WebKit-minded on this issue, but other browser vendors either aren’t interested or knowingly oppose the idea of “early normalization.”

    I don’t think there is a way to prevent this. But you could warn users against using Chrome and Safari. You could even use a hidden field containing a simple problem case, then check server-side whether it was transmitted as-is, and tell the user to switch browsers if it wasn’t.
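
    The hidden-field check could look roughly like this. It is a framework-neutral sketch: the field name `nfc_canary` and both function names are made up for illustration, and how you read the submitted value depends on your server stack.

    ```python
    import html

    # Hypothetical canary: the SHIN, SHIN DOT, TSERE sequence, which NFC reorders.
    CANARY = "\u05e9\u05c1\u05b5"

    def canary_field(name: str = "nfc_canary") -> str:
        """Return a hidden <input> carrying the canary (field name is an assumption)."""
        return (
            f'<input type="hidden" name="{html.escape(name)}" '
            f'value="{html.escape(CANARY)}">'
        )

    def browser_normalized(submitted: str) -> bool:
        """True if the canary did not come back exactly as it was sent."""
        return submitted != CANARY

    # An untouched value passes; the NFC-reordered value is flagged.
    assert not browser_normalized("\u05e9\u05c1\u05b5")
    assert browser_normalized("\u05e9\u05b5\u05c1")
    ```

    Compare the raw decoded value; if your framework normalizes input on its own, the check would have to run before that step.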

    Fixing the order server-side isn’t simple, because common normalization routines apparently do not support the order needed. You could normalize to fully decomposed form (NFD), then reorder combining marks using your own code for the purpose. Perhaps simpler and safer, you could just run an ad hoc replacement routine that replaces sequences of combining marks with other sequences. This would be safer because it would not affect characters other than those you want to affect, whereas NFD decomposes Latin letters with diacritics, among other things.
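
    A sketch of the ad hoc replacement idea, assuming your data always wants SHIN DOT before TSERE. The table is hypothetical and lists only the one sequence from my test; you would have to extend it for whatever mark combinations your texts actually use.

    ```python
    import unicodedata

    # Hypothetical table: NFC-ordered sequence -> the order the application wants.
    REPLACEMENTS = {
        "\u05b5\u05c1": "\u05c1\u05b5",  # TSERE, SHIN DOT -> SHIN DOT, TSERE
    }

    def restore_mark_order(text: str) -> str:
        """Undo the browser's reordering for the listed mark sequences only."""
        for normalized, preferred in REPLACEMENTS.items():
            text = text.replace(normalized, preferred)
        return text

    received = "\u05e9\u05b5\u05c1 \u00eb"   # what a normalizing browser would send
    restored = restore_mark_order(received)
    assert restored == "\u05e9\u05c1\u05b5 \u00eb"   # marks swapped back, ë untouched

    # By contrast, NFD would also split the precomposed Latin letter.
    assert unicodedata.normalize("NFD", "\u00eb") == "e\u0308"
    ```

    Because the table is applied blindly, it would also rewrite text that was deliberately entered in NFC order, so it only fits data where the preferred order is known in advance.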

    According to Unicode principles, canonically equivalent strings (e.g., differing only in the order of consecutive diacritic marks) are different representations of the same data but distinct as sequences of Unicode characters (code points); they are not expected to differ in presentation, but they may, and often do. Generally, you should not expect programs to treat canonically equivalent strings as different, though some programs do distinguish them. See the Unicode Normalization FAQ.
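
    In code, “canonically equivalent” means the two strings compare unequal as code point sequences but become identical once both are normalized to the same form. A small sketch (the function name is mine):

    ```python
    import unicodedata

    def canonically_equivalent(a: str, b: str) -> bool:
        """Compare two strings after bringing both to NFD."""
        return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

    original = "\u05e9\u05c1\u05b5"    # SHIN, SHIN DOT, TSERE
    reordered = "\u05e9\u05b5\u05c1"   # SHIN, TSERE, SHIN DOT

    assert original != reordered                        # distinct sequences
    assert canonically_equivalent(original, reordered)  # same data per Unicode
    ```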

    The FAQ entry claims that the problems of biblical Hebrew have been solved by the introduction of COMBINING GRAPHEME JOINER. Although it prevents the reordering in Chrome, it’s a clumsy method, and it may mess up rendering (it does in web browsers; diacritic marks may get badly misplaced).
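
    For completeness, this is why the COMBINING GRAPHEME JOINER (U+034F) blocks the reordering: its combining class is 0, so the marks on either side of it are no longer part of one reorderable run. The sketch below only checks the normalization behavior; `strip_cgj` is a hypothetical helper for removing the joiner again on the server, and none of this addresses the rendering problems.

    ```python
    import unicodedata

    CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

    plain = "\u05e9\u05c1\u05b5"                # SHIN, SHIN DOT, TSERE
    guarded = "\u05e9\u05c1" + CGJ + "\u05b5"   # CGJ between the two marks

    assert unicodedata.combining(CGJ) == 0
    assert unicodedata.normalize("NFC", plain) != plain      # marks get reordered
    assert unicodedata.normalize("NFC", guarded) == guarded  # order survives

    def strip_cgj(text: str) -> str:
        """Remove every CGJ, e.g. after the form data has reached the server."""
        return text.replace(CGJ, "")

    assert strip_cgj(guarded) == plain
    ```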
