How to test an application for correct encoding (e.g. UTF-8)

孤独总比滥情好 2021-02-05 09:16

Encoding issues are among the topics that have bitten me most often during development. Every platform insists on its own encoding, and some non-UTF-8 defaults are most likely in play.

5 Answers
  • 2021-02-05 09:49

    Thank you, fliptitle!

    I, too, am trying to lay out a proper test plan to make sure that an application supports Unicode strings throughout the system.

    I am bilingual, but in two languages that only use ISO-8859-1. Therefore, I have been struggling to determine what is a "real-life," "meaningful" way to test the full range of Unicode possibilities.

    I just came across this:

    • International Testing Basics - Testing non-English and non-ASCII support

    Follow-Up Post:

    After devising some tests for my application, I realized that I had put together a small list of encoded values that might be helpful to others.

    I am using the following international strings in my test:

    (NOTE: here comes some UTF-8 encoded text... hopefully you can see this in your browser)

    ユーザー別サイト
    简体中文
    크로스 플랫폼으로
    מדורים מבוקשים
    أفضل البحوث
    Σὲ γνωρίζω ἀπὸ
    Десятую Международную
    แผ่นดินฮั่นเสื่อมโทรมแสนสังเวช
    ∮ E⋅da = Q, n → ∞, ∑ f(i) = ∏ g(i)
    français langue étrangère
    mañana olé

    (End of UTF-8 foreign/non-English text)

    However, at various points during testing I realized that it was not enough to know how the strings should look when rendered in their respective scripts. I also needed the exact Unicode code point numbers, and the correct hexadecimal byte values for these strings in at least two encodings (UCS-2/UTF-16 and UTF-8).

    Here are the equivalent code points and hex values:

    str = L"\u30E6\u30FC\u30B6\u30FC\u5225\u30B5\u30A4\u30C8"; // JAPAN 
    // Little endian UTF-16/UCS-2: e6 30 fc 30 b6 30 fc 30 25 52 b5 30 a4 30 c8 30 00 00
    // Hex of UTF-8: e3 83 a6 e3 83 bc e3 82 b6 e3 83 bc e5 88 a5 e3 82 b5 e3 82 a4 e3 83 88 00 
    
    str = L"\u7B80\u4F53\u4E2D\u6587"; // CHINA 
    // Little endian UTF-16/UCS-2: 80 7b 53 4f 2d 4e 87 65 00 00 
    // Hex of UTF-8: e7 ae 80 e4 bd 93 e4 b8 ad e6 96 87 00
    
    str = L"\uD06C\uB85C\uC2A4 \uD50C\uB7AB\uD3FC\uC73C\uB85C"; // KOREA 
    // Little endian UTF-16/UCS-2: 6c d0 5c b8 a4 c2 20 00 0c d5 ab b7 fc d3 3c c7 5c b8 00 00
    // Hex of UTF-8: ed 81 ac eb a1 9c ec 8a a4 20 ed 94 8c eb 9e ab ed 8f bc ec 9c bc eb a1 9c 00 
    
    str = L"\u05DE\u05D3\u05D5\u05E8\u05D9\u05DD \u05DE\u05D1\u05D5\u05E7\u05E9\u05D9\u05DD"; // ISRAEL 
    // Little endian UTF-16/UCS-2: de 05 d3 05 d5 05 e8 05 d9 05 dd 05 20 00 de 05 d1 05 d5 05 e7 05 e9 05 d9 05 dd 05 00 00
    // Hex of UTF-8: d7 9e d7 93 d7 95 d7 a8 d7 99 d7 9d 20 d7 9e d7 91 d7 95 d7 a7 d7 a9 d7 99 d7 9d 00
    
    str = L"\u0623\u0641\u0636\u0644 \u0627\u0644\u0628\u062D\u0648\u062B"; // EGYPT 
    // Little endian UTF-16/UCS-2: 23 06 41 06 36 06 44 06 20 00 27 06 44 06 28 06 2d 06 48 06 2b 06 00 00
    // Hex of UTF-8: d8 a3 d9 81 d8 b6 d9 84 20 d8 a7 d9 84 d8 a8 d8 ad d9 88 d8 ab 00 
    
    str = L"\u03A3\u1F72 \u03B3\u03BD\u03C9\u03C1\u03AF\u03B6\u03C9 \u1F00\u03C0\u1F78"; // GREECE 
    // Little endian UTF-16/UCS-2: a3 03 72 1f 20 00 b3 03 bd 03 c9 03 c1 03 af 03 b6 03 c9 03 20 00 00 1f c0 03 78 1f 00 00
    // Hex of UTF-8: ce a3 e1 bd b2 20 ce b3 ce bd cf 89 cf 81 ce af ce b6 cf 89 20 e1 bc 80 cf 80 e1 bd b8 00 
    
    str = L"\u0414\u0435\u0441\u044F\u0442\u0443\u044E \u041C\u0435\u0436\u0434\u0443\u043D\u0430\u0440\u043E\u0434\u043D\u0443\u044E"; // RUSSIA 
    // Little endian UTF-16/UCS-2: 14 04 35 04 41 04 4f 04 42 04 43 04 4e 04 20 00 1c 04 35 04 36 04 34 04 43 04 3d 04 30 04 40 04 3e 04 34 04 3d 04 43 04 4e 04 00 00
    // Hex of UTF-8: d0 94 d0 b5 d1 81 d1 8f d1 82 d1 83 d1 8e 20 d0 9c d0 b5 d0 b6 d0 b4 d1 83 d0 bd d0 b0 d1 80 d0 be d0 b4 d0 bd d1 83 d1 8e 00
    
    str = L"\u0E41\u0E1C\u0E48\u0E19\u0E14\u0E34\u0E19\u0E2E\u0E31\u0E48\u0E19\u0E40\u0E2A\u0E37\u0E48\u0E2D\u0E21\u0E42\u0E17\u0E23\u0E21\u0E41\u0E2A\u0E19\u0E2A\u0E31\u0E07\u0E40\u0E27\u0E0A"; // THAILAND
    // Little endian UTF-16/UCS-2: 41 0e 1c 0e 48 0e 19 0e 14 0e 34 0e 19 0e 2e 0e 31 0e 48 0e 19 0e 40 0e 2a 0e 37 0e 48 0e 2d 0e 21 0e 42 0e 17 0e 23 0e 21 0e 41 0e 2a 0e 19 0e 2a 0e 31 0e 07 0e 40 0e 27 0e 0a 0e 00 00
    // Hex of UTF-8: e0 b9 81 e0 b8 9c e0 b9 88 e0 b8 99 e0 b8 94 e0 b8 b4 e0 b8 99 e0 b8 ae e0 b8 b1 e0 b9 88 e0 b8 99 e0 b9 80 e0 b8 aa e0 b8 b7 e0 b9 88 e0 b8 ad e0 b8 a1 e0 b9 82 e0 b8 97 e0 b8 a3 e0 b8 a1 e0 b9 81 e0 b8 aa e0 b8 99 e0 b8 aa e0 b8 b1 e0 b8 87 e0 b9 80 e0 b8 a7 e0 b8 8a 00
    
    str = L"\u222E E\u22C5da = Q,  n \u2192 \u221E, \u2211 f(i) = \u220F g(i)"; // MATHEMATICS 
    // Little endian UTF-16/UCS-2: 2e 22 20 00 45 00 c5 22 64 00 61 00 20 00 3d 00 20 00 51 00 2c 00 20 00 20 00 6e 00 20 00 92 21 20 00 1e 22 2c 00 20 00 11 22 20 00 66 00 28 00 69 00 29 00 20 00 3d 00 20 00 0f 22 20 00 67 00 28 00 69 00 29 00 00 00
    // Hex of UTF-8: e2 88 ae 20 45 e2 8b 85 64 61 20 3d 20 51 2c 20 20 6e 20 e2 86 92 20 e2 88 9e 2c 20 e2 88 91 20 66 28 69 29 20 3d 20 e2 88 8f 20 67 28 69 29 00 
    
    str = L"fran\u00E7ais langue \u00E9trang\u00E8re"; // FRANCE
    // Little endian UTF-16/UCS-2: 66 00 72 00 61 00 6e 00 e7 00 61 00 69 00 73 00 20 00 6c 00 61 00 6e 00 67 00 75 00 65 00 20 00 e9 00 74 00 72 00 61 00 6e 00 67 00 e8 00 72 00 65 00 00 00
    // Hex of UTF-8: 66 72 61 6e c3 a7 61 69 73 20 6c 61 6e 67 75 65 20 c3 a9 74 72 61 6e 67 c3 a8 72 65 00
    
    str = L"ma\u00F1ana ol\u00E9"; // SPAIN
    // Little endian UTF-16/UCS-2: 6d 00 61 00 f1 00 61 00 6e 00 61 00 20 00 6f 00 6c 00 e9 00 00 00
    // Hex of UTF-8: 6d 61 c3 b1 61 6e 61 20 6f 6c c3 a9 00
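
    If you need to regenerate these dumps or extend the table with new test strings, a short sketch (Python 3.8+, standard library only; the trailing 00s are the C wide-string terminators, appended manually here):

    def dump(label: str, s: str) -> None:
        # NUL-terminate to match the C wide-string dumps above.
        utf16 = (s + "\x00").encode("utf-16-le")   # little-endian UTF-16/UCS-2
        utf8 = (s + "\x00").encode("utf-8")
        print(f"// {label}")
        print("// Little endian UTF-16/UCS-2:", utf16.hex(" "))
        print("// Hex of UTF-8:", utf8.hex(" "))

    dump("JAPAN", "\u30E6\u30FC\u30B6\u30FC\u5225\u30B5\u30A4\u30C8")
    dump("SPAIN", "ma\u00F1ana ol\u00E9")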
    

    Also, here are a couple of images that show some common "mis-renderings" that can happen in various editors, even though the underlying bytes are well-formed UTF-8. If you see any of these renderings, it probably means that you correctly produced a UTF-8 string, but that your editor/viewer is interpreting the bytes under some encoding other than UTF-8.

    Sample Renderings Num. 1

    Sample Renderings Num. 2
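
    You can reproduce the effect deliberately; a quick sketch in Python that decodes well-formed UTF-8 bytes as Windows-1252 (a common wrong guess by editors):

    s = "ユーザー別サイト"                        # the Japanese test string above
    wrong = s.encode("utf-8").decode("cp1252")  # valid bytes, wrong decoder
    print(wrong)  # ãƒ¦ãƒ¼ã‚¶ãƒ¼... -- the classic mojibake look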

  • 2021-02-05 09:51

    There is a regular expression to test if a string is valid UTF-8:

    $field =~
      m/\A(
         [\x09\x0A\x0D\x20-\x7E]            # ASCII
       | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
       |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
       |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
       | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
      )*\z/x;
    

    But this doesn’t ensure that the text actually is UTF-8.

    An example: the letter ö (U+00F6) is encoded in UTF-8 as the byte sequence 0xC3 0xB6. So when you get 0xC3 0xB6 as input, you can say that it is valid UTF-8, but you cannot say for certain that the letter ö was submitted. Imagine that the input was not UTF-8 but ISO 8859-1: there, the same sequence represents the two characters Ã (0xC3) and ¶ (0xB6) respectively. So 0xC3 0xB6 can represent either ö (in UTF-8) or Ã¶ (in ISO 8859-1), although the latter is rather unusual.

    So in the end it’s only guessing.
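
    In practice you don’t have to hand-roll this check; any strict UTF-8 decoder enforces the same structure as the regex above. A minimal sketch in Python (the regex in this answer is Perl, but the idea is language-independent):

    def is_valid_utf8(data: bytes) -> bool:
        try:
            data.decode("utf-8")  # strict mode rejects malformed sequences
            return True
        except UnicodeDecodeError:
            return False

    raw = b"\xc3\xb6"                  # the ambiguous example from above
    print(is_valid_utf8(raw))          # True
    print(raw.decode("utf-8"))         # ö   (if the sender used UTF-8)
    print(raw.decode("iso-8859-1"))    # Ã¶  (if the sender used ISO 8859-1)
    print(is_valid_utf8(b"\xc3\x28"))  # False: continuation byte expected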

  • 2021-02-05 09:52

    In PHP we use the mb_ functions, such as mb_detect_encoding() and mb_convert_encoding(). They aren't perfect, but they get us 99.9% of the way there. Then we have a few regular expressions to strip out funky characters that somehow make their way in at times.

    If you are going international, you definitely want to use UTF-8. We have yet to find the perfect solution for getting all of our data into UTF-8, and I'm not sure one exists. You just have to keep tinkering with it.
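
    For comparison, a minimal sketch of the same detect-and-convert flow in Python, assuming the input is either UTF-8 or Latin-1 (real detection, like mb_detect_encoding(), is heuristic and can guess wrong):

    def to_utf8(data: bytes) -> bytes:
        """Best-effort normalization to UTF-8."""
        try:
            data.decode("utf-8")   # already valid UTF-8? keep it as-is
            return data
        except UnicodeDecodeError:
            # Fallback guess: Latin-1. Every byte string decodes as
            # Latin-1, so this never fails -- it may just be wrong.
            return data.decode("iso-8859-1").encode("utf-8")

    print(to_utf8(b"ol\xe9"))      # b'ol\xc3\xa9' (Latin-1 input converted)
    print(to_utf8(b"ol\xc3\xa9"))  # b'ol\xc3\xa9' (UTF-8 input unchanged)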

  • 2021-02-05 10:01

    The real troublemaker with character encoding is quite often that there are multiple encoding-related bugs, and that some incorrect behavior was introduced to compensate for other bugs. I have lost count of how many times I have seen this happen.

    The goal, as always, is to handle encoding correctly in every single place. So most of the time simple unit tests do the trick; they don't even need very exotic character sets. I find a whole lot of bugs just by testing with our national character "ø", because it is encoded differently in UTF-8 than in most other character sets.

    The aggregate works fine when all the pieces do it correctly. I know this sounds trivial, but when it comes to character set issues it has always worked for me ;)
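
    A sketch of that kind of round-trip unit test (Python's unittest; save_and_load() is a hypothetical stand-in for whatever layer you are exercising):

    import unittest

    def save_and_load(text: str) -> str:
        """Hypothetical stand-in for the layer under test
        (database write/read, file I/O, an HTTP round-trip, ...)."""
        return text.encode("utf-8").decode("utf-8")

    class EncodingRoundTripTest(unittest.TestCase):
        def test_o_slash_survives_round_trip(self):
            # 'ø' (U+00F8) is one byte (0xF8) in Latin-1 but two bytes
            # (0xC3 0xB8) in UTF-8, so it exposes layers that silently
            # assume the wrong encoding.
            self.assertEqual(save_and_load("ø"), "ø")

    if __name__ == "__main__":
        unittest.main()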

  • 2021-02-05 10:08

    Localization is pretty tough.

    I think you are really asking two questions. One of them, how do you get everybody to work correctly on an i18n application, is not technical but a project management issue, in my opinion. If you want people to use a common standard like UTF-8, then you will simply have to enforce it. Tools will help, but people will first need to be told to do so.

    Besides saying that UTF-8 is, in my opinion, the way to go, it is hard to answer the question about tools. It really depends on the kind of project you are doing. If, for example, it is a Java project, then it is a simple matter of configuring the IDE to encode files in UTF-8, and of making sure your UTF-8 localizations are in external resource files.

    One thing you can certainly do is write unit tests that check compliance. If your localized messages/labels are in resource files, it is fairly easy to check that they are properly UTF-8 encoded, I think.
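
    For instance, a small compliance check in Python (assuming, hypothetically, that the localized resources are .properties files under a locales/ directory):

    from pathlib import Path

    RESOURCE_DIR = Path("locales")  # hypothetical resource location

    def non_utf8_resources() -> list[Path]:
        """Return every resource file that fails strict UTF-8 decoding."""
        bad = []
        for path in RESOURCE_DIR.rglob("*.properties"):
            try:
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError:
                bad.append(path)
        return bad

    if __name__ == "__main__":
        offenders = non_utf8_resources()
        assert not offenders, f"Not valid UTF-8: {offenders}"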
