How do browsers determine the encoding used?


情书的邮戳

I understand there are two ways to set the encoding:

By using the Content-Type header.
By using meta tags in HTML.

Since Content-Type header is

3 Answers

忘掉有多难

2021-02-06 16:24

I've run into problems with the output encoding of HTML. If you are building a website or web service with, e.g., Node.js or Go and you're not sure, just add a charset to the Content-Type header. For example, in Go: resp.Header.Set("Content-Type", "text/html; charset=GB18030")
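The same idea can be sketched in Python as a bare WSGI application. This is only an illustration: the function name `app` and the page content are made up here, and the important part is that the bytes sent are actually encoded in the charset the header announces.

```python
def app(environ, start_response):
    """Hypothetical WSGI app that declares its charset in the Content-Type header."""
    html = "<!doctype html><title>demo</title><p>你好，世界</p>"
    # Encode the body with the same charset that the header declares.
    body = html.encode("gb18030")
    headers = [
        ("Content-Type", "text/html; charset=GB18030"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, one option is the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`.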

耶瑟儿～

2021-02-06 16:26

It is set in the <head> like this:

<meta charset="UTF-8">

I think that if this is not set in the head, the browser will fall back to a default encoding.
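To give a rough idea of how a meta declaration can be found before the page is parsed, here is a simplified Python sketch. It is not the real browser algorithm: the regular expression and the helper name `sniff_meta_charset` are assumptions for illustration. The 1024-byte limit mirrors the HTML requirement that the declaration appear within the first 1024 bytes of the document.

```python
import re

# Simplified pattern: matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)

def sniff_meta_charset(html_bytes, limit=1024):
    """Return the lowercased charset label found in the first `limit` bytes, or None."""
    match = META_CHARSET.search(html_bytes[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

When this returns None, the caller would have to fall back to something else, such as the HTTP header, a guess, or a default.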

我寻月下人不归

2021-02-06 16:48

They can guess it based on heuristics.

I don't know how good browsers are today at encoding detection, but MS Word did a very good job at it and recognizes even charsets I'd never heard of before. You can just open a *.txt file with a random encoding and see.

This algorithm usually involves statistical analysis of byte patterns, like frequency distribution of trigraphs of various languages encoded in each code page that will be detected; such statistical analysis can also be used to perform language detection.

https://en.wikipedia.org/wiki/Charset_detection
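As a toy illustration of the statistical idea, the Python sketch below scores each candidate encoding by how many decoded characters are common letters of the language usually written in it. Everything here is an assumption made for the example: the two candidate code pages, the tiny letter profiles, and the name `guess_encoding`. Real detectors use much richer models such as n-gram frequencies.

```python
# Hypothetical profiles: a few very common letters per candidate code page.
PROFILES = {
    "cp1252": set("etaoinshrdlu"),   # Western European (English-like text)
    "cp1251": set("оеаинтсрвл"),     # Cyrillic (Russian-like text)
}

def guess_encoding(data, profiles=PROFILES):
    """Return the candidate whose decoding contains the most 'common' letters."""
    best_name, best_score = None, -1
    for name, common in profiles.items():
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # bytes are not valid in this code page at all
        score = sum(1 for ch in text.lower() if ch in common)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

With so little evidence per input, short or unusual texts can easily be misclassified, which is the weakness the rest of this answer describes.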

Firefox uses the Mozilla Charset Detectors. The way it works is explained here, and you can also change its heuristic preferences.

Chrome previously used the ICU detector but switched to CED (Compact Encoding Detection) almost 2 years ago.

None of the detection algorithms is perfect. They can guess incorrectly, because it's just guessing anyway!

This process is not foolproof because it depends on statistical data.

That's how the famous "Bush hid the facts" bug occurred. Bad guessing also introduces a vulnerability into the system:

For all those skeptics out there, there is a very good reason why the character encoding should be explicitly stated. When the browser isn't told what the character encoding of a text is, it has to guess: and sometimes the guess is wrong. Hackers can manipulate this guess in order to slip XSS past filters and then fool the browser into executing it as active code. A great example of this is the Google UTF-7 exploit.

http://htmlpurifier.org/docs/enduser-utf8.html#fixcharset-none
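A small Python sketch shows why a wrong guess is dangerous. The payload and the naive filter check are made up for illustration. The bytes contain no `<` character at all, yet they become a script tag if the consumer decodes them as UTF-7.

```python
# "+ADw-" and "+AD4-" are the UTF-7 encodings of "<" and ">".
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# A naive filter that only looks for a literal "<" sees nothing suspicious.
looks_safe = b"<" not in payload

# If the consumer guesses UTF-7, the markup appears after decoding.
decoded = payload.decode("utf-7")
print(looks_safe, decoded)
```

Declaring the charset explicitly removes the guess, so the same bytes stay harmless text.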

As a result, the encoding should always be explicitly stated.
