What's different between UTF-8 and UTF-8 without a BOM? Which is better?
This question already has a million-and-one answers and many of them are quite good, but I wanted to try and clarify when a BOM should or should not be used.
As mentioned, any use of the UTF-8 BOM (Byte Order Mark) to determine whether a string is UTF-8 is educated guesswork. If proper metadata is available (like charset="utf-8"), then you already know what you're supposed to be using, but otherwise you'll need to test and make some assumptions. This involves checking whether the file a string comes from begins with the hexadecimal byte sequence EF BB BF.
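In Python, for instance, such a check might be sketched like this (the helper name is mine, not any standard API; codecs.BOM_UTF8 is the byte sequence EF BB BF):

```python
# Minimal sketch: sniff the first three bytes of a file for the UTF-8 BOM.
import codecs

def starts_with_utf8_bom(path):
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8  # b"\xef\xbb\xbf"
```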
If a byte sequence corresponding to the UTF-8 BOM is found, the probability is high enough to assume the text is UTF-8, and you can go from there. When forced to make this guess, however, additional error checking while reading is still a good idea in case something comes up garbled. You should only assume the BOM bytes are not UTF-8 (i.e., that they are latin-1 or ANSI characters) if the input definitely shouldn't be UTF-8 based on its source. If there is no BOM, you can simply determine whether the text is supposed to be UTF-8 by validating it against the encoding.
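A sketch of that validation step, assuming the raw bytes are already in hand (the function name is illustrative):

```python
def looks_like_utf8(data: bytes) -> bool:
    # Valid UTF-8 is statistically unlikely to occur by accident, so a
    # clean decode is strong evidence the data really is UTF-8.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False  # fall back to latin-1, a legacy code page, etc.
```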
If you're unable to record the metadata in any other way (through a charset tag or file-system metadata), and the programs consuming the files expect BOMs, you should encode with a BOM. This is especially true on Windows, where anything without a BOM is generally assumed to use a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.
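In Python, for example, the "utf-8-sig" codec takes care of this: it writes the BOM on output and strips it transparently on input. A minimal sketch, with an illustrative file name:

```python
# Write a file that BOM-expecting Windows programs will recognise as Unicode.
with open("report.txt", "w", encoding="utf-8-sig") as f:
    f.write("some text, stored as UTF-8 with a leading BOM")

# Reading with the same codec strips the BOM, so no stray U+FEFF appears.
with open("report.txt", encoding="utf-8-sig") as f:
    content = f.read()
```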
When it comes down to it, the only files I ever really have problems with are CSVs. Depending on the program, the file either must or must not have a BOM. For example, if you're using Excel 2007+ on Windows, the file must be encoded with a BOM if you want to open it smoothly rather than resort to importing the data.
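A sketch of producing such a CSV in Python (the data is made up; newline="" is the csv module's documented requirement, and "utf-8-sig" prepends the BOM that Excel looks for):

```python
import csv

with open("data.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "city"])
    writer.writerow(["Αλέξης", "Αθήνα"])  # opens correctly in Excel 2007+
```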
There are at least three problems with putting a BOM in UTF-8 encoded files:
Files that hold no text are no longer empty, because they always contain the BOM.
Files that hold only ASCII text are no longer themselves ASCII, because the BOM is not ASCII, which breaks tools that expect pure ASCII input.
Files cannot simply be concatenated, because each one then carries a BOM at its start (see the sketch below).
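The third problem is easy to demonstrate; a minimal sketch in Python:

```python
# Concatenating two BOM-prefixed "files" leaves a stray U+FEFF (a
# zero-width no-break space) embedded in the middle of the text.
part1 = "\ufeffHello, ".encode("utf-8")
part2 = "\ufeffworld".encode("utf-8")
joined = (part1 + part2).decode("utf-8")
print(repr(joined))  # '\ufeffHello, \ufeffworld' -- the second BOM is invisible garbage
```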
And, as others have mentioned, having a BOM is neither sufficient nor necessary for detecting that something is UTF-8: it is not sufficient, because an arbitrary byte sequence can happen to start with the three bytes that make up the BOM; and it is not necessary, because you can simply try to decode the bytes as UTF-8 and, if that succeeds, conclude the text almost certainly is UTF-8.
The Unicode Byte Order Mark (BOM) FAQ provides a concise answer:
Q: How should I deal with BOMs?
A: Here are some guidelines to follow:
A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
Some protocols allow optional BOMs in the case of untagged text. In those cases:
Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
From http://en.wikipedia.org/wiki/Byte-order_mark:
The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.
Always using a BOM in your files ensures that they open correctly in any editor that supports UTF-8 and the BOM.
My real problem with the absence of a BOM is the following. Suppose we've got a file that contains:
abc
Without a BOM, most editors open this as ANSI. So another user of the file opens it and appends some native characters, for example:
abc-αβγ
Oops... Now the file is still in ANSI, and guess what: "αβγ" occupies only three bytes, not the six it would take in UTF-8. This is not UTF-8, and it causes other problems later in the development chain.
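The byte counts are easy to verify; a quick sketch in Python, assuming Windows-1253 (the Greek "ANSI" code page):

```python
s = "αβγ"
print(len(s.encode("utf-8")))   # 6 -- two bytes per Greek letter
print(len(s.encode("cp1253")))  # 3 -- what an "ANSI" editor writes instead
```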
UTF-8 with a BOM is better identified. I reached this conclusion the hard way. I am working on a project where one of the deliverables is a CSV file containing Unicode characters.
If the CSV file is saved without a BOM, Excel thinks it's ANSI and shows gibberish. Once you add "EF BB BF" at the front (for example, by re-saving it from Notepad as UTF-8, or from Notepad++ as UTF-8 with BOM), Excel opens it fine.
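The same fix can be scripted instead of done by hand; a sketch that prepends the BOM bytes to an existing UTF-8 file (the file name is illustrative):

```python
import codecs

with open("data.csv", "rb") as f:
    data = f.read()
if not data.startswith(codecs.BOM_UTF8):  # avoid doubling the BOM
    with open("data.csv", "wb") as f:
        f.write(codecs.BOM_UTF8 + data)   # EF BB BF + original bytes
```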
Prepending the BOM character to Unicode text files is discussed in RFC 3629: "UTF-8, a transformation format of ISO 10646", November 2003, at http://tools.ietf.org/html/rfc3629 (this last reference found at http://www.herongyang.com/Unicode/Notepad-Byte-Order-Mark-BOM-FEFF-EFBBBF.html).
I look at this from a different perspective. I think UTF-8 with a BOM is better, as it provides more information about the file. I use UTF-8 without a BOM only if I run into problems.
I have been using multiple languages (even Cyrillic) on my pages for a long time, and when files are saved without a BOM and I re-open them for editing in an editor (as cherouvim also noted), some characters become corrupted.
Note that Windows' classic Notepad automatically saves files with a BOM when you try to save a newly created file with UTF-8 encoding.
I personally save server-side scripting files (.asp, .ini, .aspx) with a BOM, and .html files without a BOM.