What's the difference between UTF-8 and UTF-8 without BOM?

佛祖请我去吃肉 2020-11-21 05:45

What's the difference between UTF-8 and UTF-8 without a BOM? Which is better?

21 answers
  • 2020-11-21 06:23

    UTF-8 with a BOM is better if you use UTF-8 in HTML files and mix Serbian Cyrillic, Serbian Latin, German, Hungarian, or some other less common language on the same page.

    That is my opinion (after 30 years in the computing and IT industry).

  • 2020-11-21 06:24

    Quoted at the bottom of the Wikipedia page on BOM: http://en.wikipedia.org/wiki/Byte-order_mark#cite_note-2

    "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"

  • 2020-11-21 06:24

    As mentioned above, UTF-8 with BOM may cause problems with non-BOM-aware (or compatible) software. I once edited HTML files encoded as UTF-8 + BOM with the Mozilla-based KompoZer, as a client required that WYSIWYG program.

    Invariably the layout would get destroyed when saving. It took me some time to fiddle my way around this. These files then worked well in Firefox, but showed a CSS quirk in Internet Explorer that destroyed the layout again. After fiddling with the linked CSS files for hours to no avail, I discovered that Internet Explorer didn't like the BOMfed HTML file. Never again.

    Also, I just found this in Wikipedia:

    The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[15] for this reason and for wider interoperability and philosophical concerns

  • 2020-11-21 06:27

    What's different between UTF-8 and UTF-8 without BOM?

    Short answer: In UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file.

    Long answer:

    Originally, it was expected that Unicode would be encoded in UTF-16/UCS-2. The BOM was designed for this encoding form. When you have 2-byte code units, it's necessary to indicate which order those two bytes are in, and a common convention for doing this is to include the character U+FEFF as a "Byte Order Mark" at the beginning of the data. The character U+FFFE is permanently unassigned so that its presence can be used to detect the wrong byte order.

    UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.
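
    To make the byte sequences concrete, here is a minimal Python sketch (the variable names are just for illustration). It encodes U+FEFF in both UTF-16 byte orders and in UTF-8, and uses Python's "utf-8-sig" codec, which writes the signature on encode and strips it on decode:

```python
import codecs

# U+FEFF is the code point used as the byte order mark.
bom = "\ufeff"

# In UTF-16 the two byte orders give two different byte sequences.
utf16_be = bom.encode("utf-16-be")  # bytes FE FF
utf16_le = bom.encode("utf-16-le")  # bytes FF FE

# In UTF-8 there is only one possible sequence.
utf8 = bom.encode("utf-8")          # bytes EF BB BF

print(utf16_be.hex(), utf16_le.hex(), utf8.hex())

# "utf-8-sig" prepends the signature when encoding...
with_bom = "hello".encode("utf-8-sig")
without_bom = "hello".encode("utf-8")

# ...and removes it when decoding; plain "utf-8" keeps it as U+FEFF.
decoded_sig = with_bom.decode("utf-8-sig")
decoded_plain = with_bom.decode("utf-8")
```

    So "UTF-8 with BOM" and "UTF-8 without BOM" differ only in those three leading bytes; the rest of the data is identical.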

    Which is better?

    Without. As Martin Cote answered, the Unicode standard does not recommend it. It causes problems with non-BOM-aware software.

    A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about what byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.
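
    As a rough sketch of such a validity check in Python (the function name looks_like_utf8 is made up for this example), you can simply attempt a strict decode. Note that pure ASCII data also passes, since ASCII is a subset of UTF-8:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a valid UTF-8 byte sequence."""
    try:
        # Decoding is strict by default and raises on any invalid sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "héllo".encode("utf-8")      # é becomes the two bytes C3 A9
latin1_bytes = "héllo".encode("latin-1")  # é becomes the single byte E9

print(looks_like_utf8(utf8_bytes))
print(looks_like_utf8(latin1_bytes))
```

    The Latin-1 bytes fail because E9 announces a multi-byte sequence, but the byte that follows it is not a continuation byte.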

  • 2020-11-21 06:27

    BOM tends to boom (no pun intended (sic)) somewhere, someplace. And when it booms (for example, when it doesn't get recognized by browsers, editors, etc.), it shows up as the weird characters ï»¿ at the start of the document (for example, an HTML file, JSON response, RSS feed, etc.) and causes embarrassments like the recent encoding issue experienced during Obama's talk on Twitter.

    It's very annoying when it shows up in places that are hard to debug or when testing is neglected. So it's best to avoid it unless you must use it.
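
    One common way those stray characters appear, assuming the consumer decodes the bytes as Windows-1252 (or Latin-1) instead of UTF-8, can be reproduced with a short Python sketch:

```python
import codecs

# The three BOM bytes EF BB BF, misread one byte at a time as Windows-1252.
mojibake = codecs.BOM_UTF8.decode("cp1252")

print(mojibake)
```

    Each of the three bytes maps to its own visible character, which is why the debris is three characters long.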

  • 2020-11-21 06:27

    One practical difference is that if you write a shell script for Mac OS X and save it as plain UTF-8 (which here means with a BOM), you will get the response:

    #!/bin/bash: No such file or directory
    

    in response to the shebang line specifying which shell you wish to use:

    #!/bin/bash
    

    If you save as "UTF-8, no BOM" (say, in BBEdit), all will be well.
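
    If you are stuck with a script that already has the BOM, a small Python sketch like the following can detect and strip it (strip_utf8_bom is a hypothetical helper name, and the script contents are made up for the example):

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


script_with_bom = codecs.BOM_UTF8 + b"#!/bin/bash\necho hello\n"

# With the BOM in front, the file no longer starts with the bytes 0x23 0x21.
starts_with_shebang_before = script_with_bom.startswith(b"#!")

cleaned = strip_utf8_bom(script_with_bom)
starts_with_shebang_after = cleaned.startswith(b"#!")

print(starts_with_shebang_before, starts_with_shebang_after)
```

    To fix a real file, you would read it in binary mode, pass the bytes through the helper, and write the result back.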
