What's the difference between UTF-8 and UTF-8 without BOM?

佛祖请我去吃肉 2020-11-21 05:45

What's the difference between UTF-8 and UTF-8 without a BOM? Which is better?

21 Answers
  • 2020-11-21 06:01

    When you only want to display information encoded in UTF-8, you may not face any problems. For example, declare an HTML document as UTF-8 and your browser will display everything contained in the body of the document.

    But this is not the case with text, CSV and XML files, on either Windows or Linux.

    For example, a plain text file on Windows or Linux, one of the simplest things imaginable, is (usually) not UTF-8.

    Save it as XML and declare it as UTF-8:

    <?xml version="1.0" encoding="UTF-8"?>
    

    It will not display (it will not be read) correctly, even though it's declared as UTF-8.

    I had a string of data containing French letters that needed to be saved as XML for syndication. Unless I created the file as UTF-8 from the very beginning (by changing the options in the IDE and using "Create New File") or added the BOM at the beginning of the file

    $file="\xEF\xBB\xBF".$string;
    

    I was not able to save the French letters in an XML file.

  • 2020-11-21 06:05

    The other excellent answers have already established that:

    • There is no official difference between UTF-8 and BOM-ed UTF-8.
    • A BOM-ed UTF-8 string will start with the following three bytes: EF BB BF.
    • Those bytes, if present, must be ignored when extracting the string from the file/stream.

    But, as additional information to this, the BOM for UTF-8 could be a good way to "smell" if a string was encoded in UTF-8... Or it could be a legitimate string in any other encoding...

    For example, the data [EF BB BF 41 42 43] could either be:

    • The legitimate ISO-8859-1 string "ï»¿ABC"
    • The legitimate UTF-8 string "ABC"

    So while it can be cool to recognize the encoding of a file's content by looking at the first bytes, you should not rely on this, as shown by the example above.
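
The ambiguity can be checked directly. A small Python sketch using the same six bytes (the variable names are just for illustration):

```python
data = bytes([0xEF, 0xBB, 0xBF, 0x41, 0x42, 0x43])

# Read as ISO-8859-1, every byte is a legitimate character.
as_latin1 = data.decode("iso-8859-1")

# Read as plain UTF-8, the first three bytes decode to U+FEFF, which stays in the string.
as_utf8 = data.decode("utf-8")

# "utf-8-sig" treats a leading EF BB BF as a signature and drops it.
as_utf8_sig = data.decode("utf-8-sig")

print(repr(as_latin1))    # 'ï»¿ABC'
print(repr(as_utf8))      # '\ufeffABC'
print(repr(as_utf8_sig))  # 'ABC'
```

All three decodes succeed, so the bytes alone cannot tell you which reading the author intended.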

    Encodings should be known, not divined.

  • 2020-11-21 06:05

    Here are examples of BOM usage that cause real problems, yet many people don't know about them.

    BOM breaks scripts

    Shell scripts, Perl scripts, Python scripts, Ruby scripts, Node.js scripts or any other executable that needs to be run by an interpreter all start with a shebang line, which looks like one of these:

    #!/bin/sh
    #!/usr/bin/python
    #!/usr/local/bin/perl
    #!/usr/bin/env node
    

    It tells the system which interpreter needs to be run when invoking such a script. If the script is encoded in UTF-8, one may be tempted to include a BOM at the beginning. But actually the "#!" characters are not just characters. They are in fact a magic number that happens to be composed out of two ASCII characters. If you put something (like a BOM) before those characters, then the file will look like it had a different magic number and that can lead to problems.

    See Wikipedia, article: Shebang, section: Magic number:

    The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 and 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[14] for this reason and for wider interoperability and philosophical concerns. Additionally, a byte order mark is not necessary in UTF-8, as that encoding does not have endianness issues; it serves only to identify the encoding as UTF-8. [emphasis added]
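
A minimal Python sketch of why the BOM hides the magic number. The `has_shebang` helper is hypothetical and only mimics a check on the first two bytes; it is not how any particular kernel implements `exec`.

```python
import codecs

script = "#!/bin/sh\necho hello\n"

def has_shebang(data: bytes) -> bool:
    """Return True if the file content starts with the bytes 0x23 0x21 ("#!")."""
    return data[:2] == b"#!"

plain = script.encode("utf-8")         # no BOM
with_bom = script.encode("utf-8-sig")  # EF BB BF prepended

print(has_shebang(plain))     # True
print(has_shebang(with_bom))  # False: the first two bytes are EF BB
print(with_bom[:5])           # BOM followed by "#!"
```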

    BOM is illegal in JSON

    See RFC 7159, Section 8.1:

    Implementations MUST NOT add a byte order mark to the beginning of a JSON text.

    BOM is redundant in JSON

    Not only is it illegal in JSON, it is also not needed to determine the character encoding, because there are more reliable ways to unambiguously determine both the character encoding and the endianness used in any JSON stream (see this answer for details).

    BOM breaks JSON parsers

    Not only is it illegal in JSON and not needed, it actually breaks all software that determines the encoding using the method presented in RFC 4627:

    Determining the encoding and endianness of JSON, examining the first four bytes for the NUL byte:

    00 00 00 xx - UTF-32BE
    00 xx 00 xx - UTF-16BE
    xx 00 00 00 - UTF-32LE
    xx 00 xx 00 - UTF-16LE
    xx xx xx xx - UTF-8
    

    Now, if the file starts with BOM it will look like this:

    00 00 FE FF - UTF-32BE
    FE FF 00 xx - UTF-16BE
    FF FE 00 00 - UTF-32LE
    FF FE xx 00 - UTF-16LE
    EF BB BF xx - UTF-8
    

    Note that:

    1. UTF-32BE doesn't start with three NULs, so it won't be recognized
    2. In UTF-32LE the first byte is not followed by three NULs, so it won't be recognized
    3. UTF-16BE has only one NUL in the first four bytes, so it won't be recognized
    4. UTF-16LE has only one NUL in the first four bytes, so it won't be recognized

    Depending on the implementation, all of those may be interpreted incorrectly as UTF-8 and then misinterpreted or rejected as invalid UTF-8, or not recognized at all.
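
The two tables above can be reproduced with a short Python sketch. The `sniff_rfc4627` function and the sample JSON text are assumptions made for illustration; the function is a strict reading of the NUL-pattern table that returns `None` for anything that matches no row, which is only one of the possible behaviours mentioned above.

```python
import codecs

# NUL pattern of the first four bytes -> encoding, as in the RFC 4627 table.
PATTERNS = {
    (True, True, True, False): "UTF-32BE",
    (True, False, True, False): "UTF-16BE",
    (False, True, True, True): "UTF-32LE",
    (False, True, False, True): "UTF-16LE",
    (False, False, False, False): "UTF-8",
}

def sniff_rfc4627(data: bytes):
    """Guess the encoding from which of the first four bytes are NUL."""
    if len(data) < 4:
        return None
    return PATTERNS.get(tuple(b == 0 for b in data[:4]))

text = '{"a":1}'
cases = {
    "UTF-32BE": ("utf-32-be", codecs.BOM_UTF32_BE),
    "UTF-16BE": ("utf-16-be", codecs.BOM_UTF16_BE),
    "UTF-32LE": ("utf-32-le", codecs.BOM_UTF32_LE),
    "UTF-16LE": ("utf-16-le", codecs.BOM_UTF16_LE),
    "UTF-8": ("utf-8", codecs.BOM_UTF8),
}

without_bom = {}
with_bom = {}
for name, (codec, bom) in cases.items():
    encoded = text.encode(codec)  # these codecs do not add a BOM themselves
    without_bom[name] = sniff_rfc4627(encoded)
    with_bom[name] = sniff_rfc4627(bom + encoded)

print(without_bom)  # every encoding is recognized
print(with_bom)     # only the UTF-8 row still matches
```

With this strict sketch the four UTF-16 and UTF-32 inputs go unrecognized once a BOM is prepended. A more lenient implementation that falls back to UTF-8 would instead misread them.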

    Additionally, if the implementation tests for valid JSON as I recommend, it will reject even the input that is indeed encoded as UTF-8, because it doesn't start with an ASCII character < 128 as it should according to the RFC.
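
As one concrete data point, CPython's standard `json` module rejects a decoded string that starts with the BOM character. This sketch only shows that module's behaviour on `str` input; other parsers may be more lenient, and the payload is a made-up example.

```python
import json

payload = '{"ok": true}'
payload_with_bom = "\ufeff" + payload  # what a BOM looks like after decoding as UTF-8

parsed = json.loads(payload)

try:
    json.loads(payload_with_bom)
    rejected = False
except json.JSONDecodeError:
    rejected = True

# The first character of valid JSON text is ASCII; the BOM is not.
first_is_ascii = ord(payload_with_bom[0]) < 128

print(parsed)          # {'ok': True}
print(rejected)        # True
print(first_is_ascii)  # False
```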

    Other data formats

    A BOM in JSON is not needed, is illegal, and breaks software that works correctly according to the RFC. It should be a no-brainer to just not use it, and yet there are always people who insist on breaking JSON by using BOMs, comments, different quoting rules or different data types. Of course, anyone is free to use things like BOMs or anything else if they need them; just don't call it JSON then.

    For data formats other than JSON, take a look at what the data really looks like. If the only encodings are UTF-* and the first character must be an ASCII character lower than 128, then you already have all the information needed to determine both the encoding and the endianness of your data. Adding BOMs, even as an optional feature, would only make it more complicated and error-prone.

    Other uses of BOM

    As for the uses outside of JSON or scripts, I think there are already very good answers here. I wanted to add more detailed info specifically about scripting and serialization, because it is an example of BOM characters causing real problems.

  • 2020-11-21 06:09

    UTF-8 with BOM only helps if the file actually contains some non-ASCII characters. If it is included and there aren't any, then it may break older applications that would otherwise have interpreted the file as plain ASCII. These applications will definitely fail when they come across a non-ASCII character, so in my opinion the BOM should only be added when the file can, and should, no longer be interpreted as plain ASCII.

    I want to make it clear that I prefer to not have the BOM at all. Add it in if some old rubbish breaks without it, and replacing that legacy application is not feasible.

    Don't make anything expect a BOM for UTF-8.
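
One way to follow that advice is to tolerate a BOM on input but never require or emit one. A minimal Python sketch, where `read_text` and `write_text` are hypothetical helper names:

```python
import codecs

def read_text(data: bytes) -> str:
    """Decode UTF-8, dropping a leading BOM if there is one."""
    # "utf-8-sig" accepts input both with and without the signature.
    return data.decode("utf-8-sig")

def write_text(text: str) -> bytes:
    """Encode as UTF-8 without a BOM."""
    return text.encode("utf-8")

raw = "café".encode("utf-8")

print(read_text(raw))                    # café
print(read_text(codecs.BOM_UTF8 + raw))  # café
print(write_text("café"))                # no EF BB BF prefix
```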

  • 2020-11-21 06:11

    Here is my experience with Visual Studio, Sourcetree and Bitbucket pull requests, which has been giving me some problems:

    It turns out that files saved as UTF-8 with a signature (BOM) show a red dot character on each file when reviewing a pull request (it can be quite annoying).

    If you hover over it, it shows a character like "ufeff" (the BOM, U+FEFF). Sourcetree, however, does not show these byte marks, so they will most likely end up in your pull requests. That should be OK, because that is how Visual Studio 2017 encodes new files now, so maybe Bitbucket should ignore the character or show it in another way. More info here:

    Red dot marker BitBucket diff view

  • 2020-11-21 06:14

    The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8.

    Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
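
A short Python sketch of the difference: UTF-16 has two byte orders for the same character, while UTF-8 has exactly one byte sequence, so its "BOM" is just U+FEFF encoded as three fixed bytes. The euro sign is an arbitrary sample character.

```python
import codecs

ch = "\u20ac"  # EURO SIGN, chosen only as an example

utf16_le = ch.encode("utf-16-le")  # low byte first
utf16_be = ch.encode("utf-16-be")  # high byte first
utf8 = ch.encode("utf-8")          # a single, order-independent sequence

# The BOM is the character U+FEFF; each encoding serializes it differently.
bom_utf16_le = "\ufeff".encode("utf-16-le")
bom_utf16_be = "\ufeff".encode("utf-16-be")
bom_utf8 = "\ufeff".encode("utf-8")

print(utf16_le.hex(), utf16_be.hex(), utf8.hex())
print(bom_utf16_le.hex(), bom_utf16_be.hex(), bom_utf8.hex())
```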

    According to the Unicode standard, the BOM for UTF-8 files is not recommended:

    2.6 Encoding Schemes

    ... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.
