What's the difference between UTF-8 and UTF-8 without BOM?

佛祖请我去吃肉 2020-11-21 05:45

What's the difference between UTF-8 and UTF-8 without a BOM? Which is better?

21 answers
  •  無奈伤痛
    2020-11-21 06:27

    What's the difference between UTF-8 and UTF-8 without a BOM?

    Short answer: In UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file. "UTF-8 without BOM" is the same encoding with those three bytes omitted.
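
    To make the short answer concrete, here is a minimal Python sketch. It assumes Python 3.8 or later and uses the standard `utf-8` and `utf-8-sig` codecs; the sample string is arbitrary.

```python
import codecs

text = "hello"

# Plain UTF-8: no signature bytes are written.
without_bom = text.encode("utf-8")

# "utf-8-sig" is Python's name for UTF-8 with a leading BOM.
with_bom = text.encode("utf-8-sig")

print(without_bom.hex(" "))  # 68 65 6c 6c 6f
print(with_bom.hex(" "))     # ef bb bf 68 65 6c 6c 6f

# The only difference is the three-byte prefix EF BB BF.
assert with_bom == codecs.BOM_UTF8 + without_bom
```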

    Long answer:

    Originally, Unicode was expected to be encoded in UTF-16/UCS-2, and the BOM was designed for that encoding form. With 2-byte code units, a reader needs to know which order the two bytes are in, and a common convention is to put the character U+FEFF at the beginning of the data as a "byte order mark". The byte-swapped value, U+FFFE, is a noncharacter that will never be assigned, so finding it where the BOM should be tells the reader it is using the wrong byte order.
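
    The detection rule described above can be sketched in a few lines of Python. The function name `utf16_byte_order` is made up for this illustration, and the sketch only looks at the first two bytes, so it does not try to tell UTF-16 from other encodings.

```python
def utf16_byte_order(data: bytes):
    """Return 'big', 'little', or None based on a leading UTF-16 BOM."""
    if len(data) < 2:
        return None
    # Read the first code unit as if it were big-endian.
    first_unit = int.from_bytes(data[:2], "big")
    if first_unit == 0xFEFF:
        return "big"     # bytes FE FF: U+FEFF read in the right order
    if first_unit == 0xFFFE:
        return "little"  # bytes FF FE: looks like U+FFFE, so the order is swapped
    return None          # no BOM present


print(utf16_byte_order("\ufeffhi".encode("utf-16-be")))  # big
print(utf16_byte_order("\ufeffhi".encode("utf-16-le")))  # little
print(utf16_byte_order("hi".encode("utf-16-be")))        # None
```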

    UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.

    Which is better?

    Without. As Martin Cote answered, the Unicode standard does not recommend it. It causes problems with non-BOM-aware software.
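
    As one illustration of the kind of problem meant here, consider a hypothetical reader that decodes a shell script and checks for a `#!` line. The sample bytes are invented for this sketch; the codec names are Python's standard ones.

```python
data = b"\xef\xbb\xbf#!/bin/sh\n"  # a script saved with a UTF-8 BOM

# A decoder that does not know about the BOM keeps it as U+FEFF.
naive = data.decode("utf-8")
print(naive.startswith("#!"))  # False
print(repr(naive[0]))          # '\ufeff'

# "utf-8-sig" strips a leading BOM if present and otherwise acts like "utf-8".
aware = data.decode("utf-8-sig")
print(aware.startswith("#!"))  # True
```

    Decoding with `utf-8-sig` is a reasonable default when reading files of unknown origin, since it accepts input with or without the BOM.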

    A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about what byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.
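
    A minimal sketch of such a validity check in Python, relying on the strict error handling of the built-in `utf-8` decoder. The function name `looks_like_utf8` is only for illustration.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("naïve café".encode("utf-8")))    # True
print(looks_like_utf8("naïve café".encode("latin-1")))  # False
print(looks_like_utf8(b"plain ASCII"))                  # True
```

    Pure ASCII also passes the check, which is harmless because ASCII is a subset of UTF-8.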
