C++: How to inspect a file's Byte Order Mark to determine whether it is UTF-8?

Submitted by 半腔热情 on 2019-12-05 07:57:43
Ian Clelland

In general, you can't.

The presence of a Byte Order Mark is a very strong indication that the file you are reading is Unicode. If you are expecting a text file, and the first four bytes you receive are:

0x00, 0x00, 0xfe, 0xff -- The file is almost certainly UTF-32BE
0xff, 0xfe, 0x00, 0x00 -- The file is almost certainly UTF-32LE
0xfe, 0xff, XX, XX -- The file is almost certainly UTF-16BE
0xff, 0xfe, XX, XX (but not 00, 00) -- The file is almost certainly UTF-16LE
0xef, 0xbb, 0xbf, XX -- The file is almost certainly UTF-8 with a BOM

But what about anything else? If the bytes you get are anything other than one of these five patterns, then you can't say for certain that your file is or is not UTF-8.
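The five patterns above can be checked mechanically. The following is a minimal sketch, not a definitive implementation: DetectBom is a hypothetical name, and it assumes the caller has already read the first few bytes of the file into memory. It tests the UTF-32 patterns before the UTF-16 ones, because the UTF-32LE BOM starts with the same two bytes as the UTF-16LE BOM.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: names the encoding implied by a leading BOM,
// or returns "unknown" when none of the five patterns matches.
std::string DetectBom(const unsigned char* data, std::size_t size)
{
    // 4-byte patterns first: FF FE 00 00 (UTF-32LE) would otherwise be
    // mistaken for FF FE (UTF-16LE).
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF) {
        return "UTF-32BE";
    }
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00) {
        return "UTF-32LE";
    }
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return "UTF-8";
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return "UTF-16BE";
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return "UTF-16LE";
    }
    return "unknown";  // no BOM; the file may still be UTF-8
}
```

A result of "unknown" only means that no BOM was found; it says nothing about whether the content is UTF-8.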

In fact, any text document containing only ASCII characters from 0x00 to 0x7f is a valid UTF-8 document, as well as being a plain ASCII document.

There are heuristics that can try to infer, based on the particular characters that are seen, whether a document is encoded in, say, ISO-8859-1, or UTF-8, or CP1252, but in general, the first two, three, or four bytes of a file are not enough to say whether what you are looking at is definitely UTF-8.
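One common heuristic of this kind is to check whether the whole content is well-formed UTF-8: text in ISO-8859-1 or CP1252 that contains non-ASCII characters usually fails such a check, while pure ASCII always passes. The sketch below is only an illustration of that idea; IsValidUtf8 is a hypothetical name, and passing the check means the bytes could be UTF-8, not that they definitely are.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if the bytes form well-formed UTF-8
// (no stray or missing continuation bytes, no overlong forms, no
// surrogates, nothing above U+10FFFF).
bool IsValidUtf8(const unsigned char* s, std::size_t n)
{
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = s[i];
        if (lead < 0x80) {            // plain ASCII byte
            ++i;
            continue;
        }

        std::size_t extra = 0;        // continuation bytes expected
        unsigned long min_cp = 0;     // smallest code point for this length
        unsigned long cp = 0;         // decoded code point
        if ((lead & 0xE0) == 0xC0)      { extra = 1; min_cp = 0x80;    cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; min_cp = 0x800;   cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; min_cp = 0x10000; cp = lead & 0x07; }
        else return false;            // stray continuation or invalid lead byte

        if (n - i <= extra) return false;   // sequence is cut off

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = s[i + k];
            if ((cont & 0xC0) != 0x80) return false;  // not 10xxxxxx
            cp = (cp << 6) | (cont & 0x3F);
        }

        if (cp < min_cp) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        i += extra + 1;
    }
    return true;
}
```

For example, the ISO-8859-1 bytes for "é a" (0xE9, 0x20, 0x61) are rejected because 0xE9 announces a three-byte sequence that the following bytes do not continue, whereas the UTF-8 bytes for "é" (0xC3, 0xA9) are accepted.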

John

The UTF-8 BOM is the byte sequence 0xEF, 0xBB, 0xBF.

Its ordering doesn't depend on endianness.

How you read the file with C++ is up to you. Personally I still use C-style file functions because they are provided by the library I am coding with, and I can be sure to open the file in binary mode and avoid unintended translations down the line.
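For reference, the C-style route mentioned above might look like the sketch below. It is only an illustration under assumptions: HasUtf8Bom is a hypothetical name, and the function treats a file that cannot be opened the same as a file without a BOM.

```cpp
#include <cstdio>

// Hypothetical helper: returns true if the file at `path` starts with the
// UTF-8 BOM bytes EF BB BF. Returns false if the file cannot be opened
// or is shorter than three bytes.
bool HasUtf8Bom(const char* path)
{
    // "rb" opens in binary mode, so no newline translation takes place.
    std::FILE* f = std::fopen(path, "rb");
    if (!f) {
        return false;
    }

    // unsigned char avoids sign problems when comparing against 0xEF etc.
    unsigned char bom[3];
    const std::size_t got = std::fread(bom, 1, sizeof bom, f);
    std::fclose(f);

    return got == 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF;
}
```

Reading into an unsigned char buffer keeps the comparisons against 0xEF, 0xBB and 0xBF correct whether or not plain char is signed.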

adapted from cs.vt.edu

#include <fstream>
...
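char buffer[100] = {};
std::ifstream myFile("data.bin", std::ios::in | std::ios::binary);
myFile.read(buffer, 3);
if (!myFile) {
    // An error occurred (for example, the file is shorter than 3 bytes).
    // myFile.gcount() returns the number of bytes read.
    // Calling myFile.clear() will reset the stream state
    // so it is usable again.
}
// Check the BOM now, before the buffer is reused below. The casts make
// the comparison work even where plain char is signed.
if (static_cast<unsigned char>(buffer[0]) == 0xEF &&
    static_cast<unsigned char>(buffer[1]) == 0xBB &&
    static_cast<unsigned char>(buffer[2]) == 0xBF) {
    // Congrats, UTF-8 (with a BOM)
}
...
if (!myFile.read(buffer, 100)) {
    // Same effect as above
}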

Alternatively, many formats assume UTF-8 by default when no other BOM (UTF-16 or UTF-32, for example) is present.

wiki for BOM

unicode.org.faq

user2622198
if (buffer[0] == '\xEF' && buffer[1] == '\xBB' && buffer[2] == '\xBF') {
    // UTF-8
}

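It's better to use buffer[0] == '\xEF' instead of buffer[0] == 0xEF in order to avoid signed/unsigned char problems: where plain char is signed, buffer[0] is promoted to a negative int and can never equal 0xEF (239). See "How do I represent negative char values in hexadecimal?"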
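The difference can be shown with a small sketch. The function names are hypothetical, and the first two functions deliberately take signed char so that the behaviour is the same on every platform; plain char behaves this way wherever it is signed.

```cpp
// Mirrors what `c == 0xEF` does: c is promoted to int first.
// For the byte 0xEF stored in a signed char, that int is -17, not 239.
bool NaiveIsEF(signed char c)
{
    const int promoted = c;
    return promoted == 0xEF;
}

// Converting to unsigned char first gives 239, which equals 0xEF.
bool CastIsEF(signed char c)
{
    return static_cast<unsigned char>(c) == 0xEF;
}

// Comparing plain char against a char literal works either way,
// because both sides have the same type and signedness.
bool LiteralIsEF(char c)
{
    return c == '\xEF';
}
```

With the byte 0xEF held in a signed char (value -17), NaiveIsEF returns false while CastIsEF returns true; LiteralIsEF returns true for a plain char holding '\xEF' regardless of whether char is signed.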

This is my version in C++:

#include <fstream>

/* Reads a leading UTF-8 BOM from the file stream if it exists.
 * Returns true iff a BOM was found; in that case all three BOM bytes
 * have been consumed. Otherwise the stream is rewound to where it was. */
bool ReadBOM(std::ifstream & is)
{
  /* Remember the starting position so we can rewind. Putting back more
   * than one character is not guaranteed to work, so seek instead. */
  std::streampos const start = is.tellg();

  /* Read the three candidate bytes in one go. */
  char bom[3];
  is.read(bom, 3);
  if (is.gcount() == 3 &&
      bom[0] == '\xEF' && bom[1] == '\xBB' && bom[2] == '\xBF') {
    return true; // This file contains a BOM for UTF-8.
  }

  /* No BOM: clear a possible EOF/fail state and rewind. */
  is.clear();
  is.seekg(start);
  return false;
}