How do you identify the file content as being in ASCII or binary using C++?
My text editor decides based on the presence of null bytes. In practice, that works really well: a binary file with no null bytes is extremely rare.
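A minimal sketch of that null-byte heuristic, assuming the whole file fits in memory. The names `readFileBinary` and `containsNullByte` are made up for illustration; they are not from any editor's source.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: reads the whole file in binary mode so that no
// newline translation happens. Returns an empty string if the file
// cannot be opened.
std::string readFileBinary(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: true if the buffer holds at least one null byte,
// which this heuristic treats as "binary".
bool containsNullByte(const std::string& data) {
    return data.find('\0') != std::string::npos;
}
```

Usage would be something like `containsNullByte(readFileBinary("some.file"))`. For large files you would probably only inspect the first few kilobytes rather than reading everything.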
If a file contains only the decimal bytes 9–13, 32–126, it's probably a pure ASCII text file. Otherwise, it's not. However, it may still be text in another encoding.
If, in addition to the above bytes, the file contains only the decimal bytes 128–255, it's probably a text file in an 8-bit or variable-length ASCII-based encoding such as ISO-8859-1, UTF-8 or ASCII+Big5. If not, for some purposes you may be able to stop here and consider the file to be binary. However, it may still be text in a 16- or 32-bit encoding.
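The two byte-range checks above could be sketched roughly like this. `ByteClass` and `classifyBytes` are names invented for this example, and the function only looks at byte values; it does not validate that high bytes form legal UTF-8 or Big5 sequences.

```cpp
#include <string>

// Hypothetical result type for the two byte-range checks.
enum class ByteClass {
    PureAscii,       // only bytes 9-13 and 32-126
    AsciiBased8Bit,  // the above plus at least one byte in 128-255
    Other            // anything else (other control bytes, DEL, ...)
};

ByteClass classifyBytes(const std::string& data) {
    bool sawHighByte = false;
    for (unsigned char b : data) {
        const bool asciiText = (b >= 9 && b <= 13) || (b >= 32 && b <= 126);
        if (asciiText) {
            continue;
        }
        if (b >= 128) {
            sawHighByte = true;  // allowed in 8-bit / variable-length encodings
            continue;
        }
        return ByteClass::Other;  // bytes 0-8, 14-31 or 127
    }
    return sawHighByte ? ByteClass::AsciiBased8Bit : ByteClass::PureAscii;
}
```

A result of `Other` is the point where you either stop and call the file binary, or go on to the 16- and 32-bit checks.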
If a file doesn't meet the above constraints, examine the first 2–4 bytes of the file for a byte-order mark:
If the first two bytes are hex FE FF, the file is tentatively UTF-16 BE.
If the first two bytes are hex FF FE, and the following two bytes are not hex 00 00, the file is tentatively UTF-16 LE.
If the first four bytes are hex 00 00 FE FF, the file is tentatively UTF-32 BE.
If the first four bytes are hex FF FE 00 00, the file is tentatively UTF-32 LE.
If, through the above checks, you have determined a tentative encoding, then check only for the corresponding encoding below, to ensure that the file is not a binary file which happens to match a byte-order mark.
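The byte-order-mark checks can be sketched as below. `Bom` and `detectBom` are hypothetical names, and the result is only the tentative guess described above, not a verified encoding. The four-byte marks are tested first so that FF FE 00 00 is reported as UTF-32 LE rather than UTF-16 LE.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: the tentative encoding suggested by the BOM.
enum class Bom { None, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

Bom detectBom(const std::string& data) {
    const std::size_t n = data.size();
    // Read a byte as an unsigned value so comparisons with 0xFE/0xFF work
    // even when plain char is signed.
    auto at = [&data](std::size_t i) {
        return static_cast<unsigned char>(data[i]);
    };

    // Four-byte marks first: FF FE 00 00 must not be mistaken for UTF-16 LE.
    if (n >= 4 && at(0) == 0x00 && at(1) == 0x00 && at(2) == 0xFE && at(3) == 0xFF) {
        return Bom::Utf32BE;
    }
    if (n >= 4 && at(0) == 0xFF && at(1) == 0xFE && at(2) == 0x00 && at(3) == 0x00) {
        return Bom::Utf32LE;
    }
    if (n >= 2 && at(0) == 0xFE && at(1) == 0xFF) {
        return Bom::Utf16BE;
    }
    if (n >= 2 && at(0) == 0xFF && at(1) == 0xFE) {
        return Bom::Utf16LE;
    }
    return Bom::None;
}
```

Only the first four bytes of the file need to be passed in; anything after that is ignored.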
If you have not determined a tentative encoding, the file might still be a text file in one of these encodings, since the byte-order mark is not mandatory, so check for all encodings in the following list:
If, after all these checks, you still haven't determined an encoding, the file isn't a text file in any ASCII-based encoding I know about, so for most purposes you can probably consider it to be binary (it might still be a text file in a non-ASCII encoding such as EBCDIC, but I suspect that's well outside the scope of your concern).
This question really has no right or wrong answer to it, just complex solutions that will not work for all possible text files.
Here is a link to an article on The Old New Thing about how Notepad detects the encoding of a text file. It's not perfect, but it's interesting to see how Microsoft handles it.
To check, you must open the file as binary. You can't open the file as text. ASCII is effectively a subset of binary. After that, you must check the byte values. ASCII has byte values 0-127, but 0-31 are control characters. TAB, CR and LF are the only common control characters. You can't (portably) use 'A' and 'Z'; there's no guarantee that those character literals have their ASCII values (!). If you need them, you'll have to define them yourself:
const unsigned char ASCII_A = 0x41; // NOT 'A'
const unsigned char ASCII_Z = ASCII_A + 25;