Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?
If the byte array begins with a Byte Order Mark (BOM) then it will be easy to distinguish what encoding has been used. The standard Java classes for processing text streams will probably deal with this for you automatically.
If you do not have a BOM in your byte data, this will be substantially more difficult. The .NET classes can perform statistical analysis to try to work out the encoding, but I think this is on the assumption that you know you are dealing with text data (you just don't know which encoding was used).
If you have any control over the format for your input data your best choice would be to ensure that it contains a Byte Order Mark.
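For example, a minimal sketch of a BOM check (the method name and return convention are just illustrative) could look like this:

// Sketch: looks for a known BOM at the start of the array and returns the
// matching charset name, or null when no BOM is present.
static String charsetFromBom(byte[] data) {
    if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
        return "UTF-8";      // EF BB BF
    }
    if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
        return "UTF-16BE";   // FE FF
    }
    if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
        return "UTF-16LE";   // FF FE (also the prefix of the UTF-32LE BOM)
    }
    return null;             // no BOM; the data could still be UTF-8
}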
Try decoding it. If you do not get any errors, then it is a valid UTF-8 string.
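A minimal sketch of that approach: note that new String(bytes, "UTF-8") will not report errors (it silently substitutes the replacement character), so a CharsetDecoder configured to REPORT malformed input is needed. The method name is just for illustration.

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Sketch: attempts a strict UTF-8 decode. Any malformed or unmappable byte
// sequence raises CharacterCodingException, so a false result means the
// array is definitely not valid UTF-8; true means it merely could be text.
static boolean decodesAsUtf8(byte[] data) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        decoder.decode(ByteBuffer.wrap(data));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}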
It's not possible to make that decision with full accuracy in all cases, because a UTF-8 encoded string is one kind of arbitrary binary data, but you can look for byte sequences that are invalid in UTF-8. If you find any, you know that it's not UTF-8.
If your array is large enough, this should work out well, since it is very likely for such sequences to appear in "random" binary data such as compressed data or image files.
However, it is possible to get valid UTF-8 data that decodes to a totally nonsensical string of characters (probably from all kinds of different scripts). This is more likely with short sequences. If you're worried about that, you might have to do a closer analysis to see whether the characters that are letters all belong to the same code chart. Then again, this may yield false negatives when you have valid text input that mixes scripts.
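One possible sketch of that closer analysis, assuming Java 7+ for Character.UnicodeScript and treating the set of scripts as a rough heuristic (the method name is illustrative):

import java.util.EnumSet;
import java.util.Set;

// Sketch: collects the Unicode scripts of the letters in an already-decoded
// string. A string whose letters span many unrelated scripts is a hint, not
// proof, that the bytes were really binary data that happened to decode
// cleanly. COMMON/INHERITED entries could be filtered out for a sharper test.
static Set<Character.UnicodeScript> letterScripts(String decoded) {
    Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
    for (int i = 0; i < decoded.length(); ) {
        int cp = decoded.codePointAt(i);
        if (Character.isLetter(cp)) {
            scripts.add(Character.UnicodeScript.of(cp));
        }
        i += Character.charCount(cp);
    }
    return scripts;
}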
Here's a way to use the UTF-8 "binary" regex from the W3C site
import java.io.UnsupportedEncodingException;
import java.util.regex.Pattern;

static boolean looksLikeUTF8(byte[] utf8) throws UnsupportedEncodingException
{
    Pattern p = Pattern.compile("\\A(\n" +
        "  [\\x09\\x0A\\x0D\\x20-\\x7E]             # ASCII\n" +
        "| [\\xC2-\\xDF][\\x80-\\xBF]               # non-overlong 2-byte\n" +
        "| \\xE0[\\xA0-\\xBF][\\x80-\\xBF]          # excluding overlongs\n" +
        "| [\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}  # straight 3-byte\n" +
        "| \\xED[\\x80-\\x9F][\\x80-\\xBF]          # excluding surrogates\n" +
        "| \\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}       # planes 1-3\n" +
        "| [\\xF1-\\xF3][\\x80-\\xBF]{3}            # planes 4-15\n" +
        "| \\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}       # plane 16\n" +
        ")*\\z", Pattern.COMMENTS);

    // Decode as ISO-8859-1 so each char has the same numeric value as the
    // corresponding byte; the regex then effectively tests the raw bytes.
    String phonyString = new String(utf8, "ISO-8859-1");
    return p.matcher(phonyString).matches();
}
As originally written, the regex is meant to be used on a byte array, but you can't do that with Java's regexes; the target has to be something that implements the CharSequence interface (so a char[] is out, too). By decoding the byte[] as ISO-8859-1, you create a String in which each char has the same unsigned numeric value as the corresponding byte in the original array.
As others have pointed out, tests like this can only tell you the byte[] could contain UTF-8 text, not that it does. But the regex is so exhaustive, it seems extremely unlikely that raw binary data could slip past it. Even an array of all zeroes wouldn't match, since the regex never matches NUL. If the only possibilities are UTF-8 and binary, I'd be willing to trust this test.
And while you're at it, you could strip the UTF-8 BOM if there is one; otherwise, the UTF-8 CharsetDecoder will pass it through as if it were text.
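A sketch of that stripping step, assuming the BOM is the three-byte sequence EF BB BF (helper name illustrative):

import java.util.Arrays;

// Sketch: drops a leading UTF-8 BOM if present, so the CharsetDecoder never
// sees it; otherwise returns the array unchanged.
static byte[] stripUtf8Bom(byte[] data) {
    boolean hasBom = data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
    return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
}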
UTF-16 would be much more difficult, because there are very few byte sequences that are always invalid. The only ones I can think of offhand are high-surrogate characters that are missing their low-surrogate companions, or vice versa. Beyond that, you would need some context to decide whether a given sequence is valid. You might have a Cyrillic letter followed by a Chinese ideogram followed by a smiley-face dingbat, but it would be perfectly valid UTF-16.
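If you did want to check for that, a sketch of the unpaired-surrogate test on an already-decoded string (illustrative only; it says nothing about whether the characters make sense together):

// Sketch: returns true if every high surrogate is immediately followed by a
// low surrogate and no low surrogate appears on its own. That is about the
// only structural check UTF-16 allows beyond length being even.
static boolean surrogatesArePaired(CharSequence s) {
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isHighSurrogate(c)) {
            if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                return false;   // high surrogate without a low-surrogate companion
            }
            i++;                // skip the low surrogate we just validated
        } else if (Character.isLowSurrogate(c)) {
            return false;       // low surrogate with no preceding high surrogate
        }
    }
    return true;
}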
I think Michael has explained it well in his answer; this may be the only way to find out whether a byte array contains all valid UTF-8 sequences. I am using the following code in PHP:
function is_utf8($string) {
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
Taken from W3.org.
The question assumes that there is a fundamental difference between a string and binary data. While this is intuitively so, it is next to impossible to define precisely what that difference is.
A Java String is a sequence of 16-bit quantities, each corresponding to one of the (almost) 2**16 code points of the Unicode Basic Multilingual Plane. But if you look at those 16-bit 'characters', each one could equally represent an integer, a pair of bytes, a pixel, and so on. The bit patterns don't have anything intrinsic about them that says what they represent.
Now suppose that you rephrased your question as asking for a way to distinguish UTF-8 encoded TEXT from arbitrary binary data. Does this help? In theory no, because the bit patterns that encode any written text can also be a sequence of numbers. (It is hard to say what "arbitrary" really means here. Can you tell me how to test if a number is "arbitrary"?)
The best we can do here is the following: you can tell that a byte sequence is definitely not UTF-8 if the decode fails. Beyond that, if you make assumptions about the language of the text, you can say that a byte sequence is probably, or probably not, a UTF-8 encoded text document.
IMO, the best thing you can do is to avoid getting into a situation where your program needs to make this decision. And if you cannot avoid it, recognize that your program may get it wrong. With thought and hard work, you can make that unlikely, but the probability will never be zero.