Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?
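If the byte array begins with a Byte Order Mark (BOM), that is a strong hint about which encoding was used; for UTF-8 the BOM is the three bytes EF BB BF. Be aware that Java's standard UTF-8 decoder does not strip a UTF-8 BOM for you: it comes through as a leading U+FEFF character, so you have to check for it and skip it yourself. (The generic "UTF-16" charset, by contrast, does consume a BOM in order to pick the byte order.)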
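A minimal sketch of that BOM check, assuming the data is already fully in memory as a `byte[]`. The class and method names (`Utf8BomCheck`, `hasUtf8Bom`, `decodeSkippingBom`) are made up for illustration and are not part of any library:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf8BomCheck {

    // Returns true if the array starts with the UTF-8 BOM bytes EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB
                && data[2] == (byte) 0xBF;
    }

    // Decodes the array as UTF-8, skipping a leading BOM if one is present.
    // Malformed input is replaced with U+FFFD here rather than rejected.
    static String decodeSkippingBom(byte[] data) {
        int offset = hasUtf8Bom(data) ? 3 : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }
}
```

A BOM is only a hint: arbitrary binary data could happen to start with the same three bytes, and plenty of UTF-8 text is written without one.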
If there is no BOM in your byte data, this is substantially more difficult. .NET classes can perform statistical analysis to try to work out the encoding, but I think this relies on the assumption that you already know you are dealing with text data (and just don't know which encoding was used).
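For the specific "UTF-8 or binary" question there is a simpler heuristic than statistical analysis, offered here as a sketch rather than a definitive answer: try to decode the bytes strictly as UTF-8 and see whether the decoder rejects them. The example uses the standard `CharsetDecoder` with `CodingErrorAction.REPORT`; the class name `Utf8Validator` and method name `isValidUtf8` are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf8Validator {

    // Returns true if the whole array is a well-formed UTF-8 byte sequence.
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

This can only rule UTF-8 out, not prove that the data is text. Random binary data of any real length is very unlikely to be well-formed UTF-8, but binary data that happens to consist only of bytes below 0x80 (including zero bytes) will pass, so you may want an additional check for control characters if that case matters to you.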
If you have any control over the format of your input data, your best choice would be to ensure that it contains a Byte Order Mark.