How can I check whether a byte array contains a Unicode string in Java?

再見小時候  2021-02-19 04:14

Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?


7 Answers

  •  面向向阳花  2021-02-19 04:54

    Here's a way to use the UTF-8 "binary" regex from the W3C site:

    import java.io.UnsupportedEncodingException;
    import java.util.regex.Pattern;

    static boolean looksLikeUTF8(byte[] utf8) throws UnsupportedEncodingException
    {
      // Matches only well-formed UTF-8 byte sequences; Pattern.COMMENTS lets the
      // pattern keep the W3C annotations as # comments (each ends at the \n).
      Pattern p = Pattern.compile("\\A(\n" +
        "  [\\x09\\x0A\\x0D\\x20-\\x7E]             # ASCII\n" +
        "| [\\xC2-\\xDF][\\x80-\\xBF]               # non-overlong 2-byte\n" +
        "|  \\xE0[\\xA0-\\xBF][\\x80-\\xBF]         # excluding overlongs\n" +
        "| [\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}  # straight 3-byte\n" +
        "|  \\xED[\\x80-\\x9F][\\x80-\\xBF]         # excluding surrogates\n" +
        "|  \\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}      # planes 1-3\n" +
        "| [\\xF1-\\xF3][\\x80-\\xBF]{3}            # planes 4-15\n" +
        "|  \\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}      # plane 16\n" +
        ")*\\z", Pattern.COMMENTS);

      // Decode as ISO-8859-1 so each char has the same numeric value as the original byte
      String phonyString = new String(utf8, "ISO-8859-1");
      return p.matcher(phonyString).matches();
    }
    

    As originally written, the regex is meant to be used on a byte array, but you can't do that with Java's regexes; the target has to be something that implements the CharSequence interface (so a char[] is out, too). By decoding the byte[] as ISO-8859-1, you create a String in which each char has the same unsigned numeric value as the corresponding byte in the original array.
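
    For example, calling it might look like this (just an illustrative sketch; the sample inputs are made up):

    // Illustrative only: valid UTF-8 text vs. bytes that can never occur in UTF-8.
    byte[] utf8Text = "héllo wörld".getBytes("UTF-8");
    byte[] binary   = { (byte) 0xFF, (byte) 0xC0, (byte) 0x80, 0 };  // 0xFF and overlong C0 80 are never valid UTF-8

    System.out.println(looksLikeUTF8(utf8Text));  // true
    System.out.println(looksLikeUTF8(binary));    // false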

    As others have pointed out, tests like this can only tell you the byte[] could contain UTF-8 text, not that it does. But the regex is so exhaustive, it seems extremely unlikely that raw binary data could slip past it. Even an array of all zeroes wouldn't match, since the regex never matches NUL. If the only possibilities are UTF-8 and binary, I'd be willing to trust this test.
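
    For comparison, the same strict validation can be done without the regex by asking a UTF-8 CharsetDecoder to report malformed input instead of replacing it. A minimal sketch, assuming Java 7+ for StandardCharsets (the method name is made up):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    // Hypothetical alternative: true if the bytes decode as UTF-8 with no malformed sequences.
    static boolean decodesAsUTF8(byte[] bytes)
    {
      CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT);
      try
      {
        decoder.decode(ByteBuffer.wrap(bytes));
        return true;
      }
      catch (CharacterCodingException e)
      {
        return false;
      }
    }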

    And while you're at it, you could strip the UTF-8 BOM if there is one; otherwise, the UTF-8 CharsetDecoder will pass it through as if it were text.
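
    That BOM check is easy to do by hand; a minimal sketch (the helper name is made up) that drops a leading EF BB BF before validating:

    // Hypothetical helper: returns the array without a leading UTF-8 BOM (EF BB BF), if present.
    static byte[] stripUtf8Bom(byte[] bytes)
    {
      if (bytes.length >= 3
          && bytes[0] == (byte) 0xEF
          && bytes[1] == (byte) 0xBB
          && bytes[2] == (byte) 0xBF)
      {
        return java.util.Arrays.copyOfRange(bytes, 3, bytes.length);
      }
      return bytes;
    }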

    UTF-16 would be much more difficult, because there are very few byte sequences that are always invalid. The only ones I can think of offhand are high-surrogate characters that are missing their low-surrogate companions, or vice versa. Beyond that, you would need some context to decide whether a given sequence is valid. You might have a Cyrillic letter followed by a Chinese ideogram followed by a smiley-face dingbat, but it would be perfectly valid UTF-16.
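
    If you wanted to test for just that one condition, a rough sketch (assuming big-endian byte order and an even byte count; the method name is made up) could scan the 16-bit code units for unpaired surrogates:

    // Hypothetical check: true if every surrogate code unit is part of a proper high/low pair.
    static boolean surrogatesArePaired(byte[] bytes)
    {
      char[] units = new char[bytes.length / 2];
      for (int i = 0; i < units.length; i++)
        units[i] = (char) (((bytes[2 * i] & 0xFF) << 8) | (bytes[2 * i + 1] & 0xFF));

      for (int i = 0; i < units.length; i++)
      {
        if (Character.isHighSurrogate(units[i]))
        {
          if (i + 1 >= units.length || !Character.isLowSurrogate(units[i + 1]))
            return false;  // high surrogate without a following low surrogate
          i++;             // skip the low surrogate we just paired
        }
        else if (Character.isLowSurrogate(units[i]))
        {
          return false;    // low surrogate with no preceding high surrogate
        }
      }
      return true;
    }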
