Check if a String is valid UTF-8 encoded in Java

既然无缘 2020-11-30 02:04

How can I check if a string is in valid UTF-8 format?

2 Answers
  • 2020-11-30 02:25

    The following post is taken from the official Java tutorials available at: https://docs.oracle.com/javase/tutorial/i18n/text/string.html.

    The StringConverter program starts by creating a String containing Unicode characters:

    String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");
    

    When printed, the String named original appears as:

    AêñüC
    

    To convert the String object to UTF-8, invoke the getBytes method and specify the appropriate encoding identifier as a parameter. The getBytes method returns an array of bytes in UTF-8 format. To create a String object from an array of non-Unicode bytes, invoke the String constructor with the encoding parameter. The code that makes these calls is enclosed in a try block, in case the specified encoding is unsupported:

    try {
        byte[] utf8Bytes = original.getBytes("UTF8");
        byte[] defaultBytes = original.getBytes();
    
        String roundTrip = new String(utf8Bytes, "UTF8");
        System.out.println("roundTrip = " + roundTrip);
        System.out.println();
        printBytes(utf8Bytes, "utf8Bytes");
        System.out.println();
        printBytes(defaultBytes, "defaultBytes");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    

    The StringConverter program prints out the values in the utf8Bytes and defaultBytes arrays to demonstrate an important point: The length of the converted text might not be the same as the length of the source text. Some Unicode characters translate into single bytes, others into pairs or triplets of bytes. The printBytes method displays the byte arrays by invoking the byteToHex method, which is defined in the source file, UnicodeFormatter.java. Here is the printBytes method:

    public static void printBytes(byte[] array, String name) {
        for (int k = 0; k < array.length; k++) {
            System.out.println(name + "[" + k + "] = " + "0x" +
                UnicodeFormatter.byteToHex(array[k]));
        }
    }
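
    The byteToHex helper itself is not reproduced in this post. As a rough idea of what such a helper might look like, here is a minimal sketch; the class name ByteToHexSketch and the implementation are assumptions for illustration, not the tutorial's actual UnicodeFormatter source:

```java
class ByteToHexSketch {
    private static final char[] HEX_DIGITS = "0123456789abcdef".toCharArray();

    // Hypothetical stand-in for UnicodeFormatter.byteToHex:
    // formats one byte as exactly two lowercase hex digits.
    static String byteToHex(byte b) {
        int unsigned = b & 0xFF; // mask so negative bytes are not sign-extended
        return new String(new char[] {
            HEX_DIGITS[unsigned >>> 4],   // high nibble
            HEX_DIGITS[unsigned & 0x0F]   // low nibble
        });
    }

    public static void main(String[] args) {
        System.out.println("0x" + byteToHex((byte) 0x41)); // prints 0x41
        System.out.println("0x" + byteToHex((byte) 0xC3)); // prints 0xc3
    }
}
```

    The mask with 0xFF matters because Java bytes are signed; without it, a value such as 0xC3 would be treated as a negative int.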
    

    The output of the printBytes method follows. Note that only the first and last bytes, the A and C characters, are the same in both arrays:

    utf8Bytes[0] = 0x41
    utf8Bytes[1] = 0xc3
    utf8Bytes[2] = 0xaa
    utf8Bytes[3] = 0xc3
    utf8Bytes[4] = 0xb1
    utf8Bytes[5] = 0xc3
    utf8Bytes[6] = 0xbc
    utf8Bytes[7] = 0x43
    defaultBytes[0] = 0x41
    defaultBytes[1] = 0xea
    defaultBytes[2] = 0xf1
    defaultBytes[3] = 0xfc
    defaultBytes[4] = 0x43
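
    The defaultBytes values above (0xea, 0xf1, 0xfc) are what a single-byte default charset such as ISO-8859-1 produces. On JDK 18 and later the default charset is UTF-8, so getBytes() with no argument would most likely return the same bytes as utf8Bytes there. A small sketch that makes the length difference reproducible by naming both charsets explicitly (the class and method names are illustrative assumptions):

```java
import java.nio.charset.StandardCharsets;

class EncodedLengthSketch {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies when encoded as ISO-8859-1.
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC";
        System.out.println("chars  = " + original.length());       // 5
        System.out.println("utf8   = " + utf8Length(original));    // 8
        System.out.println("latin1 = " + latin1Length(original));  // 5
    }
}
```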
    
  • 2020-11-30 02:28

    Only byte data can be checked. Once you have constructed a String, it is already stored as UTF-16 internally.
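
    To illustrate why the check has to happen before decoding: the String constructor does not reject malformed UTF-8, it silently substitutes the replacement character U+FFFD. A small sketch (class and method names are illustrative assumptions):

```java
import java.nio.charset.StandardCharsets;

class LossyDecodeSketch {
    // True if the decoded text contains U+FFFD, the replacement character.
    static boolean containsReplacementChar(String s) {
        return s.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // 0xC3 starts a two-byte sequence, but 0x43 ('C') is not a continuation byte.
        byte[] malformed = { 0x41, (byte) 0xC3, 0x43 };
        String decoded = new String(malformed, StandardCharsets.UTF_8);
        System.out.println(containsReplacementChar(decoded)); // true
    }
}
```

    Looking for U+FFFD afterwards is only a heuristic, since valid input may legitimately contain that character; the original bytes are needed for a reliable answer.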

    Also, only byte arrays can be UTF-8 encoded; a String holds characters, not encoded bytes.
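
    One way to run such a byte-level check with the standard library is a strict CharsetDecoder that reports errors instead of replacing them. This is a minimal sketch, not the only approach; the class name Utf8Validator and method name isValidUtf8 are made up for this example:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Validator {
    // Returns true only if the whole array decodes as UTF-8 without errors.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed or truncated sequence
        }
    }

    public static void main(String[] args) {
        String text = "A\u00ea\u00f1\u00fcC";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false
    }
}
```

    Note that pure ASCII data passes this check as well, because ASCII is a subset of UTF-8; a positive result means "decodable as UTF-8", not "was definitely written as UTF-8".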

    Here is a common case of UTF-8 conversions.

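    // Requires: import java.nio.charset.StandardCharsets;
    String myString = "\u0048\u0065\u006C\u006C\u006F World"; // "Hello World"
    System.out.println(myString);

    // StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException
    // that getBytes("UTF-8") declares.
    byte[] myBytes = myString.getBytes(StandardCharsets.UTF_8);

    // Print each encoded byte as a signed decimal value (72 for 'H', and so on).
    for (int i = 0; i < myBytes.length; i++) {
        System.out.println(myBytes[i]);
    }
    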
    

    If you don't know the encoding of your byte array, juniversalchardet is a library to help you detect it.
