Getting true UTF-8 characters in Java JNI

灰色年华 2021-01-31 19:02

Is there an easy way to convert a Java string to a true UTF-8 byte array in JNI code?

Unfortunately GetStringUTFChars() almost does what's required but not qui

1 answer
  • 2021-01-31 19:52

    This is clearly explained in the Java documentation:

    JNI Functions

    GetStringUTFChars

    const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);
    

    Returns a pointer to an array of bytes representing the string in modified UTF-8 encoding. This array is valid until it is released by ReleaseStringUTFChars().

    Modified UTF-8

    The JNI uses modified UTF-8 strings to represent various string types. Modified UTF-8 strings are the same as those used by the Java VM. Modified UTF-8 strings are encoded so that character sequences that contain only non-null ASCII characters can be represented using only one byte per character, but all Unicode characters can be represented.

    All characters in the range \u0001 to \u007F are represented by a single byte, as follows:

    The seven bits of data in the byte give the value of the character represented.

    The null character ('\u0000') and characters in the range '\u0080' to '\u07FF' are represented by a pair of bytes x and y:

    The bytes represent the character with the value ((x & 0x1f) << 6) + (y & 0x3f).

    Characters in the range '\u0800' to '\uFFFF' are represented by 3 bytes x, y, and z:

    The character with the value ((x & 0xf) << 12) + ((y & 0x3f) << 6) + (z & 0x3f) is represented by the bytes.
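
    As a rough illustration of the two-byte and three-byte formulas above, here is a minimal sketch in plain C++. The function names decodeTwoBytes and decodeThreeBytes are hypothetical helpers, not part of JNI, and the sketch assumes the input bytes are already well-formed.

```cpp
#include <cstdint>

// Hypothetical helper: decodes a two-byte sequence x, y using
// ((x & 0x1f) << 6) + (y & 0x3f).
inline std::uint16_t decodeTwoBytes(std::uint8_t x, std::uint8_t y) {
    return static_cast<std::uint16_t>(((x & 0x1f) << 6) + (y & 0x3f));
}

// Hypothetical helper: decodes a three-byte sequence x, y, z using
// ((x & 0xf) << 12) + ((y & 0x3f) << 6) + (z & 0x3f).
inline std::uint16_t decodeThreeBytes(std::uint8_t x, std::uint8_t y, std::uint8_t z) {
    return static_cast<std::uint16_t>(((x & 0x0f) << 12) + ((y & 0x3f) << 6) + (z & 0x3f));
}
```

    For instance, the pair C0 80 decodes to U+0000 (the modified UTF-8 form of the null character), C3 A9 decodes to U+00E9, and E2 82 AC decodes to U+20AC.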

    Characters with code points above U+FFFF (so-called supplementary characters) are represented by separately encoding the two surrogate code units of their UTF-16 representation. Each of the surrogate code units is represented by three bytes. This means, supplementary characters are represented by six bytes, u, v, w, x, y, and z:

    The character with the value 0x10000 + ((v & 0x0f) << 16) + ((w & 0x3f) << 10) + ((y & 0x0f) << 6) + (z & 0x3f) is represented by the six bytes.
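
    To make the six-byte formula concrete, the following sketch recovers a supplementary code point from the bytes u, v, w, x, y, z. The name decodeSixBytes is a hypothetical helper, and the sketch assumes the six bytes really are an encoded surrogate pair (it performs no validation).

```cpp
#include <cstdint>

// Hypothetical helper: b holds the six bytes u, v, w, x, y, z in order.
// Only v, w, y and z carry payload bits; u and x are the fixed lead byte 0xED.
inline std::uint32_t decodeSixBytes(const std::uint8_t b[6]) {
    const std::uint32_t v = b[1], w = b[2], y = b[4], z = b[5];
    return 0x10000 + ((v & 0x0f) << 16) + ((w & 0x3f) << 10)
                   + ((y & 0x0f) << 6) + (z & 0x3f);
}
```

    With the bytes ED A0 BD ED B8 84, this yields 0x1F604.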

    The bytes of multibyte characters are stored in the class file in big-endian (high byte first) order.

    There are two differences between this format and the standard UTF-8 format. First, the null character (char)0 is encoded using the two-byte format rather than the one-byte format. This means that modified UTF-8 strings never have embedded nulls. Second, only the one-byte, two-byte, and three-byte formats of standard UTF-8 are used. The Java VM does not recognize the four-byte format of standard UTF-8; it uses its own two-times-three-byte format instead.

    For more information regarding the standard UTF-8 format, see section 3.9 Unicode Encoding Forms of The Unicode Standard, Version 4.0.

    Since U+1F604 is a supplementary character, and Java does not support UTF-8's 4-byte encoding format, U+1F604 is represented in modified UTF-8 by encoding the UTF-16 surrogate pair U+D83D U+DE04 using 3 bytes per surrogate, thus 6 bytes total.
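
    The behavior described above can be reproduced outside the JVM. The sketch below is an illustrative re-implementation of the modified UTF-8 rules, not the JVM's actual code; encodeModifiedUtf8 is a hypothetical name. It encodes each UTF-16 code unit independently, which is why a surrogate pair ends up as two three-byte sequences.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: encodes UTF-16 code units as modified UTF-8.
// Each code unit is handled on its own, so surrogates are never combined.
std::vector<std::uint8_t> encodeModifiedUtf8(const std::u16string& utf16) {
    std::vector<std::uint8_t> out;
    for (char16_t c : utf16) {
        if (c >= 0x0001 && c <= 0x007F) {
            // One byte: non-null ASCII.
            out.push_back(static_cast<std::uint8_t>(c));
        } else if (c <= 0x07FF) {
            // Two bytes: U+0000 and U+0080..U+07FF.
            out.push_back(static_cast<std::uint8_t>(0xC0 | (c >> 6)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
        } else {
            // Three bytes: U+0800..U+FFFF, including each surrogate half.
            out.push_back(static_cast<std::uint8_t>(0xE0 | (c >> 12)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

    Calling encodeModifiedUtf8(u"\U0001F604") produces ED A0 BD ED B8 84, whereas standard UTF-8 encodes the same character as the four bytes F0 9F 98 84. A lone U+0000 comes out as C0 80.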

    So, to answer your question...

    Is there an easy way to convert a Java string to a true UTF-8 byte array in JNI code?

    You can either:

    1. Use GetStringChars() to get the original UTF-16 encoded characters, and then create your own UTF-8 byte array from that. The conversion from UTF-16 to UTF-8 is a very simple algorithm to implement by hand, or you can use any pre-existing implementation provided by your platform or a third-party library.

    2. Have your JNI code call back into Java to invoke the String.getBytes(String charsetName) method to encode the jstring object to a UTF-8 byte array, e.g.:

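      #include <jni.h>
      #include <cstdio>

      // Prints the standard UTF-8 bytes of a Java string via String.getBytes("UTF-8").
      extern "C" JNIEXPORT void JNICALL Java_EmojiTest_nativeTest(JNIEnv *env, jclass cls, jstring _s)
      {
          const jclass stringClass = env->GetObjectClass(_s);
          const jmethodID getBytes = env->GetMethodID(stringClass, "getBytes", "(Ljava/lang/String;)[B");

          const jstring charsetName = env->NewStringUTF("UTF-8");
          const jbyteArray stringJbytes = (jbyteArray) env->CallObjectMethod(_s, getBytes, charsetName);
          env->DeleteLocalRef(charsetName);
          if (stringJbytes == NULL)
              return; // getBytes() threw; the exception stays pending for the Java caller

          const jsize length = env->GetArrayLength(stringJbytes);
          // Not const: ReleaseByteArrayElements() takes a non-const jbyte*.
          jbyte* pBytes = env->GetByteArrayElements(stringJbytes, NULL);
          if (pBytes == NULL)
              return; // out of memory; an exception is pending

          // jbyte is signed, so cast to avoid sign-extended output such as ffffff9f.
          for (int i = 0; i < length; ++i)
              fprintf(stderr, "%d: %02x\n", i, (unsigned char) pBytes[i]);

          // JNI_ABORT: the bytes were only read, so nothing needs copying back.
          env->ReleaseByteArrayElements(stringJbytes, pBytes, JNI_ABORT);
          env->DeleteLocalRef(stringJbytes);
      }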
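
    Option 1 can be sketched in plain C++ as follows. This is a minimal hand-written converter under stated assumptions, not a library API: utf16ToUtf8 is a hypothetical name, and replacing unpaired surrogates with U+FFFD is a design choice made here for illustration. In JNI code the input would come from GetStringChars() and GetStringLength(), followed by ReleaseStringChars(); since jchar is a 16-bit unsigned type, the sketch assumes the pointer can be passed with a reinterpret_cast to const char16_t*.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: converts UTF-16 code units to standard UTF-8.
// Assumption: unpaired surrogates are replaced with U+FFFD.
std::string utf16ToUtf8(const char16_t* chars, std::size_t length) {
    std::string out;
    for (std::size_t i = 0; i < length; ++i) {
        std::uint32_t cp = chars[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < length &&
            chars[i + 1] >= 0xDC00 && chars[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (chars[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; // unpaired surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // Four-byte form, which modified UTF-8 never produces.
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

    For the surrogate pair U+D83D U+DE04 this returns the four bytes F0 9F 98 84, and U+0000 is emitted as a single zero byte, so callers should rely on the returned string's size() rather than on a terminating null.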
      

    Regarding your remark that "the Wikipedia UTF-8 article suggests that GetStringUTFChars() actually returns CESU-8 rather than UTF-8":

    Java's Modified UTF-8 is not exactly the same as CESU-8:

    CESU-8 is similar to Java's Modified UTF-8 but does not have the special encoding of the NUL character (U+0000).
