Java modified UTF-8 strings in Python

广开言路 2021-01-12 15:24

I am interfacing with a Java application via Python. I need to be able to construct byte sequences which contain UTF-8 strings. Java uses a modified UTF-8 encoding in

4 Answers
  •  悲哀的现实
    2021-01-12 15:26

    In most cases you can ignore Modified UTF-8 (MUTF-8) and treat it as plain UTF-8, as long as the text contains no U+0000 and no characters above U+FFFF. On the Python side, handle it like this:

    1. Convert the string to normal UTF-8 and store the bytes in a buffer.
    2. Write the buffer length in bytes (not the string length in characters) as a 2-byte big-endian unsigned integer.
    3. Write the whole buffer.

    I've done this in PHP and Java didn't complain about my encoding at all (at least in Java 5).

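    MUTF-8 is mainly used for JNI and other systems with null-terminated strings. It differs from normal UTF-8 in two ways. First, U+0000 is encoded in 2 bytes (0xC0 0x80) instead of 1 byte (0x00). Second, characters above U+FFFF are encoded as a UTF-16 surrogate pair, with each surrogate written as 3 bytes, instead of as a single 4-byte sequence. U+0000 is a valid code point, but it rarely appears in real text, and DataInputStream.readUTF() doesn't enforce the 2-byte form, so it happily accepts either one. The 4-byte form is a different matter: readUTF() rejects it with a UTFDataFormatException, so text containing supplementary characters (most emoji, for example) does need real MUTF-8.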
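
    If strict MUTF-8 is ever needed, the encoding rules documented for Java's DataInput/DataOutput can be sketched in a few lines. This is only a minimal sketch: encode_mutf8 is a hypothetical helper name, not a standard-library function, and it produces the payload only, without the 2-byte length prefix.

```python
def encode_mutf8(text):
    """Encode text as Java-style Modified UTF-8 (payload only, no length prefix)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 is written in the two-byte form C0 80, never as 0x00.
            out += b'\xc0\x80'
        elif cp > 0xFFFF:
            # Supplementary character: split into a UTF-16 surrogate pair,
            # then write each surrogate as its own three-byte sequence.
            cp -= 0x10000
            for unit in (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)):
                out += bytes((0xE0 | (unit >> 12),
                              0x80 | ((unit >> 6) & 0x3F),
                              0x80 | (unit & 0x3F)))
        else:
            # Everything else matches ordinary UTF-8; 'surrogatepass' lets a
            # lone surrogate already present in the str through as three bytes.
            out += ch.encode('utf-8', 'surrogatepass')
    return bytes(out)
```

    For text without U+0000 or supplementary characters, the result is byte-for-byte identical to text.encode('utf-8'), which is why plain UTF-8 usually works.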

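    EDIT: The Python code should look like this:

    import struct

    def writeUTF(data, text):
        # data is assumed to be a list of byte chunks; text is the string to send.
        utf8 = text.encode('utf-8')
        length = len(utf8)
        # 2-byte big-endian length prefix, counted in bytes.
        data.append(struct.pack('!H', length))
        # The encoded payload itself.
        data.append(struct.pack('!%ds' % length, utf8))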
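
    To check the framing, here is a small round-trip sketch. The names pack_utf and unpack_utf are hypothetical, and the code assumes the text has no U+0000 and no characters above U+FFFF, so plain UTF-8 and MUTF-8 coincide. It also guards the 65535-byte limit imposed by the 2-byte length field.

```python
import struct

def pack_utf(text):
    """Return a 2-byte big-endian length prefix followed by the UTF-8 bytes."""
    payload = text.encode('utf-8')
    if len(payload) > 0xFFFF:
        # The length field is an unsigned 16-bit integer.
        raise ValueError('encoded string exceeds 65535 bytes')
    return struct.pack('!H', len(payload)) + payload

def unpack_utf(buf, offset=0):
    """Read one length-prefixed string; return (text, offset of next field)."""
    (length,) = struct.unpack_from('!H', buf, offset)
    start = offset + 2
    end = start + length
    if end > len(buf):
        raise ValueError('buffer is shorter than the declared length')
    return buf[start:end].decode('utf-8'), end

frame = pack_utf('h\u00e9llo')   # 5 characters, 6 bytes once encoded
text, next_offset = unpack_utf(frame)
```

    Note that the prefix counts encoded bytes, so the 5-character string above is framed with a length of 6.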
    
