How to replace/remove 4(+)-byte characters from a UTF-8 string in Java?

无人及你 2021-02-05 08:02

Because MySQL 5.1 does not support 4-byte UTF-8 sequences, I need to replace or drop the 4-byte sequences in these strings.

I'm looking for a clean way to replace these charac

3 answers
  •  故里飘歌
    2021-02-05 08:03

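    4-byte UTF-8 sequences begin with a byte of the form 11110xxx. The original UTF-8 definition also allowed 5-byte sequences, which begin with 111110xx, and 6-byte sequences, which begin with 1111110x; RFC 3629 later removed both, so well-formed UTF-8 today never exceeds 4 bytes. The important point is that a continuation byte of any sequence can never be mistaken for one of these lead bytes, because continuation bytes always have the form 10xxxxxx.

    Therefore you can just walk through the bytes: every time you see a lead byte of the form 11110xxx, emit a single '?' to the output stream or array and skip the next 3 bytes of the input. The 5- and 6-byte sequences are handled analogously, skipping 4 and 5 bytes respectively.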
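A minimal sketch of the byte scan described above, in Java. The class name `Utf8Filter` and the method names are made up for illustration, and the code assumes the input is otherwise well-formed UTF-8 (it does not validate continuation bytes). It treats a 4-byte lead (11110xxx) the same way as the obsolete 5- and 6-byte leads, since the 4-byte case is what MySQL 5.1 rejects.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class Utf8Filter {

    /** Length in bytes of the sequence that starts with the given lead byte. */
    static int sequenceLength(int lead) {
        if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
        if ((lead & 0xFC) == 0xF8) return 5; // 111110xx (obsolete form)
        if ((lead & 0xFE) == 0xFC) return 6; // 1111110x (obsolete form)
        return 1; // stray continuation byte or 0xFE/0xFF: copied through as-is
    }

    /** Replaces every sequence of 4 or more bytes with a single '?'. */
    static byte[] replaceLongSequences(byte[] utf8) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(utf8.length);
        int i = 0;
        while (i < utf8.length) {
            int len = sequenceLength(utf8[i] & 0xFF);
            if (len >= 4) {
                out.write('?'); // drop the whole sequence, emit one placeholder
            } else {
                // Copy the 1- to 3-byte sequence unchanged (clamped at end of input).
                int end = Math.min(i + len, utf8.length);
                out.write(utf8, i, end - i);
            }
            i += len; // skip the lead byte and its continuation bytes
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // U+1F600 is written as a surrogate pair and encodes to 4 bytes in UTF-8.
        byte[] in = "a\uD83D\uDE00b".getBytes(StandardCharsets.UTF_8);
        byte[] cleaned = replaceLongSequences(in);
        System.out.println(new String(cleaned, StandardCharsets.UTF_8)); // prints "a?b"
    }
}
```

If the text is already a Java `String` rather than raw bytes, the same characters are the ones outside the Basic Multilingual Plane, so filtering by code point (for example via `String.codePoints()`) would probably be simpler than a byte-level pass.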
