How to replace/remove 4(+)-byte characters from a UTF-8 string in Java?

无人及你 2021-02-05 08:02

Because MySQL 5.1 does not support 4-byte UTF-8 sequences, I need to replace or drop the 4-byte sequences in these strings.

I'm looking for a clean way to replace these characters…

3 Answers
  • 2021-02-05 08:03

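    4-byte UTF-8 sequences begin with a byte of the form 11110xxx. 5-byte sequences begin with 111110xx and 6-byte sequences with 1111110x (these two longer forms come from the original UTF-8 design and are no longer valid under RFC 3629, but may still show up in old data). The important point is that none of these lead bytes can be confused with a follow-up byte of any sequence, because follow-up bytes always have the form 10xxxxxx.

    Therefore you can simply walk through the bytes: every time you see a lead byte of the form 11110xxx, emit a single '?' to the output stream/array and skip the next 3 bytes of the input. Handle 5-byte sequences (skip the next 4 bytes) and 6-byte sequences (skip the next 5 bytes) analogously.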
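
    The byte-walking idea above could be sketched in Java roughly as follows. This is only an illustration: the class and method names (Utf8ByteFilter, replaceLongSequences, sequenceLength) are made up for this example, the input is assumed to be well-formed, and copying stray bytes through unchanged is an arbitrary choice.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not taken from any library.
final class Utf8ByteFilter {

    // Convenience overload: encode to UTF-8, filter the bytes, decode again.
    static String replaceLongSequences(String s) {
        byte[] filtered = replaceLongSequences(s.getBytes(StandardCharsets.UTF_8));
        return new String(filtered, StandardCharsets.UTF_8);
    }

    // Copies 1- to 3-byte sequences unchanged and replaces every sequence
    // of 4 or more bytes with a single '?'. Assumes well-formed input.
    static byte[] replaceLongSequences(byte[] utf8) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(utf8.length);
        int i = 0;
        while (i < utf8.length) {
            int len = sequenceLength(utf8[i] & 0xFF);
            if (len >= 4) {
                out.write('?');
            } else {
                // Guard against a sequence truncated at the end of the input.
                int end = Math.min(i + len, utf8.length);
                out.write(utf8, i, end - i);
            }
            i += len; // skip the lead byte and its follow-up bytes
        }
        return out.toByteArray();
    }

    // Sequence length implied by the lead byte's bit pattern.
    static int sequenceLength(int lead) {
        if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
        if ((lead & 0xFC) == 0xF8) return 5; // 111110xx (obsolete form)
        if ((lead & 0xFE) == 0xFC) return 6; // 1111110x (obsolete form)
        return 1; // stray 10xxxxxx or 0xFE/0xFF: treated as a single byte
    }
}
```

    For instance, Utf8ByteFilter.replaceLongSequences("a\uD83D\uDE00b") should give "a?b", since U+1F600 is encoded as the four bytes F0 9F 98 80. Working on raw bytes mostly makes sense when the data arrives as a byte stream; for a String already in memory, a code-point-based approach avoids the encode/decode round trip.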

  • 2021-02-05 08:10

    We ended up implementing the following method in Java for this problem. Basically, it replaces every character whose code point is higher than that of the last 3-byte UTF-8 character (U+FFFF).

    The offset calculations make sure we step through the string by Unicode code point rather than by char.

    public static final String LAST_3_BYTE_UTF_CHAR = "\uFFFF";
    public static final String REPLACEMENT_CHAR = "\uFFFD"; 
    
    public static String toValid3ByteUTF8String(String s)  {
        final int length = s.length();
        StringBuilder b = new StringBuilder(length);
        for (int offset = 0; offset < length; ) {
           final int codepoint = s.codePointAt(offset);
    
           // code points above U+FFFF would need 4 bytes in UTF-8, so replace them
           if (codepoint > CharUtils.LAST_3_BYTE_UTF_CHAR.codePointAt(0)) {
               b.append(CharUtils.REPLACEMENT_CHAR);
           } else {
               if (Character.isValidCodePoint(codepoint)) {
                   b.appendCodePoint(codepoint);
               } else {
                   b.append(CharUtils.REPLACEMENT_CHAR);
               }
           }
           offset += Character.charCount(codepoint);
        }
        return b.toString();
    }
    
  • 2021-02-05 08:13

    Another simple solution is to use the regular expression [^\u0000-\uFFFF]. For example, in Java:

    text.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
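
    A small self-contained sketch of how that one-liner could be wrapped, assuming a hypothetical class and method name (SupplementaryStripper, stripSupplementary). It relies on java.util.regex matching by code point, so a surrogate pair is treated as one character and replaced by a single U+FFFD.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper for illustration; the names are not from the answer.
final class SupplementaryStripper {

    // Matches any code point outside U+0000..U+FFFF, i.e. anything that
    // would need 4 bytes in UTF-8. Compiled once and reused.
    private static final Pattern NON_BMP = Pattern.compile("[^\\u0000-\\uFFFF]");

    // Returns a new String; the argument itself is not modified.
    static String stripSupplementary(String text) {
        return NON_BMP.matcher(text).replaceAll("\uFFFD");
    }
}
```

    Note that replaceAll returns a new string, so the result has to be assigned or returned; calling text.replaceAll(...) on its own line leaves text unchanged.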
    