How to replace/remove 4(+)-byte characters from a UTF-8 string in Java?

无人及你 2021-02-05 08:02

Because MySQL 5.1 does not support 4-byte UTF-8 sequences, I need to replace or drop the 4-byte sequences in these strings.

I'm looking for a clean way to replace these charac

3 Answers
  •  情深已故
    2021-02-05 08:10

    We ended up implementing the following method in Java for this problem. Basically, it replaces every character whose code point is higher than that of the last 3-byte UTF-8 character (U+FFFF) with the replacement character U+FFFD.

    The offset calculations make sure we advance by whole Unicode code points, so a surrogate pair is never split.

    public final class CharUtils {
        // Highest code point that UTF-8 encodes in three bytes (U+FFFF).
        public static final String LAST_3_BYTE_UTF_CHAR = "\uFFFF";
        // Unicode replacement character (U+FFFD).
        public static final String REPLACEMENT_CHAR = "\uFFFD";

        public static String toValid3ByteUTF8String(String s) {
            final int length = s.length();
            StringBuilder b = new StringBuilder(length);
            for (int offset = 0; offset < length; ) {
                // Read a whole code point; a surrogate pair counts as one.
                final int codepoint = s.codePointAt(offset);

                if (codepoint > LAST_3_BYTE_UTF_CHAR.codePointAt(0)) {
                    // Supplementary character (4 bytes in UTF-8): substitute it.
                    b.append(REPLACEMENT_CHAR);
                } else if (Character.isValidCodePoint(codepoint)) {
                    b.appendCodePoint(codepoint);
                } else {
                    b.append(REPLACEMENT_CHAR);
                }
                // Advance by one or two chars, depending on the code point.
                offset += Character.charCount(codepoint);
            }
            return b.toString();
        }
    }
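
    For comparison, here is a more compact sketch of the same idea, assuming a regular expression is acceptable for your strings. Java's regex engine matches by code point, so a negated class over U+0000..U+FFFF should select exactly the characters that need 4 bytes in UTF-8. The class name Utf8Bmp and its method names are hypothetical and not part of the method above; the small utf8Length helper is only there to illustrate where the 3-byte/4-byte boundary lies.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; names are assumptions.
final class Utf8Bmp {

    // Matches every code point above U+FFFF, i.e. every character
    // that UTF-8 encodes in four bytes.
    private static final String SUPPLEMENTARY = "[^\\u0000-\\uFFFF]";

    // Replaces each supplementary code point with U+FFFD.
    static String replaceSupplementary(String s) {
        return s.replaceAll(SUPPLEMENTARY, "\uFFFD");
    }

    // Drops each supplementary code point entirely.
    static String removeSupplementary(String s) {
        return s.replaceAll(SUPPLEMENTARY, "");
    }

    // Number of bytes the given code point occupies in UTF-8.
    static int utf8Length(int codePoint) {
        String single = new String(Character.toChars(codePoint));
        return single.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (written as a surrogate pair), 'b'
        String input = "a\uD83D\uDE00b";

        System.out.println(input.length());                          // 4 chars
        System.out.println(input.codePointCount(0, input.length())); // 3 code points
        System.out.println(utf8Length(0xFFFF));                      // 3
        System.out.println(utf8Length(0x1F600));                     // 4
        System.out.println(removeSupplementary(input));              // ab
        System.out.println(replaceSupplementary(input).equals("a\uFFFDb")); // true
    }
}
```

    Whether to replace or remove is a design choice: replacing with U+FFFD keeps a visible marker where a character was lost, while removing yields shorter but silently altered text.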
    
