How to remove bad characters that are not suitable for utf8 encoding in MySQL?

刺人心 2020-12-16 13:05

I have dirty data that sometimes contains characters like this. I use this data to build queries such as:

WHERE a.address IN ('mydatahere')

For

6 answers
  • 2020-12-16 13:40

    You can filter surrogate characters with this regex:

    String str  = "                                                                    
  • 2020-12-16 13:45

    Maybe this will help someone, as it helped me.

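    public static String removeBadChars(String s) {
      if (s == null) return null;
      StringBuilder sb = new StringBuilder(s.length());
      for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        // Skip both halves of a surrogate pair; dropping only the high half
        // would leave an orphaned low surrogate behind.
        if (Character.isSurrogate(c)) continue;
        sb.append(c);
      }
      return sb.toString();
    }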
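
    MySQL's legacy utf8 character set (utf8mb3) stores at most three bytes per character, so the characters it rejects are the supplementary code points above U+FFFF. A code-point-based filter is an alternative to checking surrogates one char at a time. The following is a minimal sketch under that assumption; the class name SupplementaryFilter and the method stripSupplementary are hypothetical names chosen for illustration.

```java
final class SupplementaryFilter {
    private SupplementaryFilter() {}

    /**
     * Returns s without supplementary code points (above U+FFFF)
     * and without unpaired surrogates. Hypothetical helper.
     */
    static String stripSupplementary(String s) {
        if (s == null) return null;
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         // keep only BMP code points that are not (lone) surrogates
         .filter(cp -> Character.isBmpCodePoint(cp) && !isSurrogate(cp))
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    private static boolean isSurrogate(int cp) {
        return cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE;
    }
}
```

    With this sketch, a call such as SupplementaryFilter.stripSupplementary(address) would drop emoji and similar 4-byte characters while leaving BMP text untouched.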
    
  • 2020-12-16 13:46

    Once you convert a byte array to a String in Java, the text is held internally as UTF-16. One way to get rid of non-UTF-8 sequences is the following code:

    String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
    for (int i = 0; i < values.length; i++) {
        System.out.println(values[i].replaceAll(
                        //"[\\\\x00-\\\\x7F]|" + //single-byte sequences   0xxxxxxx - commented because of capital letters
                        "[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                        "[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                        "[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
                , ""));
    }
    

    Or, if you want to check whether a string contains non-UTF-8 characters, you can use Pattern.matches like this:

    String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
    for (int i = 0; i < values.length; i++) {
        System.out.println(Pattern.matches(
                        ".*(" +
                        //"[\\\\x00-\\\\x7F]|" + //single-byte sequences   0xxxxxxx - commented because of capital letters
                        "[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                        "[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                        "[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
                        + ").*"
                , values[i]));
    }
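
    The patterns above operate on text. Where the raw bytes are still available, a strict decoder is another way to check for invalid UTF-8. The following is a minimal sketch of that alternative, not the method used in this answer; the class name Utf8Check and the method isValidUtf8 are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    private Utf8Check() {}

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes decode() throw instead of silently replacing bad input
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

    Note that a valid 4-byte sequence such as an emoji passes this check; it only detects byte sequences that are not UTF-8 at all.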
    

    To make a whole web app UTF-8 compatible, read here:
    How to get UTF-8 working in Java webapps
    More on Byte Encodings and Strings.
    You can check your pattern here.
    The same in PHP here.

  • 2020-12-16 13:48

    You can encode and then decode it to/from UTF-8:

    String label = "look into my eyes 〠.〠";
    
    Charset charset = Charset.forName("UTF-8");
    label = charset.decode(charset.encode(label)).toString();
    
    System.out.println(label);
    

    output:

    look into my eyes ?.?
    

    edit: I think this might only work on Java 6.
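
    As far as I can tell, on current JDKs the same round trip leaves well-formed text unchanged (including 〠 and emoji) and only turns unpaired surrogates into ?, which may explain why the result differs between Java versions. A minimal sketch to try it out; the class name RoundTrip and the method roundTrip are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class RoundTrip {
    private RoundTrip() {}

    /** Encodes s to UTF-8 bytes and decodes it back. */
    static String roundTrip(String s) {
        // getBytes replaces input that cannot be encoded (unpaired surrogates) with '?'
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

    If the goal is to drop 4-byte characters for MySQL, this round trip alone is therefore probably not enough.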

  • 2020-12-16 13:56

    When I had a problem like this, I used a Perl script to make sure the data was converted to valid UTF-8, with code like this:

    use Encode;
    binmode(STDOUT, ":utf8");
    while (<>) {
        print Encode::decode('UTF-8', $_);
    }
    

    This script takes (possibly corrupted) UTF-8 on stdin and re-prints valid UTF-8 to stdout. Invalid characters are replaced with � (U+FFFD, the Unicode replacement character).

    If you run this script on good UTF-8 input, output should be identical to input.

    If your data is in a database, it makes sense to use DBI to scan your table(s) and scrub all the data with this approach, so that everything is valid UTF-8.

    This is a Perl one-liner version of the same script:

    perl -MEncode -e "binmode STDOUT,':utf8';while(<>){print Encode::decode 'UTF-8',\$_}" < bad.txt > good.txt
    

    EDIT: Added a Java-only solution.

    Here is an example of how to do this in Java:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    
    public class UtfFix {
        public static void main(String[] args) throws CharacterCodingException {
            CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
            decoder.onMalformedInput(CodingErrorAction.REPLACE);
            decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
            ByteBuffer bb = ByteBuffer.wrap(new byte[] {
                (byte) 0xD0, (byte) 0x9F, // 'П'
                (byte) 0xD1, (byte) 0x80, // 'р'
                (byte) 0xD0,              // corrupted UTF-8, was 'и'
                (byte) 0xD0, (byte) 0xB2, // 'в'
                (byte) 0xD0, (byte) 0xB5, // 'е'
                (byte) 0xD1, (byte) 0x82  // 'т'
            });
            CharBuffer parsed = decoder.decode(bb);
            System.out.println(parsed);
            // this prints: Пр�вет (U+FFFD, shown as '?' where the console cannot render it)
        }
    }
    
  • 2020-12-16 13:56

    In PHP, I approach this by allowing only printable data, which really helps in cleaning the data for the DB.
    It's pre-processing, though, and sometimes you don't have that luxury.

    $str = preg_replace('/[[:^print:]]/', '', $str);
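
    A rough Java counterpart of this printable-only filter might look like the sketch below, assuming ASCII-only output is acceptable. Java's \p{Print} class is ASCII-only by default, so the filter also strips legitimate non-ASCII text along with control characters. The class name PrintableFilter and the method keepAsciiPrintable are hypothetical.

```java
final class PrintableFilter {
    private PrintableFilter() {}

    /** Removes every character outside the ASCII printable range (0x20 to 0x7E). */
    static String keepAsciiPrintable(String s) {
        // \p{Print} is the POSIX printable class: ASCII letters, digits, punctuation and space
        return s.replaceAll("[^\\p{Print}]", "");
    }
}
```

    Because accented letters and CJK text are removed as well, this is only suitable when the column is expected to hold plain ASCII.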
    