I have dirty data. Sometimes it contains characters like this. I use this data to make queries like
WHERE a.address IN (\'mydatahere\')
For
You can filter surrogate characters with this regex:
String str = "
May be this will help someone as it helped me.
public static String removeBadChars(String s) {
if (s == null) return null;
StringBuilder sb = new StringBuilder();
for(int i=0;i<s.length();i++){
if (Character.isHighSurrogate(s.charAt(i))) continue;
sb.append(s.charAt(i));
}
return sb.toString();
}
Once you convert the byte array to String on the java machine, you'll get (by default on most machines) UTF-16 encoded string. The proper solution to get rid of non UTF-8 characters is with the following code:
String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
for (int i = 0; i < values.length; i++) {
System.out.println(values[i].replaceAll(
//"[\\\\x00-\\\\x7F]|" + //single-byte sequences 0xxxxxxx - commented because of capitol letters
"[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences 110xxxxx 10xxxxxx
"[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences 1110xxxx 10xxxxxx * 2
"[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
, ""));
}
or if you want to validate if some string contains non utf8 characters you would use Pattern.matches like:
String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
for (int i = 0; i < values.length; i++) {
System.out.println(Pattern.matches(
".*(" +
//"[\\\\x00-\\\\x7F]|" + //single-byte sequences 0xxxxxxx - commented because of capitol letters
"[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences 110xxxxx 10xxxxxx
"[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences 1110xxxx 10xxxxxx * 2
"[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
+ ").*"
, values[i]));
}
For making a whole web app be UTF8 compatible read here:
How to get UTF-8 working in Java webapps
More on Byte Encodings and Strings.
You can check your pattern here.
The same in PHP here.
You can encode and then decode it to/from UTF-8:
String label = "look into my eyes 〠.〠";
Charset charset = Charset.forName("UTF-8");
label = charset.decode(charset.encode(label)).toString();
System.out.println(label);
output:
look into my eyes ?.?
edit: I think this might only work on Java 6.
When I had problem like this, I used Perl script to ensure that data is converted to valid UTF-8 by using code like this:
use Encode;
binmode(STDOUT, ":utf8");
while (<>) {
print Encode::decode('UTF-8', $_);
}
This script takes (possibly corrupted) UTF-8 on stdin
and re-prints valid UTF-8 to stdout
. Invalid characters are replaced with �
(U+FFFD
, Unicode replacement character).
If you run this script on good UTF-8 input, output should be identical to input.
If you have data in database, it makes sense to use DBI to scan your table(s) and scrub all data using this approach to make sure that everything is valid UTF-8.
This is Perl one-liner version of this same script:
perl -MEncode -e "binmode STDOUT,':utf8';while(<>){print Encode::decode 'UTF-8',\$_}" < bad.txt > good.txt
EDIT: Added Java-only solution.
This is an example how to do this in Java:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
public class UtfFix {
public static void main(String[] args) throws InterruptedException, CharacterCodingException {
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPLACE);
decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
ByteBuffer bb = ByteBuffer.wrap(new byte[] {
(byte) 0xD0, (byte) 0x9F, // 'П'
(byte) 0xD1, (byte) 0x80, // 'р'
(byte) 0xD0, // corrupted UTF-8, was 'и'
(byte) 0xD0, (byte) 0xB2, // 'в'
(byte) 0xD0, (byte) 0xB5, // 'е'
(byte) 0xD1, (byte) 0x82 // 'т'
});
CharBuffer parsed = decoder.decode(bb);
System.out.println(parsed);
// this prints: Пр?вет
}
}
In PHP - I approach this by only allowing printable data. This really helps in cleaning the data for DB.
It's pre-processing though and sometimes you don't have that luxury.
$str = preg_replace('/[[:^print:]]/', '', $str);