How to remove surrogate characters in Java?

I am facing a situation where i get Surrogate characters in text that i am saving to MySql 5.1. As the UTF-16 is not supported in this, I want to remove these surrogate pairs manually by a java method before saving it to the database.

I have written the following method for now and I am curious to know if there is a direct and optimal way to handle this.

Thanks in advance for your help.

public static String removeSurrogates(String query) {
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < query.length() - 1; i++) {
        char firstChar = query.charAt(i);
        char nextChar = query.charAt(i+1);
        if (Character.isSurrogatePair(firstChar, nextChar) == false) {
            sb.append(firstChar);
        } else {
            i++;
        }
    }
    if (Character.isHighSurrogate(query.charAt(query.length() - 1)) == false
            && Character.isLowSurrogate(query.charAt(query.length() - 1)) == false) {
        sb.append(query.charAt(query.length() - 1));
    }

    return sb.toString();
}

Here's a couple things:

Character.isSurrogate(char c):

A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.
Checking for pairs seems pointless, why not just remove all surrogates?
x == false is equivalent to !x
StringBuilder is better in cases where you don't need synchronization (like a variable that never leaves local scope).

I suggest this:

public static String removeSurrogates(String query) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < query.length(); i++) {
        char c = query.charAt(i);
        // !isSurrogate(c) in Java 7
        if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
            sb.append(firstChar);
        }
    }
    return sb.toString();
}

Breaking down the `if` statement

You asked about this statement:

if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
    sb.append(firstChar);
}

One way to understand it is to break each operation into its own function, so you can see that the combination does what you'd expect:

static boolean isSurrogate(char c) {
    return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
}

static boolean isNotSurrogate(char c) {
    return !isSurrogate(c);
}

...

if (isNotSurrogate(c)) {
    sb.append(firstChar);
}

Java strings are stored as sequences of 16-bit chars, but what they represent is sequences of unicode characters. In unicode terminology, they are stored as code units, but model code points. Thus, it's somewhat meaningless to talk about removing surrogates, which don't exist in the character / code point representation (unless you have rogue single surrogates, in which case you have other problems).

Rather, what you want to do is to remove any characters which will require surrogates when encoded. That means any character which lies beyond the basic multilingual plane. You can do that with a simple regular expression:

return query.replaceAll("[^\u0000-\uffff]", "");

why not simply

for (int i = 0; i < query.length(); i++) 
    char c = query.charAt(i);
    if(!isHighSurrogate(c) && !isLowSurrogate(c))
        sb.append(c);

you probably should replace them with "?", instead of out right erasing them.

Just curious. If char is high surrogate is there a need to check the next one? It is supposed to be low surrogate. The modified version would be:

public static String removeSurrogates(String query) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < query.length(); i++) {
        char ch = query.charAt(i);
        if (Character.isHighSurrogate(ch))
            i++;//skip the next char is it's supposed to be low surrogate
        else
            sb.append(ch);
    }    
    return sb.toString();
}

if remove, all these solutions are useful but if repalce, below is better

StringBuffer sb = new StringBuffer();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if(Character.isHighSurrogate(c)){
            sb.append('*');
        }else if(!Character.isLowSurrogate(c)){
            sb.append(c);
        }
    }
    return sb.toString();

来源：https://stackoverflow.com/questions/12867000/how-to-remove-surrogate-characters-in-java

标签

java

string

surrogate-pairs

How to remove surrogate characters in Java?

Breaking down the if statement

Breaking down the `if` statement