How to remove surrogate characters in Java?

喜欢而已 提交于 2019-11-28 18:53:41

Here's a couple things:

  • Character.isSurrogate(char c):

    A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.

  • Checking for pairs seems pointless, why not just remove all surrogates?

  • x == false is equivalent to !x

  • StringBuilder is better in cases where you don't need synchronization (like a variable that never leaves local scope).

I suggest this:

public static String removeSurrogates(String query) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < query.length(); i++) {
        char c = query.charAt(i);
        // !isSurrogate(c) in Java 7
        if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
            sb.append(firstChar);
        }
    }
    return sb.toString();
}

Breaking down the if statement

You asked about this statement:

if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
    sb.append(firstChar);
}

One way to understand it is to break each operation into its own function, so you can see that the combination does what you'd expect:

static boolean isSurrogate(char c) {
    return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
}

static boolean isNotSurrogate(char c) {
    return !isSurrogate(c);
}

...

if (isNotSurrogate(c)) {
    sb.append(firstChar);
}

Java strings are stored as sequences of 16-bit chars, but what they represent is sequences of unicode characters. In unicode terminology, they are stored as code units, but model code points. Thus, it's somewhat meaningless to talk about removing surrogates, which don't exist in the character / code point representation (unless you have rogue single surrogates, in which case you have other problems).

Rather, what you want to do is to remove any characters which will require surrogates when encoded. That means any character which lies beyond the basic multilingual plane. You can do that with a simple regular expression:

return query.replaceAll("[^\u0000-\uffff]", "");

why not simply

for (int i = 0; i < query.length(); i++) 
    char c = query.charAt(i);
    if(!isHighSurrogate(c) && !isLowSurrogate(c))
        sb.append(c);

you probably should replace them with "?", instead of out right erasing them.

Just curious. If char is high surrogate is there a need to check the next one? It is supposed to be low surrogate. The modified version would be:

public static String removeSurrogates(String query) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < query.length(); i++) {
        char ch = query.charAt(i);
        if (Character.isHighSurrogate(ch))
            i++;//skip the next char is it's supposed to be low surrogate
        else
            sb.append(ch);
    }    
    return sb.toString();
}

if remove, all these solutions are useful but if repalce, below is better

StringBuffer sb = new StringBuffer();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if(Character.isHighSurrogate(c)){
            sb.append('*');
        }else if(!Character.isLowSurrogate(c)){
            sb.append(c);
        }
    }
    return sb.toString();
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!