How to remove surrogate characters in Java?

心已入冬 提交于 2019-11-27 12:27:38

问题


I am facing a situation where i get Surrogate characters in text that i am saving to MySql 5.1. As the UTF-16 is not supported in this, I want to remove these surrogate pairs manually by a java method before saving it to the database.

I have written the following method for now and I am curious to know if there is a direct and optimal way to handle this.

Thanks in advance for your help.

public static String removeSurrogates(String query) {
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < query.length() - 1; i++) {
        char firstChar = query.charAt(i);
        char nextChar = query.charAt(i+1);
        if (Character.isSurrogatePair(firstChar, nextChar) == false) {
            sb.append(firstChar);
        } else {
            i++;
        }
    }
    if (Character.isHighSurrogate(query.charAt(query.length() - 1)) == false
            && Character.isLowSurrogate(query.charAt(query.length() - 1)) == false) {
        sb.append(query.charAt(query.length() - 1));
    }

    return sb.toString();
}

回答1:


Here's a couple things:

  • Character.isSurrogate(char c):

    A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.

  • Checking for pairs seems pointless, why not just remove all surrogates?

  • x == false is equivalent to !x

  • StringBuilder is better in cases where you don't need synchronization (like a variable that never leaves local scope).

I suggest this:

public static String removeSurrogates(String query) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < query.length(); i++) {
        char c = query.charAt(i);
        // !isSurrogate(c) in Java 7
        if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
            sb.append(firstChar);
        }
    }
    return sb.toString();
}

Breaking down the if statement

You asked about this statement:

if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
    sb.append(firstChar);
}

One way to understand it is to break each operation into its own function, so you can see that the combination does what you'd expect:

static boolean isSurrogate(char c) {
    return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
}

static boolean isNotSurrogate(char c) {
    return !isSurrogate(c);
}

...

if (isNotSurrogate(c)) {
    sb.append(firstChar);
}



回答2:


Java strings are stored as sequences of 16-bit chars, but what they represent is sequences of unicode characters. In unicode terminology, they are stored as code units, but model code points. Thus, it's somewhat meaningless to talk about removing surrogates, which don't exist in the character / code point representation (unless you have rogue single surrogates, in which case you have other problems).

Rather, what you want to do is to remove any characters which will require surrogates when encoded. That means any character which lies beyond the basic multilingual plane. You can do that with a simple regular expression:

return query.replaceAll("[^\u0000-\uffff]", "");



回答3:


why not simply

for (int i = 0; i < query.length(); i++) 
    char c = query.charAt(i);
    if(!isHighSurrogate(c) && !isLowSurrogate(c))
        sb.append(c);

you probably should replace them with "?", instead of out right erasing them.




回答4:


Just curious. If char is high surrogate is there a need to check the next one? It is supposed to be low surrogate. The modified version would be:

public static String removeSurrogates(String query) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < query.length(); i++) {
        char ch = query.charAt(i);
        if (Character.isHighSurrogate(ch))
            i++;//skip the next char is it's supposed to be low surrogate
        else
            sb.append(ch);
    }    
    return sb.toString();
}



回答5:


if remove, all these solutions are useful but if repalce, below is better

StringBuffer sb = new StringBuffer();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if(Character.isHighSurrogate(c)){
            sb.append('*');
        }else if(!Character.isLowSurrogate(c)){
            sb.append(c);
        }
    }
    return sb.toString();


来源:https://stackoverflow.com/questions/12867000/how-to-remove-surrogate-characters-in-java

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!