Java Replace Unicode Characters in a String

大憨熊 submitted on 2020-01-01 19:31:21

Question


I have a string which contains multiple Unicode characters. I want to identify all of these Unicode characters, e.g. \uF06C, and replace each one with a backslash and the four hex digits, without the "u".

Example:

Source String: "add \uF06Cd1 Clause"

Result String: "add \F06Cd1 Clause"

How can I achieve this in Java?

Edit:

The linked question, "Java Regex - How to replace a pattern or how to", is different from this one, as my question deals with Unicode characters. Although the escape is written with multiple characters in the source, the JVM treats it as one single character, and hence a regex won't work.
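The distinction described in this edit can be illustrated with a short sketch. The class name EscapeVsCharacter and the constant names are made up for the example. It assumes the text is either written as a Java string literal (where the compiler resolves the escape) or holds the backslash and the "u" as real characters.

```java
final class EscapeVsCharacter {
    // Compiled form: the compiler has already turned the escape into one char (U+F06C).
    static final String COMPILED = "add \uF06Cd1 Clause";
    // Literal form: a backslash, 'u' and four hex digits stored as six separate chars.
    static final String LITERAL = "add \\uF06Cd1 Clause";

    public static void main(String[] args) {
        System.out.println(COMPILED.length());       // 14
        System.out.println(LITERAL.length());        // 19
        System.out.println(COMPILED.contains("\\")); // false: nothing for a regex to find
        System.out.println(LITERAL.contains("\\"));  // true
    }
}
```

Only the second form contains a backslash at runtime, so a regex-based replacement can only ever apply to that form.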


Answer 1:


The correct way to do this is to use a regex that matches the entire Unicode escape, combined with group replacement.

The regex to match the Unicode escape:

A Unicode escape looks like \uABCD, i.e. \u followed by four hexadecimal digits. Matching these can be done using

\\u[A-Fa-f\d]{4}

But there's a problem with this:
In a string whose content is "just some \\uabcd arbitrary text", the backslash before the u is itself escaped, yet \uabcd would still get matched. So we need to make sure the \u is preceded by an even number of \s:

(?<!\\)(\\\\)*\\u[A-Fa-f\d]{4}
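To check how the lookbehind behaves, here is a minimal sketch that compiles the pattern above and tries it on inputs with one, two and three backslashes. The class name EscapeDetector and the method containsUnicodeEscape are hypothetical names chosen for the illustration.

```java
import java.util.regex.Pattern;

final class EscapeDetector {
    // The pattern from the text: an even number of backslashes,
    // then a backslash, 'u' and four hex digits.
    static final Pattern UNESCAPED =
            Pattern.compile("(?<!\\\\)(\\\\\\\\)*\\\\u[A-Fa-f\\d]{4}");

    // Hypothetical helper: true if the text contains an escape that is not itself escaped.
    static boolean containsUnicodeEscape(String text) {
        return UNESCAPED.matcher(text).find();
    }

    public static void main(String[] args) {
        String single = "just some \\uabcd arbitrary text";    // one backslash at runtime
        String doubled = "just some \\\\uabcd arbitrary text"; // two backslashes at runtime
        System.out.println(containsUnicodeEscape(single));  // true
        System.out.println(containsUnicodeEscape(doubled)); // false
    }
}
```

With three backslashes the escape matches again, since the first two form an escaped backslash and the third starts the escape.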

Now as output, we want a backslash followed by the hex part. This can be done by group replacement, so let's start by grouping characters. A repeated capturing group such as (\\\\)* keeps only its last repetition, so the run of escaped backslashes is wrapped in one outer group with a non-capturing group inside:

(?<!\\)((?:\\\\)*)(\\u)([A-Fa-f\d]{4})

As a replacement we want all backslashes from the first group (the escaped backslash pairs), followed by a backslash and the hex part of the Unicode escape:

$1\\$3

Now for the actual code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// test holds the input text
String pattern = "(?<!\\\\)((?:\\\\\\\\)*)(\\\\u)([A-Fa-f\\d]{4})";
String replace = "$1\\\\$3";

Matcher match = Pattern.compile(pattern).matcher(test);
String result = match.replaceAll(replace);

That's a lot of backslashes! The reason is that backslashes need to be escaped twice: once for the Java string literal and once for the regex. So "\\\\" as a pattern string in Java source matches one \ in the input.
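The two escaping layers can be seen side by side in a small sketch. It contrasts String.replace, which takes its arguments literally, with String.replaceAll, which interprets the first argument as a regex. The class and method names are invented for the illustration, and the forward slash is just an arbitrary replacement.

```java
final class BackslashLayers {
    // Literal replacement: only Java's own escaping applies,
    // so a two-backslash literal denotes one backslash.
    static String viaReplace(String text) {
        return text.replace("\\", "/");
    }

    // Regex replacement: the backslash must also be escaped for the regex engine,
    // so the Java literal needs four backslashes to match one.
    static String viaReplaceAll(String text) {
        return text.replaceAll("\\\\", "/");
    }

    public static void main(String[] args) {
        String text = "a\\b\\c";                 // a, backslash, b, backslash, c
        System.out.println("\\\\".length());     // 2
        System.out.println(viaReplace(text));    // a/b/c
        System.out.println(viaReplaceAll(text)); // a/b/c
    }
}
```

The same doubling applies to the replacement string of replaceAll, where a backslash is also an escape character.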

EDIT:
On actual strings, where each escape has already been compiled into a single character, those characters need to be filtered out and replaced by their hexadecimal code:

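// in is the input string
StringBuilder sb = new StringBuilder();
for (char c : in.toCharArray()) {
    if (c > 127) {
        // non-ASCII: backslash plus the four-digit hex code (uppercase, as in the question)
        sb.append("\\").append(String.format("%04X", (int) c));
    } else {
        sb.append(c);
    }
}
String result = sb.toString();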

This assumes that by "unicode-character" you mean non-ASCII characters. The code emits any ASCII character as is and every other character as a backslash followed by its Unicode code in hex. The term "unicode-character" is rather vague, though, as a char in Java always holds a Unicode (UTF-16) value. This approach preserves control characters like "\n" and "\r", which is why I chose it over other definitions.
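As a usage sketch, the loop can be wrapped in a method and run on the example from the question. The class name NonAsciiEscaper and the method escapeNonAscii are hypothetical. The sketch assumes uppercase hex output, to match the expected result in the question, and it treats each UTF-16 char separately, so a character outside the Basic Multilingual Plane would come out as two escapes.

```java
final class NonAsciiEscaper {
    // Hypothetical helper: writes every char above 127 as a backslash plus four hex digits.
    static String escapeNonAscii(String in) {
        StringBuilder sb = new StringBuilder(in.length());
        for (char c : in.toCharArray()) {
            if (c > 127) {
                sb.append('\\').append(String.format("%04X", (int) c));
            } else {
                sb.append(c); // ASCII, including control chars, is copied unchanged
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The literal below holds one char, U+F06C, once compiled.
        System.out.println(escapeNonAscii("add \uF06Cd1 Clause")); // add \F06Cd1 Clause
    }
}
```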




Answer 2:


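Try using the String.replaceAll() method. The backslashes have to be escaped for both the Java literal and the regex, and this only applies when the string really contains the characters \ and u:

s = s.replaceAll("\\\\u", "\\\\");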



Source: https://stackoverflow.com/questions/41667632/java-replace-unicode-characters-in-a-string
