Check out my code below. I have a JSON string which contains Unicode character codes. I convert it to my Java object and then convert it back to a JSON string. However, the resulting string no longer contains the Unicode escape sequences.
Unfortunately, Gson does not seem to support it. As of Gson 2.8.0, all JSON input and output is concentrated in JsonReader and JsonWriter respectively. JsonReader can read Unicode escapes using its private readEscapeCharacter method. However, unlike JsonReader, JsonWriter simply writes a string to the backing Writer instance, making no character corrections for characters above 127 except \u2028 and \u2029. Probably the only thing you can do here is write a custom escaping Writer so that you can emit Unicode escapes yourself.
import java.io.IOException;
import java.io.Writer;

final class EscapedWriter
        extends Writer {

    private static final char[] hex = {
        '0', '1', '2', '3',
        '4', '5', '6', '7',
        '8', '9', 'a', 'b',
        'c', 'd', 'e', 'f'
    };

    private final Writer writer;

    // I/O components are usually implemented in a not thread-safe manner,
    // so we can save some time by reusing a single UTF-16 escape buffer
    private final char[] escape = { '\\', 'u', 0, 0, 0, 0 };

    EscapedWriter(final Writer writer) {
        this.writer = writer;
    }

    // This implementation is not very efficient and is open for enhancements:
    // * constructing a single "normalized" buffer character array so that it
    //   could be passed to the downstream writer rather than writing characters
    //   one by one
    // * etc...
    @Override
    public void write(final char[] buffer, final int offset, final int length)
            throws IOException {
        // note: the loop must stop at offset + length, not at length
        for ( int i = offset; i < offset + length; i++ ) {
            final int ch = buffer[i];
            if ( ch < 128 ) {
                writer.write(ch);
            } else {
                escape[2] = hex[(ch & 0xF000) >> 12];
                escape[3] = hex[(ch & 0x0F00) >> 8];
                escape[4] = hex[(ch & 0x00F0) >> 4];
                escape[5] = hex[ch & 0x000F];
                writer.write(escape);
            }
        }
    }

    @Override
    public void flush()
            throws IOException {
        writer.flush();
    }

    @Override
    public void close()
            throws IOException {
        writer.close();
    }

    // Some java.io.Writer subclasses (e.g. StringWriter) may use toString()
    // to materialize their accumulated state by design, so it has to be
    // overridden and forwarded as well
    @Override
    public String toString() {
        return writer.toString();
    }

}
This writer is NOT well-tested and gives no special treatment to \u2028 and \u2029. Then just configure the output destination when invoking the toJson method:
final String input = "{\"description\":\"Tikrovi\\u0161kai para\\u0161ytas k\\u016brinys\"}";
final Book book = gson.fromJson(input, Book.class);
final Writer output = new EscapedWriter(new StringWriter());
gson.toJson(book, output);
System.out.println(input);
System.out.println(output);
Output:
{"description":"Tikrovi\u0161kai para\u0161ytas k\u016brinys"}
{"description":"Tikrovi\u0161kai para\u0161ytas k\u016brinys"}
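To see why this per-char approach also covers characters outside the Basic Multilingual Plane, here is a compact standalone sketch of the same escaping logic (the escapeNonAscii helper is hypothetical, not part of the writer above). Java strings are UTF-16, so a supplementary character such as an emoji arrives as two chars, and each surrogate is emitted as its own \uXXXX escape, which JSON parsers recombine on reading:

```java
public class EscapeDemo {

    // hypothetical helper mirroring EscapedWriter's per-char logic
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 128) {
                sb.append(ch);
            } else {
                // each UTF-16 code unit becomes one \uXXXX escape
                sb.append(String.format("\\u%04x", (int) ch));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+016B (ū) fits in one char; U+1F600 (😀) needs a surrogate pair
        System.out.println(escapeNonAscii("k\u016brinys"));                       // k\u016brinys
        System.out.println(escapeNonAscii(new String(Character.toChars(0x1F600)))); // \ud83d\ude00
    }
}
```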
It's an interesting problem, and you might also raise an issue on google/gson to add a string-escaping configuration option, or at least to get some comments from the development team. I believe they are well aware of this behavior and made it work like that by design; still, they could shed some light on it. (The only reason I can think of is performance: they avoid an additional transformation before writing a string. But that is a weak guess.)
There is a question that was marked as a duplicate of this one: unicode characters in json file to be unconverted after managing java gson. I answered that question and the answer was accepted as an appropriate solution, so below is a copy of my answer:
Actually, a big advantage of Unicode escapes is that any client reads and treats a "\u..." code just the same as its character representation. For instance, if in an HTML file you replace every single character with its Unicode escape, the browser will read it as usual. I.e., replace 'H' in "Hello world" with '\u0048' (which is the Unicode escape for 'H') and in the browser you will still see "Hello world". But in this case it works against you, as Gson simply replaces Unicode escapes with the characters they denote.
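The same equivalence holds in Java source code, where the compiler translates \u escapes before the program is even lexed; a quick sketch:

```java
public class UnicodeEscapeDemo {
    public static void main(String[] args) {
        // the compiler turns \u0048 into 'H' at compile time,
        // so both literals denote the identical string
        String escaped = "\u0048ello world";
        String plain = "Hello world";
        System.out.println(escaped.equals(plain)); // prints true
    }
}
```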
My suggestion may not be perfect, but it will work: before converting your object, remember the locations of your Unicode escapes, and after conversion change those characters back to escapes. Here is a tool that may help you: the open-source library MgntUtils (written by me) has a utility that converts any string to a sequence of Unicode escapes and vice versa.
You can do:
String s = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Hello world");
And it will give you String: "\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u006f\u0072\u006c\u0064"
and then you can do this:
String s
    = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString("\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u0077\\u006f\\u0072\\u006c\\u0064");
(Note the doubled backslashes: in Java source code the compiler translates \u escapes itself, so a literal backslash-u sequence must be written as \\u.)
And it will return you the String "Hello world". It works with any language. Here is the link to the article that explains where to get the library: Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison. Look for the paragraph titled "String Unicode converter".
Here is the link to the Maven artifacts, and here is a link to GitHub with sources and javadoc included. Here is the javadoc.
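For readers who prefer not to add a dependency, the round trip the library performs can be approximated in plain Java. This is only a sketch; toUnicodeSequence and fromUnicodeSequence are hypothetical names, not the MgntUtils API, and the decoder assumes a well-formed sequence of \uXXXX escapes:

```java
public class UnicodeRoundTrip {

    // hypothetical stand-in for encodeStringToUnicodeSequence:
    // every UTF-16 code unit becomes one \uXXXX escape
    static String toUnicodeSequence(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // hypothetical stand-in for decodeUnicodeSequenceToString:
    // parses consecutive 6-character \uXXXX groups back into chars
    static String fromUnicodeSequence(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 6 <= s.length(); i += 6) {
            sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String encoded = toUnicodeSequence("Hello world");
        System.out.println(encoded);                      // \u0048\u0065...
        System.out.println(fromUnicodeSequence(encoded)); // Hello world
    }
}
```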