Java: detect control characters which are not correct for JSON

花落未央

I am reinventing the wheel and creating my own JSON parse methods in Java.

I am going by the (very nice!) documentation on json.org. The only part I am unsure about is w

4 Answers

野性不改

2021-02-07 11:45
I know this question was asked a couple of years ago, but I am replying anyway because the accepted answer is not correct.
```
Character.isISOControl(int codePoint) 
```
does the following check:
```
(codePoint >= 0x00 && codePoint <= 0x1F) || (codePoint >= 0x7F && codePoint <= 0x9F);
```
The JSON specification, RFC 7159 (https://tools.ietf.org/html/rfc7159), says in section 7:
7. Strings
  
  The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
```
Character.isISOControl(int codePoint) 
```
will flag every control character that needs to be escaped (U+0000-U+001F), but it will also flag U+007F-U+009F, which JSON does not require to be escaped.
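
To make the difference concrete, here is a minimal sketch of a check that follows the RFC text quoted above instead of relying on `Character.isISOControl`. The class and method names (`JsonStringChars`, `mustBeEscaped`) are made up for this illustration and are not part of any library.

```java
// Hypothetical helper for illustration only; not part of the JDK or any JSON library.
final class JsonStringChars {
    private JsonStringChars() {}

    /** True if the code point may not appear unescaped inside a JSON string (RFC 7159, section 7). */
    static boolean mustBeEscaped(int codePoint) {
        return (codePoint >= 0x00 && codePoint <= 0x1F) // control characters U+0000..U+001F
            || codePoint == '"'                         // quotation mark
            || codePoint == '\\';                       // reverse solidus
    }

    public static void main(String[] args) {
        System.out.println(mustBeEscaped('\n'));          // true
        System.out.println(mustBeEscaped(0x7F));          // false: DEL is legal unescaped in JSON
        System.out.println(Character.isISOControl(0x7F)); // true: isISOControl is broader
    }
}
```

A parser that rejects input based on `Character.isISOControl` would therefore refuse some strings that the RFC allows.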
梦谈多话

2021-02-07 11:54

Will Character.isISOControl(...) do? Incidentally, UTF-16 is an encoding of Unicode codepoints... Are you going to be operating at the byte level, or at the character/codepoint level? I recommend leaving the mapping from UTF-16 to character streams to Java's core APIs...
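
As a rough sketch of working at the character level: a `Reader` (for example an `InputStreamReader` with an explicit charset) does the byte decoding, so the scanning code only ever sees `char` values. The class `ControlCharScan` and its methods are hypothetical names, and the sketch assumes the input is the raw content of a single string literal rather than a whole JSON document. Because U+0000..U+001F can never be part of a surrogate pair, checking one `char` at a time is enough here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only.
final class ControlCharScan {
    private ControlCharScan() {}

    /** Returns the index of the first char in U+0000..U+001F, or -1 if there is none. */
    static int firstControlChar(Reader in) throws IOException {
        int index = 0;
        int c;
        while ((c = in.read()) != -1) { // read() returns -1 at end of stream
            if (c <= 0x1F) {
                return index;
            }
            index++;
        }
        return -1;
    }

    /** Convenience overload for in-memory text. */
    static int firstControlChar(String text) {
        try (Reader in = new StringReader(text)) {
            return firstControlChar(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstControlChar("ab\tc")); // 2
        System.out.println(firstControlChar("abc"));   // -1
    }
}
```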

借酒劲吻你

2021-02-07 11:54

Even if it's not very specific, I would assume that they refer to the "control" character category from the Unicode specification.

In Java, you can check if a character c is a Unicode control character with the following expression: Character.getType(c) == Character.CONTROL.
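
A small sketch of that expression in use, with a hypothetical wrapper named `isUnicodeControl`. Note that the Unicode control category (Cc) covers U+007F..U+009F as well as U+0000..U+001F, so this test is also broader than the range a JSON string must escape.

```java
// Hypothetical demo class for illustration only.
final class ControlCategoryDemo {
    private ControlCategoryDemo() {}

    /** True if the code point is in Unicode general category Cc (control). */
    static boolean isUnicodeControl(int codePoint) {
        return Character.getType(codePoint) == Character.CONTROL;
    }

    public static void main(String[] args) {
        System.out.println(isUnicodeControl('\n'));  // true  (U+000A)
        System.out.println(isUnicodeControl(0x7F));  // true  (DEL)
        System.out.println(isUnicodeControl(0x85));  // true  (NEXT LINE)
        System.out.println(isUnicodeControl('A'));   // false
    }
}
```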

醉话见心

2021-02-07 12:03

I believe the Unicode definition of a control character is:

The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F.

That's their definition of a control code, but the above is followed by the sentence "Also known as control characters.", so...
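
The figure of 65 is easy to check in Java. The sketch below (class and method names are made up for illustration) counts the code points that `Character.isISOControl` accepts and, for comparison, how many of them fall in the U+0000..U+001F range that a JSON string must escape.

```java
// Hypothetical demo class for illustration only.
final class ControlCharCount {
    private ControlCharCount() {}

    /** Counts all code points for which Character.isISOControl returns true. */
    static int countIsoControl() {
        int count = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.isISOControl(cp)) {
                count++;
            }
        }
        return count;
    }

    /** Counts the ISO control code points that lie in U+0000..U+001F. */
    static int countJsonControl() {
        int count = 0;
        for (int cp = 0x00; cp <= 0x1F; cp++) {
            if (Character.isISOControl(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countIsoControl());  // 65
        System.out.println(countJsonControl()); // 32
    }
}
```

The remaining 33 code points are U+007F..U+009F.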
