I am reinventing the wheel and creating my own JSON parse methods in Java.
I am going by the (very nice!) documentation on json.org. The only part I am unsure about is where it says "or control character".
Since the documentation is so clear, and JSON is so simple and easy to implement, I thought I would go ahead and follow the spec strictly instead of being lenient.
How would I correctly strip out control characters in Java? Perhaps there is a Unicode range?
Edit: A (commonly?) missing piece of the puzzle
I have been informed that there are other control characters outside of the defined range that can be troublesome in <script> tags.
Most notably the characters U+2028 and U+2029, Line and Paragraph Separator, which act as newlines. Injecting a newline into the middle of a string literal will most likely cause a syntax error (unterminated string literal).
Though I believe this does not pose an XSS threat, it is still a good idea to add extra rules for the use in <script> tags.
- Just be simple and encode all non-"ASCII printable" characters with \u notation. Those characters are uncommon to begin with. If you like, you could add to the white-list, but I do recommend a white-list approach.
- In case you are not aware, do not forget about </script (not case sensitive), which could cause HTML script injection to your page with the characters </script><script src=http://tinyurl.com/abcdef>. None of those characters are by default encoded in JSON.
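The white-list idea from the two points above could look roughly like the sketch below. The class and method names (ScriptSafeJson, quote) are made up for illustration, and escaping every < is just one assumed way to keep </script from ever appearing in the output; JSON itself does not require it. The loop works on UTF-16 code units, so a character outside the BMP comes out as two \u escapes, which is still valid JSON.

```java
final class ScriptSafeJson {

    /**
     * Hypothetical helper: wraps s in double quotes and escapes everything
     * that is not on a small white-list of printable ASCII characters.
     */
    static String quote(String s) {
        StringBuilder out = new StringBuilder(s.length() + 2);
        out.append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                // JSON requires these two to be escaped with a backslash.
                out.append('\\').append(c);
            } else if (c >= 0x20 && c <= 0x7E && c != '<') {
                // Printable ASCII passes through unchanged. '<' is left off
                // the white-list (an assumption of this sketch) so that a
                // closing script tag cannot appear in the output.
                out.append(c);
            } else {
                // Everything else (controls, '<', all non-ASCII, including
                // U+2028 and U+2029) becomes a four-digit hex escape.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        out.append('"');
        return out.toString();
    }
}
```

With this sketch, ScriptSafeJson.quote("</script>") gives "\u003c/script>" (including the surrounding quotes), and a U+2028 inside the input comes out as the six characters \u2028 instead of a raw line separator.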
Will Character.isISOControl(...) do? Incidentally, UTF-16 is an encoding of Unicode codepoints... Are you going to be operating at the byte level, or at the character/codepoint level? I recommend leaving the mapping from UTF-16 to character streams to Java's core APIs...
Even if it's not very specific, I would assume that they refer to the "control" character category from the Unicode specification.
In Java, you can check if a character c is a Unicode control character with the following expression: Character.getType(c) == Character.CONTROL.
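As a rough illustration of that expression, here is a small sketch. ControlCheck, isUnicodeControl and countControls are hypothetical names, and it uses the int code point overload of Character.getType rather than the char one. Note that U+2028 and U+2029 belong to the separator categories, so this check does not catch them.

```java
final class ControlCheck {

    /** True if the code point is in Unicode general category Cc (control). */
    static boolean isUnicodeControl(int codePoint) {
        return Character.getType(codePoint) == Character.CONTROL;
    }

    /** Counts the code points in the whole Unicode range that are category Cc. */
    static int countControls() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isUnicodeControl(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(isUnicodeControl(0x0A));   // line feed
        System.out.println(isUnicodeControl(0x2028)); // line separator, category Zl
        System.out.println(countControls());
    }
}
```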
I believe the Unicode definition of a control character is:
The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F.
That's their definition of a control code, but the above is followed by the sentence "Also known as control characters.", so...
I know the question was asked a couple of years ago, but I am replying anyway, because the accepted answer is not correct.
Character.isISOControl(int codePoint) does the following check:
(codePoint >= 0x00 && codePoint <= 0x1F) || (codePoint >= 0x7F && codePoint <= 0x9F);
The JSON specification (RFC 7159, https://tools.ietf.org/html/rfc7159) defines strings as follows:
Strings
The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
Character.isISOControl(int codePoint) will flag all control characters that need to be escaped (U+0000-U+001F), though it will also flag characters that do not need to be escaped (U+007F-U+009F). Quotation mark and reverse solidus still have to be checked separately.
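A check that follows the quoted RFC text more closely might look like the sketch below. JsonMustEscape and mustEscape are names I made up, and the method assumes it is handed a valid, non-negative code point.

```java
final class JsonMustEscape {

    /**
     * Hypothetical helper: true only for the characters the quoted RFC text
     * says must be escaped inside a JSON string, i.e. quotation mark,
     * reverse solidus, and the controls U+0000 through U+001F.
     */
    static boolean mustEscape(int codePoint) {
        return codePoint <= 0x1F || codePoint == '"' || codePoint == '\\';
    }

    public static void main(String[] args) {
        // U+007F (DEL) is an ISO control but may appear unescaped in JSON.
        System.out.println(Character.isISOControl(0x7F));
        System.out.println(mustEscape(0x7F));
    }
}
```

The two checks only disagree on U+007F-U+009F (flagged by isISOControl, not required by JSON) and on the quotation mark and reverse solidus (required by JSON, not flagged by isISOControl).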
Source: https://stackoverflow.com/questions/6051509/java-detect-control-characters-which-are-not-correct-for-json