In Java: why do some Stream methods take int instead of byte or even char?

Question

Why do some methods that write bytes/chars to streams take an int instead of a byte/char?

Someone told me, in the case of int instead of char: a char in Java is just 2 bytes long, which is fine for most characters in use, but certain characters (Chinese or whatever) are represented in more than 2 bytes, and hence int is used instead.

How close is this explanation to the truth?

EDIT: I use the word "stream" to cover both binary and character streams (not just binary streams).

Thanks.

Answer 1:

"Someone told me, in the case of int instead of char: a char in Java is just 2 bytes long, which is fine for most characters in use, but certain characters (Chinese or whatever) are represented in more than 2 bytes, and hence int is used instead."

Assuming that at this point you are talking specifically about the Reader.read() method, the statement from "someone" that you have recounted is in fact incorrect.

It is true that some Unicode codepoints have values greater than 65535 and therefore cannot be represented as a single Java char. However, the Reader API actually produces a sequence of Java char values (or -1), not a sequence of Unicode codepoints. This is clearly stated in the javadoc.

If your input includes a (suitably encoded) Unicode code point that is greater than 65535, then you will actually need to call the read() method twice to see it. What you will get will be a UTF-16 surrogate pair; i.e. two Java char values that together represent the codepoint. In fact, this fits in with the way that the Java String, StringBuilder and StringBuffer classes all work; they all use a UTF-16 based representation ... with embedded surrogate pairs.
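To see this concretely, here is a minimal sketch. The class and method names are made up for illustration, and U+1F600 is just an arbitrary code point above 65535. It calls read() three times on a StringReader holding that single code point: the first two calls should yield the two halves of the surrogate pair, and the third yields -1.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class SurrogatePairDemo {
    // U+1F600 does not fit in 16 bits, so a Java String stores it as two chars.
    static final String TEXT = new String(Character.toChars(0x1F600));

    // Returns the raw results of three successive read() calls.
    static int[] readThreeTimes() {
        try (Reader in = new StringReader(TEXT)) {
            return new int[] { in.read(), in.read(), in.read() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] values = readThreeTimes();
        char high = (char) values[0];
        char low = (char) values[1];
        // prints: first=D83D second=DE00 third=-1
        System.out.printf("first=%04X second=%04X third=%d%n", values[0], values[1], values[2]);
        // prints: code point=U+1F600
        System.out.printf("code point=U+%X%n", Character.toCodePoint(high, low));
    }
}
```

Character.toCodePoint(high, low) is what recombines the two char values into the original code point.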

The real reason that Reader.read() returns an int not a char is to allow it to return -1 to signal that there are no more characters to be read. The same logic explains why InputStream.read() returns an int not a byte.
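A small sketch of why the extra range matters (the class and method names are hypothetical): every char value from 0 to 65535 is a legitimate result, so the end-of-stream marker has to live outside that range. If read() returned a char, the value that -1 narrows to would be indistinguishable from a real character.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class EndOfStreamDemo {
    // Returns whatever the first read() call on the given text produces.
    static int firstRead(String text) {
        try (Reader in = new StringReader(text)) {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts chars with the usual loop: keep reading until -1 comes back.
    static int countChars(String text) {
        try (Reader in = new StringReader(text)) {
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstRead("\uFFFF"));          // prints 65535 (a real char)
        System.out.println(firstRead(""));                // prints -1 (end of stream)
        System.out.println((char) (-1) == '\uFFFF');      // prints true: a char cannot tell them apart
        System.out.println(countChars("ab\uFFFFcd"));     // prints 5
    }
}
```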

Hypothetically, I suppose that the Java designers could have specified that the read() methods throw an exception to signal the "end of stream" condition. However, that would have just replaced one potential source of bugs (failure to test the result) with another (failure to deal with the exception). Besides, exceptions are relatively expensive, and an end of stream is not really an unexpected / exceptional event. In short, the current approach is better, IMO.

(Another clue to the 16 bit nature of the Reader API is the signature of the read(char[], ...) method. How would that deal with codepoints greater than 65535 if surrogate pairs weren't used?)

EDIT

The case of DataOutputStream.writeChar(int) does seem a bit strange. However, the javadoc clearly states that the argument is written as a 2 byte value. And in fact, the implementation clearly writes only the bottom two bytes to the underlying stream.

I don't think that there is a good reason for this. Anyway, there is a bug database entry for this (4957024), which is marked as "11-Closed, Not a Defect" with the following comment:

"This isn't a great design or excuse, but it's too baked in for us to change."

... which is kind of an acknowledgement that it is a defect, at least from the design perspective.

But this is not something worth making a fuss about, IMO.
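The writeChar(int) behaviour described in the EDIT can be checked with a short sketch like the following. The class name and the helper method are made up for illustration, and 0x12345 is an arbitrary value that does not fit in 16 bits.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WriteCharDemo {
    // Writes one value with DataOutputStream.writeChar(int) and returns the bytes produced.
    static byte[] bytesFor(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeChar(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] result = bytesFor(0x12345);
        // prints: 2 bytes: 23 45 (the leading 0x1 is silently dropped)
        System.out.printf("%d bytes: %02X %02X%n", result.length, result[0], result[1]);
    }
}
```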

Answer 2:

I'm not sure exactly what you're referring to but perhaps you are thinking of InputStream.read()? It returns an integer instead of a byte because the return value is overloaded to also represent end of stream, which is represented as -1. Since there are 257 different possible return values a byte is insufficient.
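As an illustration of the 257-value point, here is a sketch (hypothetical class and method names) that counts the bytes in a small array twice: once keeping the result of read() as an int, and once narrowing it to byte before testing for -1. The narrowed version stops early, because the data byte 0xFF becomes -1 after the cast.

```java
import java.io.ByteArrayInputStream;

class ByteReadDemo {
    // Correct loop: compare the int result with -1 before any narrowing.
    static int countCorrectly(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    // Buggy loop: after the cast, the data byte 0xFF looks exactly like end of stream.
    static int countWithByteCast(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int count = 0;
        while ((byte) in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = { 0x41, (byte) 0xFF, 0x42 };
        System.out.println(countCorrectly(data));     // prints 3
        System.out.println(countWithByteCast(data));  // prints 1
    }
}
```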

Otherwise, perhaps you could come up with some more specific examples.

Answer 3:

There are a few possible explanations.

First, as a couple of people have noted, it might be because read() necessarily returns an int, and so it can be seen as elegant to have write() accept an int to avoid casting:

int read = in.read();
if (read != -1)
    out.write(read);
// vs. what would be needed if write() took a byte:
if (read != -1)
    out.write((byte) read);

Second, it might just be nice to avoid other cases of casting:

// write a char (big-endian)
char c = 'x'; // any char value
out.write(c >> 8);
out.write(c);

// vs. what would be needed if write() took a byte:
out.write((byte) (c >> 8));
out.write((byte) c);

Answer 4:

It's correct that the maximum possible code point is 0x10FFFF, which doesn't fit in a char. However, the stream methods are byte-oriented, while the writer methods are 16-bit. OutputStream.write(int) writes a single byte, and Writer.write(int) only looks at the low-order 16 bits.
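A quick way to check both claims is a sketch like this one. The class and method names are made up, and ByteArrayOutputStream and StringWriter are used only because they make the result easy to inspect.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringWriter;

class LowBitsDemo {
    // OutputStream.write(int): only the low-order 8 bits reach the stream.
    static byte[] writeToStream(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(value);
        return out.toByteArray();
    }

    // Writer.write(int): only the low-order 16 bits are used, as a single char.
    static String writeToWriter(int value) {
        StringWriter out = new StringWriter();
        out.write(value);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(writeToStream(0x141)[0]);    // prints 65 (0x41, the 0x100 bit is dropped)
        System.out.println(writeToWriter(0x10041));     // prints A (0x0041, the 0x10000 bit is dropped)
    }
}
```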

Answer 5:

In Java, Streams are for raw bytes. To write characters, you wrap a Stream in a Writer.

While Writers do have write(int) (which writes the 16 low bits; it's an int because byte is too small, and short is too small due to it being signed), you should be using write(char[]) or write(String) instead.
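A minimal sketch of that wrapping, assuming UTF-8 as the target encoding (the class and method names are hypothetical). Passing a String lets the Writer see both halves of a surrogate pair, so a code point above 65535 is encoded properly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class WriterWrapDemo {
    // Wraps a byte stream in a Writer and writes the text with write(String).
    static byte[] encode(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // U+1F600 is two Java chars, but a single 4-byte sequence in UTF-8.
        String text = new String(Character.toChars(0x1F600));
        System.out.println(text.length());          // prints 2
        System.out.println(encode(text).length);    // prints 4
    }
}
```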

Answer 6:

Probably to be symmetric with the read() method, which returns an int. Nothing serious.

Source: https://stackoverflow.com/questions/3152757/in-java-why-some-stream-methods-take-int-instead-of-byte-or-even-char

Tags

java

character-encoding

streaming

iostream