We often use fgetc like this:
    int c;
    while ((c = fgetc(file)) != EOF)
    {
        // do stuff
    }
Theoretically, if a byte in the file has the value of EOF, this code is buggy: it will break out of the loop early and fail to process the whole file. Is this situation possible?
As far as I understand, fgetc internally casts a byte read from the file to unsigned char and then to int, and returns it. This will work if the range of int is greater than that of unsigned char.
What happens if it's not (probably then sizeof(int) == 1)?
- Will fgetc sometimes read legitimate data equal to EOF from a file?
- Will it alter the data it read from the file to avoid the single value EOF?
- Will fgetc be an unimplemented function?
- Will EOF be of another type, like long?
I could make my code fool-proof by an extra check:
    int c;
    for (;;)
    {
        c = fgetc(file);
        if (feof(file))
            break;
        // do stuff
    }
Is it necessary if I want maximum portability?
Yes, c = fgetc(file); if (feof(file)) does work for maximum portability. It works in general and also when unsigned char and int have the same number of unique values. This occurs on rare platforms where char, signed char, unsigned char, short, unsigned short, int, and unsigned all have the same bit width and the same width of range.
Note that checking feof(file) alone is insufficient. Code should also check for ferror(file).
    int c;
    for (;;)
    {
        c = fgetc(file);
        if (c == EOF) {
            if (feof(file)) break;
            if (ferror(file)) break;
        }
        // do stuff
    }
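As a minimal sketch of the loop above in a complete function: the helper name count_chars is hypothetical (it is not part of the answer), and it assumes the caller passes a valid stream opened for reading. It treats EOF as end of input only when feof or ferror confirms it, so a data character that happens to compare equal to EOF would still be counted.

```c
#include <stdio.h>

/* Hypothetical helper: counts the characters in a stream using the
   "check c == EOF, then feof/ferror" pattern.
   Returns the count, or -1 if a read error occurred.
   Assumes file is a valid, readable stream. */
long count_chars(FILE *file)
{
    long count = 0;

    for (;;) {
        int c = fgetc(file);
        if (c == EOF) {
            if (ferror(file))
                return -1;   /* read error */
            if (feof(file))
                break;       /* genuine end of file */
            /* neither indicator set: a data character equal to EOF */
        }
        count++;
    }
    return count;
}
```

A caller could open a file with fopen, pass it to count_chars, and treat a return value of -1 as a read error.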
The C specification says that int must be able to hold values from -32767 to 32767 at a minimum. Any platform with a smaller int is nonstandard.
The C specification also says that EOF is a negative int constant and that fgetc returns "an unsigned char converted to an int" in the event of a successful read. Since unsigned char can't have a negative value, the value of EOF can be distinguished from anything read from the stream.*
*See below for a loophole case in which this fails to hold.
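To illustrate the common case, here is a small sketch that writes a single byte to a temporary file and reads it back with fgetc. The helper name roundtrip_byte is hypothetical, and the sketch assumes a platform where UCHAR_MAX ≤ INT_MAX and where tmpfile succeeds. On such a platform the byte 0xFF comes back as a non-negative int rather than as EOF.

```c
#include <stdio.h>

/* Hypothetical helper: writes one byte to a temporary file, reads it back
   with fgetc, and stores the returned int in *out.
   Returns 0 on success, -1 if the temporary file could not be used. */
int roundtrip_byte(unsigned char byte, int *out)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return -1;
    if (fputc(byte, f) == EOF) {
        fclose(f);
        return -1;
    }
    rewind(f);
    *out = fgetc(f);   /* unsigned char converted to int */
    fclose(f);
    return 0;
}
```

With 8-bit bytes, calling roundtrip_byte(0xFF, &v) should leave v holding 255, which cannot compare equal to a negative EOF.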
Relevant standard text (from C99):
§5.2.4.2.1 Sizes of integer types <limits.h>: [The] implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
[...]
- minimum value for an object of type int: INT_MIN -32767
- maximum value for an object of type int: INT_MAX +32767
§7.19.1 <stdio.h> - Introduction: EOF ... expands to an integer constant expression, with type int and a negative value, that is returned by several functions to indicate end-of-file, that is, no more input from a stream
§7.19.7.1 The fgetc function: If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined)
If UCHAR_MAX ≤ INT_MAX, there is no problem: all unsigned char values will be converted to non-negative integers, so they will be distinct from EOF.
Now, there is a funny sort of loophole here: if a system has UCHAR_MAX > INT_MAX, then that system is legally allowed to convert values greater than INT_MAX to negative integers (per §6.3.1.3, the result of converting a value to a signed type that cannot represent that value is implementation-defined), making it possible for a character read from a stream to be converted to EOF.
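The loophole cannot be reproduced directly on a platform with 8-bit bytes, but the same kind of conversion can be sketched by analogy using unsigned int in place of a wide unsigned char. This is only an illustration: the function name is hypothetical, and the result of the conversion is implementation-defined, so a negative result is typical of two's complement implementations rather than guaranteed.

```c
#include <limits.h>

/* Hypothetical illustration: converts an unsigned value above INT_MAX
   to int and reports whether the result is negative. */
int all_ones_converts_negative(void)
{
    unsigned int widest = UINT_MAX;  /* stand-in for a character value > INT_MAX */
    int converted = (int)widest;     /* implementation-defined result (C99 6.3.1.3) */
    return converted < 0;
}
```

If unsigned char were as wide as int, fgetc would perform this kind of conversion on every character above INT_MAX, and one of the resulting negative values could coincide with EOF.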
Systems with CHAR_BIT > 8 do exist (e.g. the TI C4x DSP, which apparently uses 32-bit bytes), although I'm not sure if they are broken with respect to EOF and stream functions.
NOTE: chux's answer is the correct one in the most general case. I'm leaving this answer up because I believe both the answer and the discussion in the comments are valuable in understanding the (rare) situations in which chux's approach is necessary.
EOF is guaranteed to have a negative value (C99 7.19.1), and as you mentioned, fgetc reads its input as an unsigned char before converting to int. So those by themselves guarantee that EOF can't be read from a file.
As for your specific questions:
fgetc can't read a legitimate datum equal to EOF. In the file, there's no such thing as signed or unsigned; it's just bit sequences. It's C that interprets 1000 1111 differently depending on whether it's being treated as signed or unsigned. fgetc is required to treat it as unsigned, so negative numbers (other than EOF) cannot be returned.
Addendum: It can't read EOF for the unsigned char part, but when it converts the unsigned char to an int, if the int is not capable of representing all values of the unsigned char, then the behavior is implementation-defined (6.3.1.3).
fgetc is required by the standard for hosted implementations, but freestanding implementations are permitted to omit most of the standard library functions (some are apparently required, but I couldn't find the list.)
EOF won't require a long, since fgetc needs to be able to return it and fgetc returns an int.
As far as altering the data goes, it can't change the value exactly, but since fgetc is specified to read "characters" from the file as opposed to chars, it could potentially read in 8 bits at a time even if the system otherwise defines CHAR_BIT to be 16 (which is the minimum value it could have if sizeof(int) == 1, since INT_MIN <= -32767 and INT_MAX >= 32767 are required by 5.2.4.2). In that case, the input character would be converted to an unsigned char whose high bits are always 0. It could then be converted to int without losing precision. (In practice, this just won't come up, since machines don't generally have 16-bit bytes.)
Source: https://stackoverflow.com/questions/32641375/is-it-possible-to-confuse-eof-with-a-normal-byte-value-when-using-fgetc