Question
Assume a client application that uses a FileSplit object in order to read the actual bytes from the corresponding file. To do so, an InputStream object has to be created from the FileSplit, via code like:
FileSplit split = ... // The FileSplit reference
FileSystem fs = ...   // The HDFS reference
FSDataInputStream fsin = fs.open(split.getPath());
long start = split.getStart() - 1; // Byte before the split's first byte
if (start >= 0) {
    fsin.seek(start);
}
The adjustment of the stream by -1 is present in some scenarios, such as the Hadoop MapReduce LineRecordReader class. However, the documentation of the FSDataInputStream seek() method says explicitly that, after seeking to a location, the next read will be from that location, which seems to mean that the code above will be 1 byte off.
So, the question is, would that "-1" adjustment be necessary for all InputSplit reading cases?
By the way, to read a FileSplit correctly, seeking to its start is not enough, because every split also has an end that may not coincide with the end of the actual HDFS file. So the corresponding InputStream should be "bounded", i.e. have a maximum length, like the following:
InputStream is = new BoundedInputStream(fsin, split.getLength());
In this case, after the "native" fsin stream has been created as above, the org.apache.commons.io.input.BoundedInputStream class is used to implement the "bounding".
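For reference, here is a minimal sketch that puts the two pieces together and reads exactly the bytes of a split, with no -1 adjustment (the helper name openSplit is mine, purely for illustration):

import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.io.input.BoundedInputStream;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical helper: open a stream that yields exactly the bytes of a split.
static InputStream openSplit(FileSystem fs, FileSplit split) throws IOException {
    FSDataInputStream fsin = fs.open(split.getPath());
    fsin.seek(split.getStart()); // next read returns the split's first byte
    return new BoundedInputStream(fsin, split.getLength()); // stop at the split's end
}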
UPDATE
Apparently the adjustment is necessary only for use cases like the one of the LineRecordReader class, which may exceed the boundaries of a split to make sure that it reads the full last line.
A good discussion with more details on this can be found in an earlier question and in the comments for MAPREDUCE-772.
Answer 1:
Seeking to position 0 will mean the next call to InputStream.read() will read byte 0. Seeking to position -1 will most probably throw an exception.
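A tiny demo to convince yourself (the path /tmp/seek-demo.txt is hypothetical; assume it holds the bytes "0123456789"):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
try (FSDataInputStream in = fs.open(new Path("/tmp/seek-demo.txt"))) {
    in.seek(5);
    System.out.println((char) in.read()); // prints '5' - the byte at the seek position
}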
Where specifically are you referring to when you talk about the standard pattern in examples and source code?
Splits are not necessarily bounded as you note - take TextInputFormat, for example, with files that can be split. The record reader that processes the split will:
- Seek to the start index, then find the next newline character
- Find the next newline character (or EOF) and return that 'line' as the next record
This repeats until either the next newline found is past the end of the split, or the EOF is found. So you see that in this case the actual bounds of a split might be right-shifted from those given by the InputSplit.
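In code, a simplified version of that loop might look like the sketch below (not the actual LineRecordReader source, and it omits the skip-first-line handling covered in the update; assumes an uncompressed, splittable text file):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

static void readSplitRecords(FSDataInputStream fileIn, long start, long end,
                             Configuration conf) throws IOException {
    fileIn.seek(start);
    LineReader in = new LineReader(fileIn, conf);
    Text line = new Text();
    long pos = start;
    while (pos < end) {                // loop while the line *starts* inside the split
        int consumed = in.readLine(line);
        if (consumed == 0) break;      // EOF
        pos += consumed;               // the last line may extend past 'end',
        // process 'line' ...          // right-shifting the split's effective bound
    }
}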
Update
Referencing this code block from LineRecordReader:
if (codec != null) {
  in = new LineReader(codec.createInputStream(fileIn), job);
  end = Long.MAX_VALUE;
} else {
  if (start != 0) {
    skipFirstLine = true;
    --start;
    fileIn.seek(start);
  }
  in = new LineReader(fileIn, job);
}
if (skipFirstLine) { // skip first line and re-establish "start".
  start += in.readLine(new Text(), 0,
                       (int) Math.min((long) Integer.MAX_VALUE, end - start));
}
The --start statement is most probably there to avoid the split starting on a newline character and returning an empty line as the first record. You can see that if the seek occurs, the first line is skipped, to ensure the file splits don't return overlapping records.
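To make this concrete, here is a toy simulation (hypothetical data; a ByteArrayInputStream stands in for the HDFS stream):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

public class SkipFirstLineDemo {
    public static void main(String[] args) throws Exception {
        // Two lines, with the second split starting exactly at byte 4 ("bbb").
        byte[] data = "aaa\nbbb\n".getBytes(StandardCharsets.UTF_8);
        long start = 4, end = 8;

        ByteArrayInputStream fileIn = new ByteArrayInputStream(data);
        --start;                      // back up to the byte before the split
        fileIn.skip(start);           // stand-in for fileIn.seek(start) on HDFS
        LineReader in = new LineReader(fileIn, new Configuration());

        Text line = new Text();
        long pos = start + in.readLine(line); // consumes just the '\n' at byte 3
        while (pos < end) {
            int consumed = in.readLine(line);
            if (consumed == 0) break;         // EOF
            System.out.println(line);         // prints "bbb"
            pos += consumed;
        }
    }
}

With start left at 4 and no decrement, the readLine in the skip step would have consumed "bbb" itself, silently dropping that record.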
Source: https://stackoverflow.com/questions/16180130/hadoop-filesplit-reading