Why is BufferedReader read() much slower than readLine()?

Asked by 不思量自难忘° on 2020-12-29 18:45

I need to read a file one character at a time and I'm using the read() method from BufferedReader.

I found that read() is much slower than readLine().

6 Answers
  • 2020-12-29 19:00

    Java JIT optimizes away empty loop bodies, so your loops actually look like this:

    while((c = fa.read()) != -1);
    

    and

    while((line = fa.readLine()) != null);
    

    I suggest you read up on how to write proper Java benchmarks and on how the JIT optimizes away such loops.


    As to why the time taken differs:

    • Reason one (this only applies if the bodies of the loops contain code): in the read() loop you do one operation per character, while in the readLine() loop you do one per line. This adds up the more lines/characters you have:

      while((c = fa.read()) != -1){
          //One operation per character.
      }
      
      while((line = fa.readLine()) != null){
          //One operation per line.
      }
      
    • Reason two: In the class BufferedReader, the method readLine() doesn't call read() behind the scenes - it uses its own code. readLine() performs fewer operations per character to read a line than it would take to read the same line with repeated read() calls - this is why readLine() is faster at reading an entire file.

    • Reason three: It takes more loop iterations to read a file character by character than line by line (unless every line contains a single character), so read() is called many more times than readLine().
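
    To make reason three concrete, here is a small sketch (my own example, using an in-memory StringReader as a stand-in for the file) that counts how many calls each loop makes:

    ```java
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;

    public class CallCount {
        // One read() call per character.
        static int countReadCalls(Reader in) throws IOException {
            int calls = 0;
            try (BufferedReader r = new BufferedReader(in)) {
                while (r.read() != -1) calls++;
            }
            return calls;
        }

        // One readLine() call per line.
        static int countReadLineCalls(Reader in) throws IOException {
            int calls = 0;
            try (BufferedReader r = new BufferedReader(in)) {
                while (r.readLine() != null) calls++;
            }
            return calls;
        }

        public static void main(String[] args) throws IOException {
            String text = "line one\nline two\nline three\n";
            System.out.println(countReadCalls(new StringReader(text)));     // 29
            System.out.println(countReadLineCalls(new StringReader(text))); // 3
        }
    }
    ```

    The ratio between the two counts is roughly the average line length - which is exactly the per-call overhead that readLine() amortizes.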

  • 2020-12-29 19:03

    The important thing when analyzing performance is to have a valid benchmark before you start. So let's start with a simple JMH benchmark that shows what our expected performance after warmup would be.

    One thing we have to consider is that modern operating systems like to cache file data that is accessed regularly, so we need some way to clear the caches between tests. On Windows there's a small utility (EmptyStandbyList.exe) that does just this; on Linux you can do the same by writing to the /proc/sys/vm/drop_caches pseudo-file.

    The code then looks as follows:

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Fork;
    import org.openjdk.jmh.annotations.Mode;
    
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    
    @BenchmarkMode(Mode.AverageTime)
    @Fork(1)
    public class IoPerformanceBenchmark {
        private static final String FILE_PATH = "test.fa";
    
        @Benchmark
        public int readTest() throws IOException, InterruptedException {
            clearFileCaches();
            int result = 0;
            try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
                int value;
                while ((value = reader.read()) != -1) {
                    result += value;
                }
            }
            return result;
        }
    
        @Benchmark
        public int readLineTest() throws IOException, InterruptedException {
            clearFileCaches();
            int result = 0;
            try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    result += line.chars().sum();
                }
            }
            return result;
        }
    
        private void clearFileCaches() throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder("EmptyStandbyList.exe", "standbylist");
            pb.inheritIO();
            pb.start().waitFor();
        }
    }
    

    and if we run it with

    chcp 65001 # set codepage to utf-8
    mvn clean install; java "-Dfile.encoding=UTF-8" -server -jar .\target\benchmarks.jar
    

    we get the following results (about 2 seconds are needed to clear the caches for me, and I'm running this on an HDD, so that's why it's a good deal slower than for you):

    Benchmark                            Mode  Cnt  Score   Error  Units
    IoPerformanceBenchmark.readLineTest  avgt   20  3.749 ± 0.039   s/op
    IoPerformanceBenchmark.readTest      avgt   20  3.745 ± 0.023   s/op
    

    Surprise! As expected there's no performance difference here at all after the JVM has settled into a steady state. But there is one outlier in the readTest method:

    # Warmup Iteration   1: 6.186 s/op
    # Warmup Iteration   2: 3.744 s/op
    

    which is exactly the problem you're seeing. The most likely reason I can think of is that OSR isn't doing a good job here, or that the JIT kicks in too late to make a difference for the first iteration.

    Depending on your use case this might be a big problem or negligible (if you're reading a thousand files it won't matter, if you're only reading one this is a problem).

    Solving such a problem is not easy and there are no general solutions, although there are ways to handle it. One easy test to see if we're on the right track is to run the code with the -Xcomp option, which forces HotSpot to compile every method on its first invocation. And indeed, doing so causes the large delay at the first invocation to disappear:

    # Warmup Iteration   1: 3.965 s/op
    # Warmup Iteration   2: 3.753 s/op
    

    Possible solution

    Now that we have a good idea what the actual problem is (my guess is still that all those locks are neither coalesced nor handled by the efficient biased-locking implementation), the solution is rather straightforward: reduce the number of method calls. Yes, we could have arrived at this solution without everything above, but it's always nice to have a good grip on the problem, and there might have been a solution that didn't involve changing much code.

    The following code runs consistently faster than either of the other two - you can play with the array size, but it's surprisingly unimportant (presumably because, contrary to the other methods, read(char[]) does not have to acquire a lock for every character, so the cost per call is lower to begin with).

    private static final int BUFFER_SIZE = 256;
    private char[] arr = new char[BUFFER_SIZE];
    
    @Benchmark
    public int readArrayTest() throws IOException, InterruptedException {
        clearFileCaches();
        int result = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
            int charsRead;
            while ((charsRead = reader.read(arr)) != -1) {
                for (int i = 0; i < charsRead; i++) {
                    result += arr[i];
                }
            }
        }
        return result;
    } 
    

    This is most likely good enough performance-wise, but if you wanted to improve performance even further, using a file mapping might help (I wouldn't count on too large an improvement in a case such as this, but if you know that your text is always ASCII, you could make some further optimizations).
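
    For what it's worth, here is a minimal sketch of the file-mapping idea (class and method names are mine; it assumes the file fits in a single mapping and contains ASCII, so one byte equals one character):

    ```java
    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MappedSum {
        // Sum the character values of an ASCII file via a memory-mapped
        // buffer, mirroring what the benchmarks above compute.
        static long sumChars(Path path) throws IOException {
            try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                long sum = 0;
                while (buf.hasRemaining()) {
                    sum += buf.get(); // one byte == one character for ASCII input
                }
                return sum;
            }
        }
    }
    ```

    The mapping avoids copying data into an intermediate char array, but for non-ASCII input you would still need to run the bytes through a charset decoder.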

  • 2020-12-29 19:12

    It is not surprising to see this difference if you think about it. One test is iterating the lines in a text file, while the other is iterating characters.

    Unless each line contains only one character, readLine() is expected to be way faster than read(). (Although, as pointed out in the comments above, it is arguable, since a BufferedReader buffers the input, so the physical file reading might not be the only expensive operation.)

    If you really want to test the difference between the two, I would suggest a setup where you iterate over each character in both tests, e.g. something like:

    void readTest(BufferedReader r) throws IOException
    {
        int c;
        StringBuilder b = new StringBuilder();
        while ((c = r.read()) != -1)
            b.append((char) c);
    }
    
    void readLineTest(BufferedReader r) throws IOException
    {
        String line;
        StringBuilder b = new StringBuilder();
        while ((line = r.readLine()) != null)
            for (int i = 0; i < line.length(); i++)
                b.append(line.charAt(i));
    }
    

    Besides the above, please use a Java performance diagnostic tool to benchmark your code. Also, read up on how to microbenchmark Java code.

  • 2020-12-29 19:13

    Thanks @Voo for the correction. What I mentioned below is correct from a FileReader#read() vs. BufferedReader#readLine() point of view, but not from a BufferedReader#read() vs. BufferedReader#readLine() point of view, so I have struck out the answer.

    Using the read() method on BufferedReader is not a good idea; it wouldn't cause you any harm, but it certainly defeats the purpose of the class.

    The whole purpose in life of BufferedReader is to reduce I/O by buffering the content. You can read about it in the Java tutorials. You may also notice that the read() method in BufferedReader is actually inherited from Reader, while readLine() is BufferedReader's own method.

    If you want to use the read() method, then I would say you'd better use FileReader, which is meant for that purpose. You can read about it in the Java tutorials.

    So, I think the answer to your question is very simple (without going into benchmarking and all those explanations):

    • Each read() is handled by the underlying OS and triggers disk access, network activity, or some other relatively expensive operation.
    • When you use readLine(), you save all these overheads, so readLine() will always be faster than read() - maybe not substantially for small data, but faster.
  • 2020-12-29 19:20

    So this is the practical answer to my own question: don't use BufferedReader.read(), use FileChannel instead. (Obviously I'm not answering the WHY I put in the title.) Here's the quick and dirty benchmark; hopefully others will find it useful:

    @Test
    public void testFileChannel() throws IOException {
    
        FileChannel fileChannel = FileChannel.open(Paths.get("chr1.fa"));
        long n = 0;
        int noOfBytesRead = 0;
    
        long t0 = System.nanoTime();
    
        while (noOfBytesRead != -1) {
            ByteBuffer buffer = ByteBuffer.allocate(10000);
            noOfBytesRead = fileChannel.read(buffer);
            buffer.flip();
            while (buffer.hasRemaining()) {
                char x = (char) buffer.get(); // cast is fine for ASCII input
                n++;
            }
        }
        long t1 = System.nanoTime();
        System.err.println((float) (t1 - t0) / 1e6); // ~ 250 ms
        System.err.println("nchars: " + n);          // 254235640 chars read
    }
    

    With ~250 ms to read the whole file char by char, this strategy is considerably faster than BufferedReader.readLine() (~700 ms), let alone read(). Adding if statements in the loop to check for x == '\n' and x == '>' makes little difference, and using a StringBuilder to reconstruct lines doesn't affect the timing much either. So this is plenty good for me (at least for now).
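
    For reference, here is a minimal sketch of the line-reconstruction variant mentioned above (the names and chunked-read structure are mine; it assumes ASCII input, like a FASTA file, where one byte is one character):

    ```java
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    public class ChannelLines {
        // Rebuild lines from a FileChannel, reading in 10000-byte chunks
        // and splitting on '\n' with a reusable StringBuilder.
        static List<String> readLines(Path path) throws IOException {
            List<String> lines = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            try (FileChannel ch = FileChannel.open(path)) {
                ByteBuffer buffer = ByteBuffer.allocate(10000);
                while (ch.read(buffer) != -1) {
                    buffer.flip();
                    while (buffer.hasRemaining()) {
                        char x = (char) buffer.get(); // ASCII only
                        if (x == '\n') {
                            lines.add(current.toString());
                            current.setLength(0);
                        } else {
                            current.append(x);
                        }
                    }
                    buffer.clear(); // reuse one buffer instead of reallocating
                }
            }
            if (current.length() > 0) lines.add(current.toString());
            return lines;
        }
    }
    ```

    Unlike the benchmark above, this reuses a single ByteBuffer instead of allocating one per chunk; per the timings quoted, the StringBuilder work barely affects the total.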

    Thanks to @Marco13 for mentioning FileChannel.

  • 2020-12-29 19:24

    According to the documentation:

    Every read() method call makes an expensive system call.

    Every readLine() method call still makes an expensive system call; however, it reads more bytes at once, so there are fewer calls overall.

    A similar situation happens when we issue a database update command for each record we want to update, versus a batch update where we make one call for all the records.
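
    The batching can be observed directly with a small sketch (the CountingReader helper is my own, not part of the answer): a hundred single-character read() calls on a BufferedReader turn into just a couple of bulk requests to the underlying Reader - fewer but larger calls, exactly like the batch update.

    ```java
    import java.io.BufferedReader;
    import java.io.FilterReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;

    // Counts how often the wrapped Reader is actually asked for data.
    public class CountingReader extends FilterReader {
        int calls = 0;

        public CountingReader(Reader in) { super(in); }

        @Override public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override public int read(char[] cbuf, int off, int len) throws IOException {
            calls++;
            return super.read(cbuf, off, len);
        }

        public static void main(String[] args) throws IOException {
            // 100 characters, read one at a time through a BufferedReader.
            CountingReader source = new CountingReader(new StringReader("a".repeat(100)));
            try (BufferedReader br = new BufferedReader(source)) {
                while (br.read() != -1) { /* one read() call per character */ }
            }
            // BufferedReader batched those 100 single-char reads into a
            // couple of bulk requests to the underlying Reader.
            System.out.println(source.calls); // 2
        }
    }
    ```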
