How to read files in multithreaded mode?

深忆病人 2021-01-06 00:40

I currently have a program that reads a very large file in a single thread and builds a search index from it, but single-threaded indexing takes too long.

4 Answers
  • 2021-01-06 00:51

    Your bottleneck is most likely the indexing, not the file reading. Assuming your indexing system supports multiple threads, you probably want a producer/consumer setup: one thread reads the file and pushes each line into a BlockingQueue (the producer), while multiple threads pull lines from the queue and add them to the index (the consumers). A sketch follows.
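
    A minimal sketch of that setup, not a drop-in implementation: the queue capacity, the consumer count, and the index(String) method are placeholder assumptions you would replace with your own indexing code.

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class IndexingPipeline {
        // Sentinel that tells a consumer to shut down (assumes real
        // lines never equal this value).
        private static final String POISON_PILL = "\u0000__EOF__\u0000";
        private static final int CONSUMERS = 4; // tune to your machine

        public static void main(String[] args) throws Exception {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

            // Consumers: pull lines off the queue and index them.
            Thread[] workers = new Thread[CONSUMERS];
            for (int i = 0; i < CONSUMERS; i++) {
                workers[i] = new Thread(() -> {
                    try {
                        String line;
                        while (!(line = queue.take()).equals(POISON_PILL)) {
                            index(line);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                workers[i].start();
            }

            // Producer: a single thread reads the file, keeping the
            // disk access pattern sequential.
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    queue.put(line); // blocks when consumers fall behind
                }
            }

            // One poison pill per consumer so every worker terminates.
            for (int i = 0; i < CONSUMERS; i++) {
                queue.put(POISON_PILL);
            }
            for (Thread worker : workers) {
                worker.join();
            }
        }

        private static void index(String line) {
            // placeholder: push the line into your search index here
        }
    }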

  • 2021-01-06 00:57

    If you can use Java 8, you may be able to do this quickly and easily using the Streams API. Read the file into a MappedByteBuffer, which can map a file of up to 2 GB very quickly, then stream the lines out of the buffer (you need to make sure your JVM has enough extra memory to hold a heap copy of the file):

    package com.objective.stream;
    
    import java.io.BufferedReader;
    import java.io.ByteArrayInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    
    public class StreamsFileProcessor {
        private MappedByteBuffer buffer;
    
        public static void main(String[] args){
            if (args.length > 0){
                Path myFile = Paths.get(args[0]);
                StreamsFileProcessor proc = new StreamsFileProcessor();
                try {
                    proc.process(myFile);
                } catch (IOException e) {
                    e.printStackTrace();
                }   
            }
        }
    
        public void process(Path file) throws IOException {
            readFileIntoBuffer(file);
            // Consume the stream inside try-with-resources: the reader must
            // stay open until the parallel pipeline has finished.
            try (BufferedReader reader = newBufferedReader()){
                reader.lines()
                      .parallel()
                      .forEach(this::doIndex);
            }
        }
    
        private BufferedReader newBufferedReader() {
            // A MappedByteBuffer is a direct buffer, so array() would throw
            // UnsupportedOperationException; copy the bytes to the heap instead.
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return new BufferedReader(new InputStreamReader(
                    new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
        }
    
        private void readFileIntoBuffer(Path file) throws IOException{
            try(FileInputStream fis = new FileInputStream(file.toFile())){
                FileChannel channel = fis.getChannel();
                // A FileInputStream channel is read-only, so READ_ONLY is the
                // only valid map mode here (PRIVATE needs a read-write channel).
                buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            }
        }
    
        private void doIndex(String s){
            // Do whatever I need to do to index the line here
        }
    }
    
  • 2021-01-06 01:02

    See this thread: if your files are all on the same physical disk, you can't do better than reading them with a single thread, although it may be possible to process the files with multiple threads once you've read them into main memory. A sketch of that split is below.
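
    A minimal sketch of that approach, assuming the whole file fits comfortably in heap memory; index(String) is a hypothetical placeholder for the real indexing call:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class ReadThenProcess {
        public static void main(String[] args) throws IOException {
            // Single-threaded read: one sequential pass over the disk.
            List<String> lines = Files.readAllLines(Paths.get(args[0]));

            // Parallel processing once everything is in main memory.
            lines.parallelStream().forEach(ReadThenProcess::index);
        }

        private static void index(String line) {
            // placeholder: index the line here
        }
    }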

  • 2021-01-06 01:02

    First, I agree with @Zim-Zam that it is the file IO, not the indexing, that is likely the rate-determining step. (So I disagree with @jtahlborn.) How much it matters depends on how complex the indexing is.

    Second, in your code each thread has its own, independent BufferedReader, so each thread will read the entire file. One possible fix is to share a single BufferedReader across the threads, in which case you need to synchronize calls to BufferedReader.readLine(), since the javadocs are silent on whether BufferedReader is thread-safe. And since I think the IO is the bottleneck, the shared reader will itself become the bottleneck, so I doubt multithreading will gain you much. But give it a try; I have been wrong occasionally. :-) A sketch of the shared-reader idea follows.
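
    For illustration, a minimal sketch of the shared-reader idea, with the caveats above; index(String) is a hypothetical placeholder for the real indexing call:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SharedReaderIndexer {
        private final BufferedReader reader;

        public SharedReaderIndexer(BufferedReader reader) {
            this.reader = reader;
        }

        // Synchronized so only one thread advances the shared reader at a time.
        private synchronized String nextLine() throws IOException {
            return reader.readLine();
        }

        public void run(int threadCount) throws InterruptedException {
            Thread[] workers = new Thread[threadCount];
            for (int i = 0; i < threadCount; i++) {
                workers[i] = new Thread(() -> {
                    try {
                        String line;
                        while ((line = nextLine()) != null) {
                            index(line);
                        }
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                });
                workers[i].start();
            }
            for (Thread worker : workers) {
                worker.join();
            }
        }

        private void index(String line) {
            // placeholder: index the line here
        }

        public static void main(String[] args) throws Exception {
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
                new SharedReaderIndexer(reader).run(4);
            }
        }
    }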

    P.S. I agree with @jtahlborn that a producer/consumer pattern is better than my shared-BufferedReader idea, but it would be much more work for you.
