How to split a CSV file into multiple chunks and read those chunks in parallel in Java code

别那么骄傲 2021-01-02 05:02

I have a very large CSV file (1 GB+) containing 100,000 lines.

I need to write a Java program to parse each line from the CSV file to create a body for an HTTP request to s

6 answers
  • 2021-01-02 05:35

    If you're looking to unzip and parse in the same operation, have a look at https://github.com/skjolber/unzip-csv.

  • Java 8 added improved support for this through parallel streams and lambdas. Oracle's tutorial on parallel streams might be a good starting point.

    Note that a pitfall here is too much parallelism. For the example of retrieving URLs, it is likely a good idea to keep the number of parallel calls low. Too much parallelism can affect not only bandwidth and the web site you are connecting to; you also risk running out of file descriptors, which are a strictly limited resource in most environments where Java runs.

    Some frameworks that may help you are Netflix's RxJava and Akka. Be aware that these frameworks are not trivial and will take some effort to learn.
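
As a rough illustration of the parallel-stream approach, here is a minimal sketch. The class name `ParallelLines` and the method `toRequestBody` are hypothetical stand-ins (the real request format is not given in the question), and the comma split is deliberately naive. A parallel stream runs on the common fork/join pool, whose default size is derived from the number of available processors, so parallelism stays modest unless you change it.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class ParallelLines {
    // Hypothetical stand-in for building an HTTP request body from one CSV line.
    static String toRequestBody(String line) {
        String[] fields = line.split(",", -1); // naive split: does not handle quoted commas
        return "{\"id\":\"" + fields[0] + "\",\"fieldCount\":" + fields.length + "}";
    }

    // Lines are pulled from the reader in order; the mapping step may run in parallel.
    static List<String> buildBodies(BufferedReader reader) {
        return reader.lines()
                .parallel()
                .map(ParallelLines::toRequestBody)
                .collect(Collectors.toList()); // the result keeps the file's line order
    }

    public static void main(String[] args) {
        // In-memory sample data instead of a real file, to keep the sketch self-contained.
        BufferedReader reader = new BufferedReader(new StringReader("1,a,b\n2,c\n3,d,e,f\n"));
        buildBodies(reader).forEach(System.out::println);
    }
}
```

For a real file, `Files.lines(path)` could replace the in-memory reader. Whether this actually beats a plain loop depends on how expensive the per-line work is.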

  • 2021-01-02 05:40

    Reading a single file at multiple positions concurrently wouldn't let you go any faster (but it could slow you down considerably).

    Instead of reading the file from multiple threads, read the file from a single thread, and parallelize the processing of the lines. A single thread should read your CSV line by line and put each line in a queue. Multiple worker threads should then take the next line from the queue, parse it, convert it to a request, and process the request concurrently as needed. The splitting of the work is then done by a single thread, which ensures that there are no missing lines or overlaps.
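
The reader-plus-queue design described above could be sketched roughly as follows. This is only an illustration under assumptions: the class name `QueuePipeline`, the `handle` method, the queue capacity of 1000, and the end-of-input marker are all hypothetical choices, and `handle` is left empty where the parsing and HTTP call would go.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class QueuePipeline {
    // Unique marker object telling a worker that no more lines will arrive (an assumption of this sketch).
    private static final String POISON = new String("<<EOF>>");

    // Hypothetical placeholder: parse the line, build the request, send it.
    static void handle(String line) { }

    // Reads all lines on the calling thread; workerCount threads consume them. Returns lines processed.
    static int run(BufferedReader reader, int workerCount) throws IOException, InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000); // bounded, so the reader waits if workers lag
        AtomicInteger processed = new AtomicInteger();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        String line = queue.take();
                        if (line == POISON) return; // identity check against the marker object
                        handle(line);
                        processed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            workers.add(t);
        }
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                queue.put(line); // blocks while the queue is full
            }
        } finally {
            // One marker per worker so every worker stops, even if reading failed.
            for (int i = 0; i < workerCount; i++) queue.put(POISON);
        }
        for (Thread t : workers) t.join();
        return processed.get();
    }

    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new StringReader("a,1\nb,2\nc,3\n"));
        System.out.println(run(reader, 2) + " lines processed");
    }
}
```

The bounded queue is the important design choice here: it keeps a fast reader from loading the whole 1 GB file into memory while slow HTTP calls are still in flight.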

  • 2021-01-02 05:45

    Read the CSV file in a single thread. Once you have a line, delegate it to one of the threads available in the pool by constructing your Runnable task and passing it to the ExecutorService's submit() method, which executes it asynchronously.

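    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class CsvDispatcher {
        public static void main(String[] args) throws IOException, InterruptedException {
            String fName = "C:\\Amit\\abc.csv";
            ExecutorService pool = Executors.newFixedThreadPool(1000);
            int count = 0; // crude throttle on how fast requests are submitted

            // BufferedReader replaces the deprecated DataInputStream.readLine()
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(fName))) {
                String thisLine;
                while ((thisLine = reader.readLine()) != null) {
                    if (count > 150) {
                        Thread.sleep(100); // pause after each batch of submissions
                        count = 0;
                    }
                    pool.submit(new MyTask(thisLine));
                    count++;
                }
            } finally {
                pool.shutdown(); // queued tasks still run; the JVM can exit afterwards
            }
        }
    }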
    

    Here is your task:

    class MyTask implements Runnable {
        private final String lLine; // one raw line from the CSV file

        public MyTask(String line) {
            this.lLine = line;
        }

        @Override
        public void run() {
            // 1) Create the request from lLine
            // 2) Send the HTTP request and receive the response
        }
    }
    
  • 2021-01-02 05:48

    Have one thread reading the file line by line and for every line read, post a task into an ExecutorService to perform the HTTP request for each one.

    Reading the file from multiple threads isn't going to work, as in order to read the nth line, you have to read all the others first. (It could work in theory if your file contained fixed width records, but CSV isn't a fixed width format.)
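
To make the fixed-width remark concrete: when every record has the same byte length, the position of record n is simply n times that length, so any thread can jump straight to its share of the file. The sketch below shows this with `RandomAccessFile`; the class name `FixedWidthRecords` and the 4-byte sample records are assumptions for illustration only. It does not apply to ordinary CSV, where line lengths vary.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FixedWidthRecords {
    // Reads record n (0-based) from a file whose records are all exactly
    // recordLength bytes long, line terminator included.
    static String readRecord(Path file, int recordLength, long n) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(n * recordLength); // jump directly to the record; no earlier lines are read
            byte[] buf = new byte[recordLength];
            raf.readFully(buf);
            return new String(buf, StandardCharsets.US_ASCII).trim();
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample file with three 4-byte records ("xxx" plus a newline each).
        Path tmp = Files.createTempFile("fixed", ".txt");
        Files.write(tmp, "aaa\nbbb\nccc\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(readRecord(tmp, 4, 2)); // prints ccc
        Files.delete(tmp);
    }
}
```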

  • 2021-01-02 06:00

    You can have one thread that reads the lines of the CSV and builds a List of the lines read. When the list reaches some limit, e.g. 100 lines, pass it to a fixed-size thread pool to be sent as a single request.

    I suspect that unless your server has 1000 cores, you might find that using 10-100 concurrent requests is faster.
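
A minimal sketch of this batching idea, under stated assumptions: `BatchSender` and `sendBatch` are hypothetical names, and `sendBatch` only returns the batch size where a real implementation would send one HTTP request. The batch size and thread count are parameters so that values such as 100 lines and 10 to 100 threads can be tried.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchSender {
    // Hypothetical stand-in for "send one HTTP request carrying this batch"; returns the lines sent.
    static int sendBatch(List<String> batch) {
        return batch.size();
    }

    // Reads on the calling thread, groups lines into batches, and sends the batches on a fixed pool.
    static int run(BufferedReader reader, int batchSize, int threads)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> results = new ArrayList<>();
        try {
            List<String> batch = new ArrayList<>(batchSize);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    List<String> full = batch;          // hand the full list to the pool
                    results.add(pool.submit(() -> sendBatch(full)));
                    batch = new ArrayList<>(batchSize); // start a fresh list for the reader
                }
            }
            if (!batch.isEmpty()) {                     // send the final, partial batch
                List<String> rest = batch;
                results.add(pool.submit(() -> sendBatch(rest)));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get();                       // waits for each batch to finish
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new StringReader("a\nb\nc\nd\ne\nf\ng\n"));
        System.out.println(run(reader, 3, 2) + " lines sent");
    }
}
```

Note that this sketch keeps a Future for every batch and does not bound the pool's internal queue; for a very large file, a bounded queue or a semaphore around `submit` would keep memory use in check.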
