I have a Java server app that downloads a CSV file and parses it. The parsing can take from 5 to 45 minutes, and happens each hour. This method is a bottleneck of the app, so it's not premature optimization.
A little late here, but there are now a few benchmarking projects for CSV parsers. Your selection will depend on the exact use-case (e.g. raw data vs. data binding, etc.).
Have you seen Apache Commons CSV?
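If you go that route, a minimal sketch of a one-pass read with Commons CSV could look like the following; the file name and the "name"/"age" column headers are placeholders, not something from the question:

import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class CommonsCsvExample {
    public static void main(String[] args) throws Exception {
        // "data.csv", "name" and "age" are placeholders; adjust to your file
        try (Reader in = Files.newBufferedReader(Paths.get("data.csv"))) {
            Iterable<CSVRecord> records = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in);
            for (CSVRecord record : records) {
                String name = record.get("name"); // columns addressed by header, no manual split
                int age = Integer.parseInt(record.get("age"));
                // process one row at a time...
            }
        }
    }
}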
Bear in mind that split only returns a view of the data, meaning that the original line object is not eligible for garbage collection whilst there is a reference to any of its views. Perhaps making a defensive copy will help? (Java bug report)
It is also not reliable in grouping escaped CSV columns containing commas.
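As an illustration of the defensive-copy idea (a sketch, not code from the answer), relevant on pre-Java 7u6 JDKs where substring/split shared the original string's backing char[]:

String line = "foo,bar,baz";          // stands in for one long line of the CSV
String[] parts = line.split(",");
for (int i = 0; i < parts.length; i++) {
    // new String(...) copies the characters, so the big backing buffer of 'line'
    // can be garbage-collected once 'line' itself goes out of scope
    parts[i] = new String(parts[i]);
}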
The problem with your code is that it's using replaceAll and split, which are very costly operations. You should definitely consider using a CSV parser/reader that does the parsing in one pass.
There is a benchmark on GitHub
https://github.com/uniVocity/csv-parsers-comparison
that unfortunately was run under Java 6. The numbers are slightly different under Java 7 and 8. I'm trying to get more detailed data for different file sizes, but it's a work in progress;
see https://github.com/arnaudroger/csv-parsers-comparison
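To illustrate the one-pass idea, here is a minimal sketch using univocity-parsers (the library behind that benchmark); the file name is a placeholder, and the settings shown are just the defaults plus header extraction:

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.FileReader;

public class UnivocityExample {
    public static void main(String[] args) throws Exception {
        CsvParserSettings settings = new CsvParserSettings();
        settings.setHeaderExtractionEnabled(true); // treat the first row as a header
        CsvParser parser = new CsvParser(settings);

        // "data.csv" is a placeholder path
        parser.beginParsing(new FileReader("data.csv"));
        String[] row;
        while ((row = parser.parseNext()) != null) {
            // process one parsed row at a time, no regex involved
        }
        parser.stopParsing();
    }
}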
The new kid on the block. It uses Java annotations and is built on apache-csv, which is one of the faster libraries out there for CSV parsing.
This library is also thread-safe, so if you want to re-use the CSVProcessor, you can and should.
Example:
Pojo
@CSVReadComponent(type = CSVType.NAMED)
@CSVWriteComponent(type = CSVType.ORDER)
public class Pojo {
    @CSVWriteBinding(order = 0)
    private String name;

    @CSVWriteBinding(order = 1)
    @CSVReadBinding(header = "age")
    private Integer age;

    @CSVWriteBinding(order = 2)
    @CSVReadBinding(header = "money")
    private Double money;

    @CSVReadBinding(header = "name")
    public void setA(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return "Name: " + name + System.lineSeparator() + "\tAge: " + age + System.lineSeparator() + "\tMoney: "
                + money;
    }
}
Main
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.*;

public class SimpleMain {
    public static void main(String[] args) {
        String csv = "name,age,money" + System.lineSeparator() + "Michael Williams,34,39332.15";
        CSVProcessor processor = new CSVProcessor(Pojo.class);
        List<Pojo> list = new ArrayList<>();
        try {
            list.addAll(processor.parse(new StringReader(csv)));
            list.forEach(System.out::println);

            System.out.println();

            StringWriter sw = new StringWriter();
            processor.write(list, sw);
            System.out.println(sw.toString());
        } catch (IOException e) {
            e.printStackTrace(); // don't swallow parse/write failures silently
        }
    }
}
Since this is built on top of apache-csv, you can use the powerful CSVFormat tool. Let's say the delimiter for the CSV is a pipe (|) instead of a comma (,); you could then, for example:
CSVFormat csvFormat = CSVFormat.DEFAULT.withDelimiter('|');
List<Pojo> list = processor.parse(new StringReader(csv), csvFormat);
Another benefit is that inheritance is also taken into consideration.
For other examples of handling reading/writing non-primitive data, see the library's example files.
Take a look at opencsv.
This blog post, "opencsv is an easy CSV parser", has example usage.
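A minimal opencsv read loop might look like this; the file name is a placeholder, and the exact exceptions thrown by readNext() vary by opencsv version:

import com.opencsv.CSVReader;
import java.io.FileReader;

public class OpenCsvExample {
    public static void main(String[] args) throws Exception {
        // "data.csv" is a placeholder path
        try (CSVReader reader = new CSVReader(new FileReader("data.csv"))) {
            String[] row;
            while ((row = reader.readNext()) != null) {
                // row[0], row[1], ... are the parsed columns of one line
            }
        }
    }
}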
Apart from the suggestions made above, I think you can try improving your code by using some threading and concurrency.
Following is a brief analysis and a suggested solution.
Though the solution involves some effort, it will surely help you in the end.
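As one illustration of that idea (not the author's specific solution), a minimal sketch that parses lines on a fixed thread pool could look like the following; the file name, Record type, and parseLine method are placeholders, and batching several lines per task would reduce the per-task overhead:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ParallelParseSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<Record>> futures = new ArrayList<>();

        // "data.csv" is a placeholder; the reader thread only does I/O
        try (BufferedReader reader = new BufferedReader(new FileReader("data.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                final String l = line;
                futures.add(pool.submit(() -> parseLine(l))); // CPU-bound parsing on worker threads
            }
        }

        List<Record> records = new ArrayList<>();
        for (Future<Record> f : futures) {
            records.add(f.get()); // collect results in the original line order
        }
        pool.shutdown();
    }

    static Record parseLine(String line) {
        // your existing per-line parsing logic goes here
        return new Record();
    }

    static class Record { }
}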