Fast CSV parsing

名媛妹妹 2020-11-28 10:25

I have a Java server app that downloads a CSV file and parses it. The parsing can take from 5 to 45 minutes, and happens each hour. This method is a bottleneck of the app, so it'…

9 answers
  • 2020-11-28 11:03

    A little late here, but there are now a few benchmarking projects for CSV parsers. Your selection will depend on the exact use case (i.e. raw data vs. data binding, etc.); a raw-parsing sketch follows the list below.

    • SimpleFlatMapper
    • uniVocity
    • sesseltjonna-csv (disclaimer: I wrote this parser)
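
    For the raw-data case, a minimal uniVocity sketch (assuming a local data.csv with a header row; names here are illustrative) could look like this:

    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    import java.io.FileReader;
    import java.util.List;

    public class UnivocityRawExample {
        public static void main(String[] args) throws Exception {
            CsvParserSettings settings = new CsvParserSettings();
            settings.setHeaderExtractionEnabled(true); // treat the first row as a header

            CsvParser parser = new CsvParser(settings);
            // single pass over the whole file; rows come back as String[]
            List<String[]> rows = parser.parseAll(new FileReader("data.csv"));
            for (String[] row : rows) {
                System.out.println(String.join(" | ", row));
            }
        }
    }
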
  • 2020-11-28 11:06

    Apache Commons CSV

    Have you seen Apache Commons CSV?
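
    A minimal sketch of its use (assuming a data.csv with a header row and a hypothetical name column):

    import java.io.Reader;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVRecord;

    public class CommonsCsvExample {
        public static void main(String[] args) throws Exception {
            try (Reader in = Files.newBufferedReader(Paths.get("data.csv"))) {
                // one-pass parse; the first record is used as the header
                for (CSVRecord record : CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in)) {
                    System.out.println(record.get("name")); // "name" is a hypothetical column
                }
            }
        }
    }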

    Caveat On Using split

    Bear in mind that split only returns a view of the data, meaning that the original line object is not eligible for garbage collection while there is a reference to any of its views. Perhaps making a defensive copy will help? (Java bug report)
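
    A minimal sketch of that defensive copy (mostly relevant on older JVMs, where substrings shared the parent string's backing array):

    String[] parts = line.split(",");
    for (int i = 0; i < parts.length; i++) {
        // copying drops the reference to line's internal character array
        parts[i] = new String(parts[i]);
    }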

    It is also not reliable at grouping escaped CSV columns that contain commas.

  • 2020-11-28 11:11

    The problem with your code is that it uses replaceAll and split, which are very costly operations. You should definitely consider using a CSV parser/reader that does a one-pass parse.

    There is a benchmark on GitHub:

    https://github.com/uniVocity/csv-parsers-comparison

    Unfortunately, it was run under Java 6. The numbers are slightly different under Java 7 and 8. I'm trying to get more detailed data for different file sizes, but it's a work in progress.

    see https://github.com/arnaudroger/csv-parsers-comparison
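
    For illustration, a rough one-pass sketch with uniVocity (assuming a local data.csv) that streams rows instead of calling replaceAll/split:

    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    import java.io.FileReader;

    public class OnePassParse {
        public static void main(String[] args) throws Exception {
            CsvParser parser = new CsvParser(new CsvParserSettings());

            // stream the file row by row in a single pass
            parser.beginParsing(new FileReader("data.csv"));
            String[] row;
            while ((row = parser.parseNext()) != null) {
                // process the row here
            }
            parser.stopParsing();
        }
    }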

  • 2020-11-28 11:18

    Quirk-CSV


    The new kid on the block. It uses Java annotations and is built on apache-csv, which is one of the faster libraries out there for CSV parsing.

    This library is also thread-safe, so if you want to re-use the CSVProcessor, you can and should.

    Example:

    Pojo

    @CSVReadComponent(type = CSVType.NAMED)
    @CSVWriteComponent(type = CSVType.ORDER)
    public class Pojo {
        @CSVWriteBinding(order = 0)
        private String name;
    
        @CSVWriteBinding(order = 1)
        @CSVReadBinding(header = "age")
        private Integer age;
    
        @CSVWriteBinding(order = 2)
        @CSVReadBinding(header = "money")
        private Double money;
    
        @CSVReadBinding(header = "name")
        public void setA(String name) {
            this.name = name;
        }
    
        @Override
        public String toString() {
            return "Name: " + name + System.lineSeparator() + "\tAge: " + age + System.lineSeparator()
                    + "\tMoney: " + money;
        }
    }
    

    Main

    import java.io.IOException;
    import java.io.StringReader;
    import java.io.StringWriter;
    import java.util.*;
    // plus the quirk-csv imports for CSVProcessor and the CSV* annotations

    public class SimpleMain {
        public static void main(String[] args) {
            String csv = "name,age,money" + System.lineSeparator() + "Michael Williams,34,39332.15";

            CSVProcessor processor = new CSVProcessor(Pojo.class);
            List<Pojo> list = new ArrayList<>();
            try {
                // read the CSV into Pojo instances, then write them back out
                list.addAll(processor.parse(new StringReader(csv)));
                list.forEach(System.out::println);

                System.out.println();

                StringWriter sw = new StringWriter();
                processor.write(list, sw);
                System.out.println(sw.toString());
            } catch (IOException e) {
                e.printStackTrace(); // don't swallow parse/write failures silently
            }
        }
    }
    

    Since this is built on top of apache-csv, you can use the powerful CSVFormat tool. Let's say the delimiter for the CSV is a pipe (|) instead of a comma (,); you could then do, for example:

    CSVFormat csvFormat = CSVFormat.DEFAULT.withDelimiter('|');
    List<Pojo> list = processor.parse(new StringReader(csv), csvFormat);
    

    Another benefit is that inheritance is also considered.

    For other examples on handling reading/writing non-primitive data, see the project's documentation.

  • 2020-11-28 11:21

    opencsv

    Take a look at opencsv.

    This blog post, opencsv is an easy CSV parser, has example usage.
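
    A minimal opencsv sketch (assuming a local data.csv and the com.opencsv package of recent versions):

    import com.opencsv.CSVReader;

    import java.io.FileReader;

    public class OpenCsvExample {
        public static void main(String[] args) throws Exception {
            try (CSVReader reader = new CSVReader(new FileReader("data.csv"))) {
                String[] row;
                // readNext returns null when the end of the file is reached
                while ((row = reader.readNext()) != null) {
                    System.out.println(String.join(" | ", row));
                }
            }
        }
    }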

  • 2020-11-28 11:21

    Apart from the suggestions made above, I think you can improve your code by using threading and concurrency.

    Following is a brief analysis and a suggested solution:

    1. From the code it seems that you are reading the data over the network (most likely with an HTTP client such as Apache Commons HttpClient).
    2. You need to make sure that the bottleneck you describe is not in the data transfer over the network.
    3. One way to check is to dump the data to a file (without parsing) and see how long that takes. This will tell you how much time is actually spent parsing, compared to your current observation.
    4. Now have a look at how the java.util.concurrent package is used. Some of the links you can use are (1, 2).
    5. The tasks you are doing in the for loop can each be executed in a thread.
    6. Using a thread pool and concurrency will greatly improve your performance; see the sketch below.

    Though the solution involves some effort, in the end it will surely help you.
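
    A rough sketch of that idea with an ExecutorService (the file name and the per-row work are placeholders, and it assumes each line can be parsed independently):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelParse {
        public static void main(String[] args) throws Exception {
            // read all lines first (assumes the file fits in memory), then parse them in parallel
            List<String> lines = Files.readAllLines(Paths.get("data.csv"));

            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            List<Future<String[]>> futures = new ArrayList<>();
            for (String line : lines) {
                // each line is parsed by a worker thread from the pool
                futures.add(pool.submit(() -> parseLine(line)));
            }
            for (Future<String[]> future : futures) {
                String[] row = future.get(); // blocks until that line has been parsed
                // ... do something with row
            }
            pool.shutdown();
        }

        // placeholder for the per-row work done inside the original for loop
        private static String[] parseLine(String line) {
            return line.split(","); // a real implementation should use a proper CSV parser
        }
    }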
