Good and effective CSV/TSV Reader for Java

南笙 2021-01-11 09:27

I am trying to read big CSV and TSV (tab-separated) Files with about 1000000 rows or more. Now I tried to read a TSV cont

4 answers
  • 2021-01-11 10:06

    I don't know if this question is still active, but here is the reader I use successfully. It may still need to implement more interfaces, such as Stream or Iterable:

    import java.io.Closeable;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Scanner;
    
    /** Reader for the tab-separated values format (a basic table format without escaping, where the columns of each row are separated by tabs). */
    public class TSVReader implements Closeable 
    {
        final Scanner in;
        String peekLine = null;
    
        /** Constructs a new TSVReader which produces values scanned from the specified input stream. */
        public TSVReader(InputStream stream)
        {
            in = new Scanner(stream);
        }
    
        /** Constructs a new TSVReader which produces values scanned from the specified file. */
        public TSVReader(File f) throws FileNotFoundException {in = new Scanner(f);}
    
        /** Returns true if another non-blank line is available; blank lines are skipped. */
        public boolean hasNextTokens()
        {
            // Loop instead of recursing, so long runs of blank lines cannot overflow the stack.
            while(peekLine==null && in.hasNextLine())
            {
                String line = in.nextLine();
                if(!line.trim().isEmpty()) {peekLine = line;}
            }
            return peekLine!=null;
        }
    
        /** Returns the fields of the next non-blank line, or null if there is none. */
        public String[] nextTokens()
        {
            if(!hasNextTokens()) return null;
            // Split on tabs only, so fields may contain spaces; the limit -1 keeps empty fields.
            String[] tokens = peekLine.split("\t", -1);
            peekLine=null;
            return tokens;
        }
    
        @Override public void close() throws IOException {in.close();}
    }
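
    As a rough sketch of the Stream idea mentioned above (not part of the original answer), the same line-by-line logic can be written with java.nio.file.Files.lines, which reads lazily so the whole file is never held in memory. The class name TsvStreamSketch and the method rows are made up for illustration, and UTF-8 is assumed as the file encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical helper: exposes the rows of a TSV file as a lazy Stream. */
final class TsvStreamSketch {

    /**
     * Returns one String[] per non-blank line, split on tabs only.
     * The caller must close the stream to release the underlying file.
     */
    static Stream<String[]> rows(Path file) throws IOException {
        // Assumption: the file is UTF-8 encoded.
        return Files.lines(file, StandardCharsets.UTF_8)
                .filter(line -> !line.trim().isEmpty())   // skip blank lines
                .map(line -> line.split("\t", -1));       // -1 keeps empty fields
    }
}
```

    A caller would typically wrap the result in try-with-resources, for example try (Stream<String[]> s = TsvStreamSketch.rows(Paths.get("data.tsv"))) { s.forEach(...); }, where data.tsv is a placeholder file name.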
    
  • 2021-01-11 10:12

    Try switching libraries as suggested by Satish. If that doesn't help, you have to split the whole file into tokens and process them.

    This assumes that your CSV does not contain any escaped commas:

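    // r is the BufferedReader pointed at your file
    String line;
    StringBuilder file = new StringBuilder();
    // Load each line and append it to file, followed by a comma so that the
    // last token of one line is not merged with the first token of the next.
    while ((line = r.readLine()) != null) {
        file.append(line).append(',');
    }
    // Split the whole content into an array of tokens
    String[] tokens = file.toString().split(",");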
    

    Then you can process the tokens. Don't forget to trim each token before using it.
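
    If holding the whole file in one StringBuilder is a concern for files with millions of rows, a variation is to split and process each line as it is read. The sketch below makes the same assumption as above (no quoted or escaped commas); the class CsvLineSketch, the method forEachRow, and the Consumer callback are illustrative names, not part of any library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;

/** Hypothetical helper: feeds each CSV line to a callback as trimmed tokens. */
final class CsvLineSketch {

    /** Reads line by line, so only the current line is held in memory. */
    static void forEachRow(Reader source, Consumer<String[]> handler) throws IOException {
        try (BufferedReader r = new BufferedReader(source)) {
            String line;
            while ((line = r.readLine()) != null) {
                // Limit -1 keeps trailing empty fields such as in "a,b,".
                String[] tokens = line.split(",", -1);
                for (int i = 0; i < tokens.length; i++) {
                    tokens[i] = tokens[i].trim();
                }
                handler.accept(tokens);
            }
        }
    }
}
```

    Because each row is handed to the callback and then discarded, memory use stays roughly constant regardless of the number of lines.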

  • 2021-01-11 10:24

    I have not tried it, but I looked into superCSV earlier.

    http://sourceforge.net/projects/supercsv/

    http://supercsv.sourceforge.net/

    Check whether that works for you with 2.5 million lines.

  • 2021-01-11 10:28

    Do not use a CSV parser to parse TSV inputs. It will break if the TSV has fields with a quote character, for example.
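
    To illustrate the point, here is a small sketch comparing a plain tab split with a deliberately simplified model of CSV-style quote handling. It is not the behavior of any specific CSV library; the class TsvQuoteSketch and both method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo: why quote-aware splitting can misread TSV data. */
final class TsvQuoteSketch {

    /** TSV view: every tab is a separator and quotes are ordinary characters. */
    static String[] splitTsv(String line) {
        return line.split("\t", -1);
    }

    /**
     * Simplified CSV-style view: a double quote toggles "quoted" mode and is
     * dropped, and delimiters inside quoted mode are treated as data.
     */
    static List<String> splitQuoteAware(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == delimiter && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }
}
```

    For a line whose first field begins with an unmatched quote, splitTsv still returns two fields, while the quote-aware split swallows the tab and returns a single field.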

    uniVocity-parsers comes with a TSV parser. You can parse a billion rows without problems.

    Example to parse a TSV input:

    TsvParserSettings settings = new TsvParserSettings();
    TsvParser parser = new TsvParser(settings);
    
    // parses all rows in one go.
    List<String[]> allRows = parser.parseAll(new FileReader(yourFile));
    

    If your input is so big it can't be kept in memory, do this:

    TsvParserSettings settings = new TsvParserSettings();
    
    // all rows parsed from your input will be sent to this processor
    ObjectRowProcessor rowProcessor = new ObjectRowProcessor() {
        @Override
        public void rowProcessed(Object[] row, ParsingContext context) {
            //here is the row. Let's just print it.
            System.out.println(Arrays.toString(row));
        }
    };
    // the ObjectRowProcessor supports conversions from String to whatever you need:
    // converts values in columns 2 and 5 to BigDecimal
    rowProcessor.convertIndexes(Conversions.toBigDecimal()).set(2, 5);
    
    // converts the values in columns "Description" and "Model". Applies trim and to lowercase to the values in these columns.
    rowProcessor.convertFields(Conversions.trim(), Conversions.toLowerCase()).set("Description", "Model");
    
    //configures to use the RowProcessor
    settings.setRowProcessor(rowProcessor);
    
    TsvParser parser = new TsvParser(settings);
    //parses everything. All rows will be pumped into your RowProcessor.
    parser.parse(new FileReader(yourFile));
    

    Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
