I am trying to read big CSV and TSV (tab-separated) files with about 1000000 rows or more. Now I tried to read a TSV
cont
I don't know if this question is still active, but here is the reader I use successfully. It may still need to implement more interfaces such as Stream or Iterable, however:
import java.io.Closeable;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;

/** Reader for the tab-separated values format (a basic table format without escaping, where the columns of each row are separated by tab characters). */
public class TSVReader implements Closeable
{
    final Scanner in;
    String peekLine = null;

    /** Constructs a new TSVReader which produces values scanned from the specified input stream. */
    public TSVReader(InputStream stream)
    {
        in = new Scanner(stream);
    }

    /** Constructs a new TSVReader which produces values scanned from the specified file. */
    public TSVReader(File f) throws FileNotFoundException {in = new Scanner(f);}

    /** Returns true if another non-blank line is available; blank lines are skipped. */
    public boolean hasNextTokens()
    {
        if(peekLine!=null) return true;
        // Loop instead of recursing, so long runs of blank lines cannot overflow the stack.
        while(in.hasNextLine())
        {
            String line = in.nextLine();
            if(line.trim().isEmpty()) continue;
            // Keep the line untrimmed so that empty leading or trailing fields survive.
            this.peekLine = line;
            return true;
        }
        return false;
    }

    /** Returns the fields of the next non-blank line, or null at the end of the input. */
    public String[] nextTokens()
    {
        if(!hasNextTokens()) return null;
        // Split on single tabs only, so fields may contain spaces; the limit -1 keeps trailing empty fields.
        String[] tokens = peekLine.split("\t", -1);
        peekLine=null;
        return tokens;
    }

    @Override public void close() throws IOException {in.close();}
}
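The Stream interface mentioned above can also be had without a custom reader class. What follows is only a minimal sketch of that idea, not part of the reader above: the names `TsvStreams`, `rows` and `parseLine` are made up for illustration, and it assumes plain TSV without any escaping.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.stream.Stream;

/** Hypothetical helper: exposes the rows of an unescaped TSV input as a lazy Stream. */
public class TsvStreams
{
    /** Splits one line on single tab characters; the limit -1 keeps trailing empty fields. */
    public static String[] parseLine(String line)
    {
        return line.split("\t", -1);
    }

    /** Returns a lazy stream of rows; blank lines are skipped. Closing the stream closes the reader. */
    public static Stream<String[]> rows(Reader source)
    {
        BufferedReader reader = new BufferedReader(source);
        return reader.lines()
            .filter(line -> !line.trim().isEmpty())
            .map(TsvStreams::parseLine)
            .onClose(() -> {
                try { reader.close(); }
                catch (IOException e) { throw new UncheckedIOException(e); }
            });
    }

    public static void main(String[] args)
    {
        // A StringReader stands in for a FileReader on the real file.
        String sample = "id\tname\n\n1\tfirst row\n2\t\n";
        try (Stream<String[]> stream = rows(new StringReader(sample)))
        {
            stream.forEach(row -> System.out.println(row.length + " fields: " + String.join(" | ", row)));
        }
    }
}
```

Because the stream is lazy, only the line currently being processed is held in memory, which is what matters for files with millions of rows.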
Try switching libraries as suggested by Satish. If that doesn't help, you have to split the whole file into tokens and process them yourself.
Assuming that your CSV doesn't have any escape characters for commas:
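// r is the BufferedReader pointed at your file
String line;
StringBuilder file = new StringBuilder();
// load each line and append it to file, followed by a comma so that the
// last token of one line does not merge with the first token of the next
while ((line=r.readLine())!=null){
    file.append(line).append(',');
}
// turn the content into an array of tokens
String[] tokens = file.toString().split(",");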
Then you can process it. Don't forget to trim the token before using it.
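With millions of rows, keeping the whole file in one StringBuilder may exhaust the heap. As an alternative, here is a minimal sketch that handles one line at a time. The class name `CsvLineByLine` and the helper `countTokens` are made up for illustration, and it still assumes a CSV without quoted or escaped commas.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class CsvLineByLine
{
    /** Reads the input line by line and returns the number of non-empty tokens seen. */
    public static long countTokens(Reader source) throws IOException
    {
        long count = 0;
        try (BufferedReader r = new BufferedReader(source))
        {
            String line;
            while ((line = r.readLine()) != null)
            {
                // Only the current line is held in memory.
                for (String token : line.split(","))
                {
                    String value = token.trim();
                    if (!value.isEmpty()) count++; // replace this with your own processing
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException
    {
        // A StringReader stands in for a FileReader on the real file.
        System.out.println(countTokens(new StringReader("a, b ,c\nd,,e\n"))); // prints 5
    }
}
```

Counting tokens is only a placeholder; the point is that each line is split, trimmed and processed before the next one is read.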
I have not tried it myself, but I looked into superCSV earlier.
http://sourceforge.net/projects/supercsv/
http://supercsv.sourceforge.net/
Check whether that works for you with 2.5 million lines.
Do not use a CSV parser to parse TSV inputs. It will break if the TSV has fields with a quote character, for example.
uniVocity-parsers comes with a TSV parser. You can parse a billion rows without problems.
Example to parse a TSV input:
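// imports needed by this snippet
import com.univocity.parsers.tsv.TsvParser;
import com.univocity.parsers.tsv.TsvParserSettings;
import java.io.FileReader;
import java.util.List;

TsvParserSettings settings = new TsvParserSettings();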
TsvParser parser = new TsvParser(settings);
// parses all rows in one go.
List<String[]> allRows = parser.parseAll(new FileReader(yourFile));
If your input is so big it can't be kept in memory, do this:
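// imports needed by this snippet
import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.ObjectRowProcessor;
import com.univocity.parsers.conversions.Conversions;
import com.univocity.parsers.tsv.TsvParser;
import com.univocity.parsers.tsv.TsvParserSettings;
import java.io.FileReader;
import java.util.Arrays;

TsvParserSettings settings = new TsvParserSettings();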
// all rows parsed from your input will be sent to this processor
ObjectRowProcessor rowProcessor = new ObjectRowProcessor() {
@Override
public void rowProcessed(Object[] row, ParsingContext context) {
//here is the row. Let's just print it.
System.out.println(Arrays.toString(row));
}
};
// the ObjectRowProcessor supports conversions from String to whatever you need:
// converts values in columns 2 and 5 to BigDecimal
rowProcessor.convertIndexes(Conversions.toBigDecimal()).set(2, 5);
// converts the values in columns "Description" and "Model". Applies trim and to lowercase to the values in these columns.
rowProcessor.convertFields(Conversions.trim(), Conversions.toLowerCase()).set("Description", "Model");
//configures to use the RowProcessor
settings.setRowProcessor(rowProcessor);
TsvParser parser = new TsvParser(settings);
//parses everything. All rows will be pumped into your RowProcessor.
parser.parse(new FileReader(yourFile));
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).