Question
I'm trying to parse values from a CSV file into a SQLite DB; however, the file is quite large (~2,500,000 lines). I ran my program for a few hours, printing how far it had got, but by my calculation the file would have taken about 100 hours to parse completely, so I stopped it.
I'm going to have to run this program as a background process at least once a week, on a new CSV file that is around 90% similar to the previous one. I have come up with a few solutions to improve my program; however, I don't know much about databases, so I have questions about each of my solutions.
Is there a more efficient way to read a CSV file than what I have already?
Is instantiating an ObjectOutputStream and storing the serialized bean as a BLOB significantly computationally expensive? I could add the values directly instead, but I use the BLOB later, so storing it now saves me from instantiating a new one multiple times.
Would connection pooling, or changing the way I use the Connection in some other way, be more efficient?
I'm setting the URL column as UNIQUE so I can use INSERT OR IGNORE, but testing this on smaller datasets (~10,000 lines) indicates that there is no performance gain compared to dropping the table and repopulating it. Is there a faster way to add only unique values?
Are there any obvious mistakes I'm making? (Again, I know very little about databases)
public class Database {

    public void createResultsTable() {
        Statement stmt;
        String sql = "CREATE TABLE results("
                + "ID INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, "
                + "TITLE TEXT NOT NULL, "
                + "URL TEXT NOT NULL UNIQUE, "
                ...
                ...
                + "SELLER TEXT NOT NULL, "
                + "BEAN BLOB);";
        try {
            stmt = c.createStatement();
            stmt.executeUpdate(sql);
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    public void addCSVToDatabase(Connection conn, String src) {
        BufferedReader reader = null;
        DBEntryBean b;
        String[] vals;
        try {
            reader = new BufferedReader(new InputStreamReader(new FileInputStream(src), "UTF-8"));
            for (String line; (line = reader.readLine()) != null; ) {
                // Each line takes the form: "title|URL|...|...|SELLER"
                vals = line.split("\\|");   // split() takes a regex, so '|' must be escaped
                b = new DBEntryBean();
                b.setTitle(vals[0]);
                b.setURL(vals[1]);
                ...
                ...
                b.setSeller(vals[n]);
                insert(conn, b);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (reader != null) {
                try { reader.close(); } catch (IOException e) { e.printStackTrace(); }
            }
        }
    }

    public void insert(Connection conn, DBEntryBean b) {
        PreparedStatement pstmt = null;
        String sql = "INSERT OR IGNORE INTO results("
                + "TITLE, "
                + "URL, "
                ...
                ...
                + "SELLER, "
                + "BEAN"
                + ") VALUES(?, ?, "
                ...
                ...
                + "?, ?);";   // one '?' placeholder per column listed above
        try {
            pstmt = conn.prepareStatement(sql);
            pstmt.setString(Constants.DB_COL_TITLE, b.getTitle());
            pstmt.setString(Constants.DB_COL_URL, b.getURL());
            ...
            ...
            pstmt.setString(Constants.DB_COL_SELLER, b.getSeller());

            // ByteArrayOutputStream baos = new ByteArrayOutputStream();
            // oos = new ObjectOutputStream(baos);
            // oos.writeObject(b);
            // byte[] bytes = baos.toByteArray();
            // pstmt.setBytes(Constants.DB_COL_BEAN, bytes);

            pstmt.executeUpdate();
        } catch (SQLException e) {
            e.printStackTrace();
        } finally {
            if (pstmt != null) {
                try {
                    pstmt.close();
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
Answer 1:
The biggest bottleneck in your code is that you are not batching the insert operations. You should call pstmt.addBatch() instead of pstmt.executeUpdate(), and execute the batch once you have accumulated something like 10,000 rows to insert.
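Roughly, that pattern could look like the sketch below (the class name and the shortened three-column list are illustrative only, and the 10,000-row batch size is just the ballpark figure mentioned above):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.PreparedStatement;

public class BatchedCsvLoader {

    // Sketch only: the column list is cut down to TITLE, URL and SELLER for brevity,
    // and a single PreparedStatement is reused for every row instead of being re-created.
    public static void addCSVToDatabase(Connection conn, String src) throws Exception {
        String sql = "INSERT OR IGNORE INTO results(TITLE, URL, SELLER) VALUES(?, ?, ?);";
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(new FileInputStream(src), "UTF-8"));
             PreparedStatement pstmt = conn.prepareStatement(sql)) {

            int count = 0;
            for (String line; (line = reader.readLine()) != null; ) {
                String[] vals = line.split("\\|");
                pstmt.setString(1, vals[0]);               // TITLE
                pstmt.setString(2, vals[1]);               // URL
                pstmt.setString(3, vals[vals.length - 1]); // SELLER
                pstmt.addBatch();                          // queue the row instead of executing it

                if (++count % 10_000 == 0) {
                    pstmt.executeBatch();                  // send the accumulated rows at once
                }
            }
            pstmt.executeBatch();                          // flush whatever is left over
        }
    }
}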
On the CSV parsing side, you should really consider using a CSV library to do the parsing for you. univocity-parsers has the fastest CSV parser around, and it should process these 2.5 million lines in less than a second. I'm the author of this library, by the way.
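A rough sketch of what the parsing loop could look like with univocity-parsers, assuming the '|'-delimited format described in the question (the class and method names here are illustrative only):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class PipeDelimitedReader {

    public static void parse(String src) throws Exception {
        CsvParserSettings settings = new CsvParserSettings();
        settings.getFormat().setDelimiter('|');    // the question's file is '|'-delimited, not comma-delimited

        CsvParser parser = new CsvParser(settings);
        try (Reader in = new InputStreamReader(new FileInputStream(src), "UTF-8")) {
            parser.beginParsing(in);
            String[] row;
            while ((row = parser.parseNext()) != null) {
                // row[0] = title, row[1] = URL, ..., row[row.length - 1] = seller
                // build the DBEntryBean and hand it to the batched insert here
            }
        }
    }
}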
String.split() is convenient but not fast. For anything more than a few dozen rows it doesn't make sense to use it.
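As a side note, if you do keep String.split() for a while, one small middle ground is to precompile the delimiter pattern once instead of passing the regex string on every call; a sketch (the helper class is illustrative only):

import java.util.regex.Pattern;

public class LineSplitter {

    // split() takes a regular expression, so the pipe has to be escaped;
    // compiling the pattern once avoids setting it up again for every one of the ~2.5M lines.
    private static final Pattern PIPE = Pattern.compile("\\|");

    public static String[] split(String line) {
        return PIPE.split(line);
    }
}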
Hope this helps.
Source: https://stackoverflow.com/questions/41517896/efficiently-adding-huge-amounts-of-data-from-csv-files-into-an-sqlite-db-in-java