Efficiently adding huge amounts of data from CSV files into an SQLite DB in Java [duplicate]


Question


I'm trying to parse values from a CSV file into an SQLite DB, but the file is quite large (~2,500,000 lines). I ran my program for a few hours, printing where it was up to, but by my calculation the file would have taken about 100 hours to parse completely, so I stopped it.

I'm going to have to run this program as a background process at least once a week, on a new CSV file that is around 90% similar to the previous one. I have come up with a few solutions to improve my program; however, I don't know much about databases, so I have questions about each of them.

  • Is there a more efficient way to read a CSV file than what I have already?

  • Is instantiating an ObjectOutputStream and storing the bean as a BLOB significantly computationally expensive? I could add the values directly instead, but I use the BLOB later, so storing it now saves me from instantiating a new one multiple times.

  • Would connection pooling, or changing the way I use the Connection in some other way, be more efficient?

  • I'm setting the URL column as UNIQUE so I can use INSERT OR IGNORE, but testing this on smaller datasets (~10,000 lines) indicates that there is no performance gain compared to dropping the table and repopulating it. Is there a faster way to add only unique values?

  • Are there any obvious mistakes I'm making? (Again, I know very little about databases)

    public class Database{

    private Connection c; // connection used by the methods below; assumed to be opened elsewhere

    public void createResultsTable(){
        String sql = "CREATE TABLE results("
                + "ID       INTEGER     NOT NULL    PRIMARY KEY AUTOINCREMENT, "
                + "TITLE    TEXT        NOT NULL, "
                + "URL      TEXT        NOT NULL    UNIQUE, "
                ...
                ...
                + "SELLER   TEXT        NOT NULL, "
                + "BEAN     BLOB);";
        // try-with-resources closes the Statement automatically
        try (Statement stmt = c.createStatement()) {
            stmt.executeUpdate(sql);
        } catch (SQLException e) { e.printStackTrace(); }
    }
    
    
    public void addCSVToDatabase(Connection conn, String src){

        DBEntryBean b;
        String[] vals;

        // try-with-resources closes the reader even if an exception is thrown
        try(BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(src), "UTF-8"))){
            for(String line; (line = reader.readLine()) != null;){
                //Each line takes the form: "title|URL|...|...|SELLER"
                //split() takes a regex, so the pipe character must be escaped
                vals = line.split("\\|");

                b = new DBEntryBean();
                b.setTitle(vals[0]);
                b.setURL(vals[1]);
                ...
                ...
                b.setSeller(vals[n]);

                insert(conn, b);
            }
        } catch(IOException e){
            e.printStackTrace();
        }
    }
    
    
    public void insert(Connection conn, DBEntryBean b){

        PreparedStatement pstmt = null;
        String sql = "INSERT OR IGNORE INTO results("
                + "TITLE, "
                + "URL, "
                ...
                ...
                + "SELLER, "
                + "BEAN) "
                + "VALUES (?, ?, "
                ...
                ...
                + "?, ?);"; // the original SQL was missing the VALUES clause; one '?' per column

        try {
            pstmt = conn.prepareStatement(sql);
            pstmt.setString(Constants.DB_COL_TITLE, b.getTitle());
            pstmt.setString(Constants.DB_COL_URL, b.getURL());
            ...
            ...
            pstmt.setString(Constants.DB_COL_SELLER, b.getSeller());

            // Serialize the bean and store it in the BLOB column:
            // ByteArrayOutputStream baos = new ByteArrayOutputStream();
            // ObjectOutputStream oos = new ObjectOutputStream(baos);
            // oos.writeObject(b);
            // byte[] bytes = baos.toByteArray();
            // pstmt.setBytes(Constants.DB_COL_BEAN, bytes);
            pstmt.executeUpdate();
        } catch (SQLException e) { e.printStackTrace(); 
        } finally{
            if(pstmt != null){
                try{ pstmt.close(); }
                catch (SQLException e) { e.printStackTrace(); }
            }
    
        }
    }
    
    
    }
    

Answer 1:


The biggest bottleneck in your code is that you are not batching the insert operations. You should call pstmt.addBatch() instead of pstmt.executeUpdate(), and execute the batch once you have accumulated something like 10,000 rows to insert, as sketched below.
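
For illustration, a minimal sketch of what the batched version could look like. The names insertAll, BATCH_SIZE, and the bind(...) helper are mine, not part of the original code; the sql string is assumed to be the INSERT OR IGNORE statement from the question (with its VALUES clause):

    // Hedged sketch: batch the inserts and commit them in one transaction.
    private static final int BATCH_SIZE = 10_000; // illustrative batch size

    public void insertAll(Connection conn, List<DBEntryBean> beans, String sql) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // one transaction instead of one per row
        try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
            int count = 0;
            for (DBEntryBean b : beans) {
                bind(pstmt, b);           // hypothetical helper: the setString(...) calls from insert()
                pstmt.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    pstmt.executeBatch(); // flush every 10K rows
                }
            }
            pstmt.executeBatch();         // flush whatever is left
            conn.commit();
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }

Committing once per batch also matters for SQLite specifically, since by default every individual INSERT runs in its own transaction.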

On the CSV parsing side, you should really consider using a CSV library to do the parsing for you. univocity-parsers has the fastest CSV parser around, and it should process these 2.5 million lines in less than a second. I'm the author of this library, by the way.
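
A rough sketch of how that could look (API as of the library's 2.x releases; check the docs for the version you use). The pipe delimiter is taken from the format described in the question:

    try (Reader input = new InputStreamReader(new FileInputStream(src), "UTF-8")) {
        CsvParserSettings settings = new CsvParserSettings();
        settings.getFormat().setDelimiter('|'); // pipe-separated, per the question

        CsvParser parser = new CsvParser(settings);
        parser.beginParsing(input);

        String[] row;
        while ((row = parser.parseNext()) != null) {
            // row[0] = title, row[1] = URL, ..., last column = seller;
            // build the DBEntryBean here and add it to the current insert batch
        }
    } catch (IOException e) {
        e.printStackTrace();
    }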

String.split() is convenient but not fast. For anything more than a few dozen rows it doesn't make sense to use it.

Hope this helps.



Source: https://stackoverflow.com/questions/41517896/efficiently-adding-huge-amounts-of-data-from-csv-files-into-an-sqlite-db-in-java
