I want to fetch millions of rows from a table between two timestamps and then do processing over them. Firing a single query and retrieving all the records at once looks to be a bad idea.
The pagination pattern was invented for website presentation (as opposed to scrolling navigation), and it works best there. In short, a live user is practically unable to view thousands or millions of records at once, so the information is divided into short pages (50–200 records), and usually one query is sent to the database per page. The user typically clicks through only a few pages rather than browsing all of them, and needs some time to read each page, so the queries are not fired at the database back to back but at long intervals. Retrieving a small chunk of data is much faster than retrieving all the millions of records, so the user is happy not to wait long for each page, and the overall system load stays small.
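To make those mechanics concrete (my own toy arithmetic, not from the question): with a fixed page size, page N always maps to one small, fixed window of rows, so only that window travels to the client per click.

```java
public class PageWindow {

    // Hypothetical helper: map a 1-based page number to the number of rows
    // that must be skipped before the page starts.
    static int offsetFor(int page, int pageSize) {
        return (page - 1) * pageSize;
    }

    public static void main(String[] args) {
        int pageSize = 100;
        int page = 3;
        int offset = offsetFor(page, pageSize);
        // Page 3 with 100 rows per page covers rows 200..299 (0-based).
        System.out.println("Page " + page + ": rows " + offset + ".." + (offset + pageSize - 1));
    }
}
```

So a click on page 3 moves only 100 rows over the wire, no matter how big the table is.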
But it seems from the question that your application is oriented to batch processing rather than web presentation: it must fetch all the records and perform some operations/transformations (calculations) on each of them. In that case completely different design patterns are used (stream/pipelined processing, a sequence of steps, parallel steps/operations, etc.), and pagination will not work; if you go that way, you will kill your system's performance.
Instead of fancy theory, let's look at a simple and practical example that shows what differences in speed we are talking about here.
Let's say there is a table PAGINATION with about 7 million records:
create table pagination as
select sysdate - 200 * dbms_random.value as my_date, t.*
from (
  select o.* from all_objects o
  cross join (select * from dual connect by level <= 100)
  fetch first 10000000 rows only
) t;
select count(*) from pagination;
COUNT(*)
----------
7369600
Let's say there is an index created on the MY_DATE column, and the statistics are fresh:
create index PAGINATION_IX on pagination( my_date );
BEGIN dbms_stats.gather_table_stats( 'TEST', 'PAGINATION', method_opt => 'FOR ALL COLUMNS' ); END;
/
Let's say that we are going to process about 10% of the records in the table, those between the dates below:
select count(*) from pagination
where my_date between date '2017-10-01' and date '2017-10-21';
COUNT(*)
----------
736341
And finally, let's say that our "processing", for simplicity, will consist of summing the lengths of one of the fields.
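The "processing" itself is trivial; on an in-memory list of strings (a stand-in for the OBJECT_NAME column, with my own sample values) it is just:

```java
import java.util.List;

public class SumLengths {

    // The whole "processing": sum the lengths of one field.
    static long sum(List<String> objectNames) {
        return objectNames.stream().mapToLong(String::length).sum();
    }

    public static void main(String[] args) {
        // Toy stand-in for OBJECT_NAME values fetched from the table.
        System.out.println(sum(List.of("DUAL", "ALL_OBJECTS", "PAGINATION"))); // 4 + 11 + 10 = 25
    }
}
```

The point of the benchmark is therefore almost entirely the cost of fetching the rows, not of processing them.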
This is a simple paging implementation:
import java.sql.Connection;
import java.sql.Date;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

import oracle.jdbc.pool.OracleDataSource;

public class Pagination {

    public static class RecordPojo {
        Date myDate;
        String objectName;

        public Date getMyDate() {
            return myDate;
        }
        public RecordPojo setMyDate(Date myDate) {
            this.myDate = myDate;
            return this;
        }
        public String getObjectName() {
            return objectName;
        }
        public RecordPojo setObjectName(String objectName) {
            this.objectName = objectName;
            return this;
        }
    }
    static class MyPaginator {
        private Connection conn;
        private int pageSize;
        private int currentPage = 0;

        public MyPaginator(Connection conn, int pageSize) {
            this.conn = conn;
            this.pageSize = pageSize;
        }

        static final String QUERY = ""
                + "SELECT my_date, object_name FROM pagination "
                + "WHERE my_date BETWEEN date '2017-10-01' AND date '2017-10-21' "
                + "ORDER BY my_date "
                + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";

        List<RecordPojo> getNextPage() {
            List<RecordPojo> list = new ArrayList<>();
            try (PreparedStatement ps = conn.prepareStatement(QUERY)) {
                ps.setInt(1, pageSize * currentPage++);
                ps.setInt(2, pageSize);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        list.add(new RecordPojo()
                                .setMyDate(rs.getDate(1))
                                .setObjectName(rs.getString(2)));
                    }
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
            return list;
        }

        public int getCurrentPage() {
            return currentPage;
        }
    }
    public static void main(String... x) throws SQLException {
        OracleDataSource ds = new OracleDataSource();
        ds.setURL("jdbc:oracle:thin:test/test@//localhost:1521/orcl");
        long startTime = System.currentTimeMillis();
        long value = 0;
        int pageSize = 1000;
        try (Connection conn = ds.getConnection()) {
            MyPaginator p = new MyPaginator(conn, pageSize);
            List<RecordPojo> list;
            while ((list = p.getNextPage()).size() > 0) {
                value += list.stream().mapToLong(y -> y.getObjectName().length()).sum();
                System.out.println("Page: " + p.getCurrentPage());
            }
            System.out.format("==================%nValue = %d, Pages = %d, time = %d seconds",
                    value, p.getCurrentPage(), (System.currentTimeMillis() - startTime) / 1000);
        }
    }
}
The result is:
Value = 18312338, Pages = 738, time = 2216 seconds
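A back-of-the-envelope calculation hints at why the paged run is so slow: with OFFSET, the database has to walk past all previously returned rows on every page, so the total rows visited grow roughly quadratically with the page count. This is my own simplified model (it ignores caching and index internals), but the order of magnitude is telling:

```java
public class OffsetCost {

    // Rows the database must visit in total when paging with OFFSET:
    // page p has to skip (p - 1) * pageSize rows before returning its own rows.
    static long rowsVisited(long totalRows, int pageSize) {
        long pages = (totalRows + pageSize - 1) / pageSize;
        long visited = 0;
        for (long p = 1; p <= pages; p++) {
            visited += Math.min(totalRows, p * pageSize);
        }
        return visited;
    }

    public static void main(String[] args) {
        long totalRows = 736_341;   // rows in the date range, from the count above
        int pageSize = 1_000;
        System.out.println("Rows visited with OFFSET paging:  " + rowsVisited(totalRows, pageSize));
        System.out.println("Rows visited in one streaming pass: " + totalRows);
        // ~272 million vs ~0.74 million: the paged run does hundreds of times more index walking.
    }
}
```

That gap in rows visited, plus 738 separate query round trips, is where the extra 2200 seconds go.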
Now let's test a very simple stream-based solution: take just one record at a time, process it, discard it (freeing up memory), and take the next one.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import oracle.jdbc.pool.OracleDataSource;

public class NoPagination {

    static final String QUERY = ""
            + "SELECT my_date, object_name FROM pagination "
            + "WHERE my_date BETWEEN date '2017-10-01' AND date '2017-10-21' "
            + "ORDER BY my_date";

    public static void main(String[] args) throws SQLException {
        OracleDataSource ds = new OracleDataSource();
        ds.setURL("jdbc:oracle:thin:test/test@//localhost:1521/orcl");
        long startTime = System.currentTimeMillis();
        long count = 0;
        try (Connection conn = ds.getConnection();
             PreparedStatement ps = conn.prepareStatement(QUERY);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // processing -- RecordPojo from the Pagination example (same package)
                RecordPojo r = new RecordPojo()
                        .setMyDate(rs.getDate(1))
                        .setObjectName(rs.getString(2));
                count += r.getObjectName().length();
            }
            System.out.format("==================%nValue = %d, time = %d seconds",
                    count, (System.currentTimeMillis() - startTime) / 1000);
        }
    }
}
The result is:
Value = 18312328, time = 11 seconds
Yes: 2216 seconds / 11 seconds ≈ 201 times faster, i.e. about 20,100% faster!
Unbelievable? You can test it yourself.
This example shows how important it is to choose the right solution (the right design patterns) for the problem at hand.