Oracle Pagination strategy

盖世英雄少女心 2021-01-24 10:01

I want to fetch millions of rows from a table between two timestamps and then do processing over them. Firing a single query and retrieving all the records at once looks to be a bad option.

2 answers
  • 2021-01-24 10:54

    The pagination pattern was invented for presenting data on websites (as opposed to scrolling navigation), and that is where it works best. In short, a live user is practically unable to view thousands or millions of records at once, so the information is divided into short pages (50~200 records), and usually one query is sent to the database per page. The user typically clicks through only a few of the pages rather than browsing all of them, and needs some time to read each page, so the queries are not fired at the database back to back but at long intervals. Retrieving one chunk of data is much faster than retrieving all the millions of records, so the user is happy because subsequent pages arrive quickly, and the overall system load is smaller.


    But it seems from the question that the nature of your application is oriented to batch processing rather than web presentation. The application must fetch all the records and perform some operations/transformations (calculations) on each of them. In this case, completely different design patterns are used (stream/pipelined processing, a sequence of steps, parallel steps/operations, etc.), and pagination will not work; if you go that way you will kill your system's performance.


    Instead of fancy theory, let's look at a simple, practical example that shows the kind of speed difference we are talking about here.

    Let's say there is a table PAGINATION with about 7 million records:

    create table pagination as
    select sysdate - 200 * dbms_random.value As my_date, t.*
    from (
        select o.* from all_objects o 
        cross join (select * from dual connect by level <= 100)
        fetch first 10000000 rows only
    ) t;
    
    select count(*) from pagination;
    
      COUNT(*)
    ----------
       7369600
    

    Let's say there is an index created on the MY_DATE column, and the statistics are fresh:

    create index PAGINATION_IX on pagination( my_date );
    
    BEGIN dbms_stats.gather_table_stats( 'TEST', 'PAGINATION', method_opt => 'FOR ALL COLUMNS' ); END;
    /
    

    Let's say that we are going to process about 10% of the records in the table, those between the dates below:

    select count(*) from pagination
    where my_date between date '2017-10-01' and date '2017-10-21';
    
      COUNT(*)
    ----------
        736341
    

    and finally, let's say that our "processing", for simplicity, will consist of summing the lengths of one of the fields.
    This is a simple paging implementation:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    import oracle.jdbc.pool.OracleDataSource;

    public class Pagination {
    
        public static class RecordPojo {
            Date myDate;
            String objectName;
    
            public Date getMyDate() {
                return myDate;
            }
            public RecordPojo setMyDate(Date myDate) {
                this.myDate = myDate;
                return this;
            }
            public String getObjectName() {
                return objectName;
            }
            public RecordPojo setObjectName(String objectName) {
                this.objectName = objectName;
                return this;
            }
        };
    
        static class MyPaginator{
    
            private Connection conn;
            private int pageSize;
            private int currentPage = 0;
    
            public MyPaginator( Connection conn, int pageSize ) {
                this.conn = conn;
                this.pageSize = pageSize;
            }
    
            static final String QUERY = ""
                    + "SELECT my_date, object_name FROM pagination "
                    + "WHERE my_date between date '2017-10-01' and '2017-10-21' "
                    + "ORDER BY my_date "
                    + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";
    
            List<RecordPojo> getNextPage() {
                List<RecordPojo> list = new ArrayList<>();
                try (PreparedStatement ps = conn.prepareStatement(QUERY)) {
                    ps.setInt(1, pageSize * currentPage++);   // OFFSET grows by one page each call
                    ps.setInt(2, pageSize);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            list.add(new RecordPojo().setMyDate(rs.getDate(1)).setObjectName(rs.getString(2)));
                        }
                    }
                } catch (SQLException e) {
                    e.printStackTrace();
                }
                return list;
            }
    
            public int getCurrentPage() {
                return currentPage;
            }
        }
    
    
        public static void main(String ...x) throws SQLException {
            OracleDataSource ds = new OracleDataSource();
            ds.setURL("jdbc:oracle:thin:test/test@//localhost:1521/orcl");
            long startTime = System.currentTimeMillis();
            long value = 0;
            int pageSize = 1000;
    
            try( Connection conn = ds.getConnection();){
                MyPaginator p = new MyPaginator(conn, pageSize);
                List<RecordPojo> list;
                while( ( list = p.getNextPage()).size() > 0 ) {
                    value += list.stream().map( y -> y.getObjectName().length()).mapToLong(Integer::longValue).sum();
                    System.out.println("Page: " + p.getCurrentPage());
                }
                System.out.format("==================\nValue = %d, Pages = %d,  time = %d seconds", value, p.getCurrentPage(), (System.currentTimeMillis() - startTime)/1000);
            }
        }
    }
    

    A result is:

    Value = 18312338, Pages = 738,  time = 2216 seconds
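
    Why is OFFSET paging so slow here? Every page runs a fresh query, and the database has to order and walk past all the rows that precede the requested page before it can return anything, so the total amount of work grows roughly quadratically with the number of pages. Here is a back-of-the-envelope sketch of that effect (assuming the 736,341 matching rows and 1,000-row pages from this test; the exact cost depends on the access path Oracle chooses):

    public class OffsetCostEstimate {
        public static void main(String[] args) {
            long matchingRows = 736_341L;   // rows between the two dates (from the count above)
            long pageSize = 1_000L;
            long pages = (matchingRows + pageSize - 1) / pageSize;    // about 737 pages

            long rowsVisited = 0;
            for (long page = 0; page < pages; page++) {
                long offset = page * pageSize;                         // rows skipped before this page
                long fetched = Math.min(pageSize, matchingRows - offset);
                rowsVisited += offset + fetched;                       // skipped + returned
            }
            // prints roughly 272 million row visits in total,
            // versus about 736 thousand for a single streaming query
            System.out.println(rowsVisited);
        }
    }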
    

    Now let's test a very simple stream-based solution: take one record at a time, process it, discard it (freeing up memory), and take the next one.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import oracle.jdbc.pool.OracleDataSource;

    public class NoPagination {
    
        static final String QUERY = ""
                + "SELECT my_date, object_name FROM pagination "
                + "WHERE my_date between date '2017-10-01' and '2017-10-21' "
                + "ORDER BY my_date ";
    
        public static void main(String[] args) throws SQLException {
            OracleDataSource ds = new OracleDataSource();
            ds.setURL("jdbc:oracle:thin:test/test@//localhost:1521/orcl");
            long startTime = System.currentTimeMillis();
            long count = 0;
    
            ResultSet rs = null;
            PreparedStatement ps = null;
            try( Connection conn = ds.getConnection();){
                ps = conn.prepareStatement(QUERY);
                rs = ps.executeQuery();
                while( rs.next()) {
                    // processing
                    Pagination.RecordPojo r = new Pagination.RecordPojo().setMyDate(rs.getDate(1)).setObjectName(rs.getString(2));
                    count+=r.getObjectName().length();
                }
                System.out.format("==================\nValue = %d, time = %d seconds", count, (System.currentTimeMillis() - startTime)/1000);
            } finally {
                try { rs.close(); } catch (Exception e) {}
                try { ps.close(); } catch (Exception e) {}
            }
        }
    }

    A result is:

    Value = 18312328, time = 11 seconds
    

    Yes: 2216 seconds / 11 seconds = about 201 times faster, or roughly 20,100% faster!
    Unbelievable? You can test it yourself.
    This example shows how important it is to choose the right solution (right design patterns) to solve the problem.
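
    One more detail that matters when streaming a big result set over JDBC is the driver fetch size: the Oracle JDBC driver fetches only 10 rows per network round trip by default, so raising it reduces the number of round trips considerably. Below is a hedged variant of the NoPagination example (the class name NoPaginationBigFetch and the value 500 are only illustrative; setFetchSize is standard JDBC and is only a hint to the driver):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import oracle.jdbc.pool.OracleDataSource;

    public class NoPaginationBigFetch {

        static final String QUERY = ""
                + "SELECT my_date, object_name FROM pagination "
                + "WHERE my_date between date '2017-10-01' and date '2017-10-21' "
                + "ORDER BY my_date";

        public static void main(String[] args) throws SQLException {
            OracleDataSource ds = new OracleDataSource();
            ds.setURL("jdbc:oracle:thin:test/test@//localhost:1521/orcl");

            long count = 0;
            try (Connection conn = ds.getConnection();
                 PreparedStatement ps = conn.prepareStatement(QUERY)) {
                ps.setFetchSize(500);                       // rows per round trip; Oracle's default is 10
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        count += rs.getString(2).length();  // same trivial "processing" as above
                    }
                }
            }
            System.out.format("Value = %d%n", count);
        }
    }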

  • 2021-01-24 11:01

    You didn't say whether you plan to adjust "X" and "Y" each time you paginate. If you don't, then this approach is probably only valid if you have high confidence that the data is fairly static.

    Consider the following example:

    My table T has 100 rows dated "today", with ID=1 to 100 respectively, and I want the last 20 rows for my first page. So I do this:

    select * 
    from T 
    where date_col = trunc(sysdate) 
    order by id desc
    fetch first 20 rows only
    

    I run my query and get ID=100 down to 80. So far so good: it is all on the user's page, and they take 30 seconds to read the data. During that time, another 17 records have been added to the table (ID=101 to 117).

    Now the user presses "Next Page"

    Now I run the query again to get the next set

    select * 
    from T 
    where date_col = trunc(sysdate) 
    order by id desc
    offset 20 fetch next 20 rows only
    

    They will not see rows 80 down to 60, which would be their expectation, because the data has changed. They would

    a) skip rows ID=117 down to 98 because of the OFFSET, and then
    b) get rows ID=97 down to 78 displayed on screen.

    They'll be confused because they are seeing pretty much the same set of rows as they did on the first page.

    For pagination against changing data, you generally want to stay away from the OFFSET clause and instead have your application keep track of where you got up to, i.e.

    Page 1

    select * 
    from T 
    where date_col = trunc(sysdate) 
    order by id desc
    fetch first 20 rows only
    

    I fetch ID=100 down to 80...I take note of the 80. My next query will then be

    select * 
    from T 
    where date_col = trunc(sysdate) 
    AND ID<80
    order by id desc
    fetch first 20 rows only
    

    and my next query would be

    select * 
    from T 
    where date_col = trunc(sysdate) 
    AND ID<60
    order by id desc
    fetch first 20 rows only
    

    and so forth.
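
    If you want to wire this up from Java, here is a minimal JDBC sketch of the same "remember where you got up to" (keyset) approach, assuming a table T with a numeric ID, a DATE_COL and some COL1 to display; the class and column names are only illustrative:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class KeysetPager {

        // First page: no lower bound yet.
        static final String FIRST_PAGE = ""
                + "SELECT id, col1 FROM t "
                + "WHERE date_col = trunc(sysdate) "
                + "ORDER BY id DESC "
                + "FETCH FIRST 20 ROWS ONLY";

        // Every later page restarts strictly below the last ID already shown.
        static final String NEXT_PAGE = ""
                + "SELECT id, col1 FROM t "
                + "WHERE date_col = trunc(sysdate) AND id < ? "
                + "ORDER BY id DESC "
                + "FETCH FIRST 20 ROWS ONLY";

        // Returns the lowest ID on the page, or -1 if the page was empty.
        static long showPage(Connection conn, Long lastSeenId) throws SQLException {
            String sql = (lastSeenId == null) ? FIRST_PAGE : NEXT_PAGE;
            long lowestId = -1;
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                if (lastSeenId != null) {
                    ps.setLong(1, lastSeenId);      // "where I got up to" from the previous page
                }
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        lowestId = rs.getLong(1);
                        System.out.println(rs.getLong(1) + " " + rs.getString(2));
                    }
                }
            }
            return lowestId;                        // feed back in as lastSeenId for the next page
        }
    }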
