Improving OFFSET performance in PostgreSQL

2021-01-29 20:46

I have a table I'm doing an ORDER BY on before a LIMIT and OFFSET in order to paginate.

Adding an index on the ORDER BY column makes a massive difference to performance.
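
For concreteness, a minimal sketch of the pattern in question; the table and column names here are hypothetical, not from the original question:

    CREATE INDEX items_created_at_idx ON items (created_at);

    -- fast for small offsets, but cost grows with OFFSET, because the
    -- skipped rows still have to be read in order before being discarded
    SELECT *
    FROM items
    ORDER BY created_at
    LIMIT 50
    OFFSET 100000;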

5 Answers
  • 2021-01-29 21:17

    I don't know anything about "counted b-tree indexes", but one thing we've done in our application to help with this is break our queries into two, possibly using a sub-query. My apologies for wasting your time if you're already doing this.

    SELECT *
    FROM massive_table
    WHERE id IN (
        SELECT id
        FROM massive_table
        WHERE ...
        ORDER BY id      -- sort only the id column here (assuming id is the pagination sort key)
        LIMIT 50
        OFFSET 500000
    )
    ORDER BY id;         -- re-sort the 50 returned rows; IN does not preserve order
    

    The advantage here is that, while it still has to calculate the proper ordering of everything, it doesn't order the entire row--only the id column.

  • 2021-01-29 21:18

    I recently worked on a problem like this and wrote a blog post about how I approached it; I hope it is helpful to someone. I used a lazy-list approach with partial fetching: I replaced the LIMIT/OFFSET pagination of the query with manual pagination. In my example, the select returns about 10 million records; I fetch them and insert them into a "temporary table":

    create or replace function load_records ()
    returns void as $$
    BEGIN
        -- start a fresh counter for the row numbers
        drop sequence if exists temp_seq;
        create temp sequence temp_seq;

        -- materialize the whole result set once, numbering every row
        insert into tmp_table
        SELECT linea.*
        FROM (
            select nextval('temp_seq') as ROWNUM, *
            from table1 t1
            join table2 t2 on (t2.fieldpk = t1.fieldpk)
            join table3 t3 on (t3.fieldpk = t2.fieldpk)
        ) linea;
    END;
    $$ language plpgsql;
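
    The tmp_table that the function fills isn't defined in the answer. A minimal sketch, assuming its first column holds the sequence value (named counterrow, as used in the pagination query below) and a placeholder column standing in for the real joined data, plus an index to keep the range scans cheap:

    create temp table tmp_table (
        counterrow bigint,
        payload    text   -- stand-in for the columns coming from table1/table2/table3
    );

    create index tmp_table_counterrow_idx on tmp_table (counterrow);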
    

    After that, I can paginate without counting rows each time, using the assigned sequence value instead:

    select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000
    

    From the Java side, I implemented this pagination through partial fetching with a lazy list, i.e. a list that extends AbstractList and implements the get() method. The get() method uses a data access interface to fetch the next chunk of data, releasing the previous one from the heap:

    @Override
    public E get(int index) {
        // when the requested index falls outside the chunk currently buffered,
        // drop the old buffer and fetch the next chunk through the DAO
        if (bufferParcial.size() <= (index - lastIndexRoulette)) {
            lastIndexRoulette = index;
            bufferParcial = new ArrayList<E>();
            bufferParcial.addAll(daoInterface.getBufferParcial());
            if (bufferParcial.isEmpty()) {
                return null;
            }
        }
        // translate the global index into an offset within the current buffer
        return bufferParcial.get(index - lastIndexRoulette);
    }
    

    On the other hand, the data access interface uses the paginating query above and exposes a method that iterates progressively, fetching 25,000 records at a time until the whole set has been consumed.

    Results for this approach can be seen here: http://www.arquitecturaysoftware.co/2013/10/laboratorio-1-iterar-millones-de.html

  • 2021-01-29 21:26

    You might want a computed index.

    Let's create a table:

    create table sales(day date, amount real);
    

    And fill it with some random stuff:

    insert into sales
        select current_date + s.a as day, random()*100 as amount
        from generate_series(1,20) as s(a);
    

    Index it by day, nothing special here:

    create index sales_by_day on sales(day);
    

    Create a row position function. There are other approaches, this one is the simplest:

    create or replace function sales_pos (date) returns bigint 
       as 'select count(day) from sales where day <= $1;' 
       language sql immutable;
    

    Check if it works (don't call it like this on large datasets though):

    select sales_pos(day), day, amount from sales;
    
         sales_pos |    day     |  amount  
        -----------+------------+----------
                 1 | 2011-07-08 |  41.6135
                 2 | 2011-07-09 |  19.0663
                 3 | 2011-07-10 |  12.3715
        ..................
    

    Now the tricky part: add another index computed on the sales_pos function values:

    create index sales_by_pos on sales using btree(sales_pos(day));
    

    Here is how you use it. 5 is your "offset", 10 is the "limit":

    select * from sales where sales_pos(day) >= 5 and sales_pos(day) < 5+10;
    
            day     | amount  
        ------------+---------
         2011-07-12 | 94.3042
         2011-07-13 | 12.9532
         2011-07-14 | 74.7261
        ...............
    

    It is fast, because when you call it like this, Postgres uses precalculated values from the index:

    explain select * from sales 
      where sales_pos(day) >= 5 and sales_pos(day) < 5+10;
    
                                        QUERY PLAN                                
        --------------------------------------------------------------------------
         Index Scan using sales_by_pos on sales  (cost=0.50..8.77 rows=1 width=8)
           Index Cond: ((sales_pos(day) >= 5) AND (sales_pos(day) < 15))
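
    One caveat, not from the original answer: sales_pos() is declared immutable but really depends on the table's contents, so the positions stored in sales_by_pos go stale as rows are inserted or deleted. Rebuilding the index re-evaluates the expression for every row:

    REINDEX INDEX sales_by_pos;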
    

    Hope it helps.

  • 2021-01-29 21:31

    "It seems that optimizing for large OFFSETs (especially in pagination use-cases) isn't that unusual."

    It seems a little unusual to me. Most people, most of the time, don't seem to skim through very many pages. It's something I'd support, but wouldn't work hard to optimize.

    But anyway . . .

    Since your application code knows which ordered values it's already seen, it should be able to reduce the result set and reduce the offset by excluding those values in the WHERE clause. Assuming you order by a single column and it's sorted ascending, your app code can store the last value on the page, then add AND your-ordered-column-name > last-value-seen to the WHERE clause in some appropriate way.
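
    A minimal sketch of that approach; the column name created_at and the placeholder :last_value_seen are illustrative, not from the original question:

    -- first page
    SELECT *
    FROM massive_table
    ORDER BY created_at ASC
    LIMIT 50;

    -- later pages: remember the last created_at on the previous page
    -- and filter past it instead of using OFFSET
    SELECT *
    FROM massive_table
    WHERE created_at > :last_value_seen
    ORDER BY created_at ASC
    LIMIT 50;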

  • 2021-01-29 21:38

    Instead of using an OFFSET, a very efficient trick is to use a temporary table:

    CREATE TEMPORARY TABLE just_index AS
    SELECT ROW_NUMBER() OVER (ORDER BY myID) AS row_number, myID
    FROM mytable;
    

    For 10,000,000 rows it takes about 10 seconds to create. Then, when you want to SELECT or UPDATE your table, you simply run:

    SELECT *
    FROM mytable
    INNER JOIN (
        SELECT just_index.myID
        FROM just_index
        WHERE row_number >= *your offset*
        LIMIT 1000000
    ) indexes ON mytable.myID = indexes.myID
    

    Filtering mytable through just_index was more efficient (in my case) with an INNER JOIN than with a WHERE myID IN (SELECT ...).
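
    For comparison, a sketch of the IN-based variant the author found slower (reusing the same placeholder offset):

    SELECT *
    FROM mytable
    WHERE myID IN (
        SELECT just_index.myID
        FROM just_index
        WHERE row_number >= *your offset*
        LIMIT 1000000
    );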

    This way you don't have to store the last myID value; you simply replace the offset with a WHERE clause that uses indexes.
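
    One detail the answer doesn't spell out: a table created with CREATE TABLE AS has no indexes, so for the WHERE row_number >= ... filter to actually use one you have to add it yourself (the index name here is made up):

    CREATE INDEX just_index_row_number_idx ON just_index (row_number);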
