Efficient latest record query with PostgreSQL

深忆病人 2020-12-25 09:59

I need to do a big query, but I only want the latest records.

For a single entry I would probably do something like

SELECT * FROM table WHERE id = ?          

5 answers
  • 2020-12-25 10:13

    If you don't want to change your data model, you can use DISTINCT ON to fetch the newest record from table "b" for each entry in "a":

    SELECT DISTINCT ON (a.id) *
    FROM a
    INNER JOIN b ON a.id=b.id
    ORDER BY a.id, b.date DESC
    

    If you want to avoid a "sort" in the query, adding an index like this might help you, but I am not sure:

    CREATE INDEX b_id_date ON b (id, date DESC);
    
    SELECT DISTINCT ON (b.id) *
    FROM a
    INNER JOIN b ON a.id=b.id
    ORDER BY b.id, b.date DESC
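
    Whether the index really removes the sort can be checked rather than guessed. A minimal sketch, assuming the b_id_date index above has been created and the tables hold representative data:

    ```sql
    -- ANALYZE actually executes the query, so run this against a test copy if the tables are large.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT DISTINCT ON (b.id) *
    FROM a
    INNER JOIN b ON a.id = b.id
    ORDER BY b.id, b.date DESC;
    ```

    If the plan shows an index scan on b_id_date and no separate Sort node, the planner is reading "b" in index order. If a Sort node is still there, the index is not being used for ordering on this data.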
    

    Alternatively, if you want to sort records from table "a" some way:

    SELECT DISTINCT ON (sort_column, a.id) *
    FROM a
    INNER JOIN b ON a.id=b.id
    ORDER BY sort_column, a.id, b.date DESC
    

    Alternative approaches

    However, all of the above queries still need to read all referenced rows from table "b", so if you have lots of data, it might still just be too slow.

    You could create a new table, which only holds the newest "b" record for each a.id -- or even move those columns into the "a" table itself.
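
    As a rough sketch of the first option, the extra table could be filled once from the same DISTINCT ON pattern. The name b_latest is made up for illustration, b.id is assumed to never be NULL, and something (a trigger or a periodic rebuild) would still have to keep the table current:

    ```sql
    -- One-off snapshot of the newest "b" row per id; b_latest is a hypothetical name.
    CREATE TABLE b_latest AS
    SELECT DISTINCT ON (id) *
    FROM b
    ORDER BY id, date DESC;

    -- One row per id, so id can serve as the primary key for fast joins with "a".
    ALTER TABLE b_latest ADD PRIMARY KEY (id);
    ```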

  • 2020-12-25 10:16

    One method: create a small derived table, a_latest, containing the most recent update or insertion time for each id in table a. Table a_latest needs sufficient granularity to meet your specific query requirements. In your case it should be sufficient to use

    CREATE TABLE a_latest (
      id   INTEGER   NOT NULL,
      date TIMESTAMP NOT NULL,
      PRIMARY KEY (id, date)
    );
    

    Then use a query similar to the one suggested by najmeddine:

    SELECT a.*
    FROM a
    JOIN a_latest USING (id, date);
    

    The trick then is keeping a_latest up to date. Do this with a trigger on insertions and updates. Such a trigger is fairly easy to write in PL/pgSQL. I am happy to provide an example if you wish.

    The point here is that computation of the latest update time is taken care of during the updates themselves. This shifts more of the load away from the query.
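
    A minimal sketch of what such a trigger might look like, assuming table a has id and date columns and a_latest has the same two columns. The function and trigger names are made up for illustration. It ignores deletes, updates that move a date backwards, and races between concurrent writers, all of which a real version would have to handle:

    ```sql
    -- Hypothetical trigger function: keeps one row per id in a_latest.
    CREATE OR REPLACE FUNCTION a_latest_refresh() RETURNS trigger AS $$
    BEGIN
        -- Nothing to do if a_latest already holds an equal or newer date for this id.
        IF EXISTS (SELECT 1 FROM a_latest l
                   WHERE l.id = NEW.id AND l.date >= NEW.date) THEN
            RETURN NEW;
        END IF;

        -- Otherwise replace the stored entry with the new date.
        DELETE FROM a_latest WHERE id = NEW.id;
        INSERT INTO a_latest (id, date) VALUES (NEW.id, NEW.date);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    -- On PostgreSQL versions before 11, write EXECUTE PROCEDURE instead of EXECUTE FUNCTION.
    CREATE TRIGGER a_latest_refresh_trg
    AFTER INSERT OR UPDATE ON a
    FOR EACH ROW EXECUTE FUNCTION a_latest_refresh();
    ```

    The return value of an AFTER row trigger is ignored, so RETURN NEW is only there because the function has to return something.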

  • 2020-12-25 10:20

    If you have many rows per id, you definitely want a correlated subquery. It makes one index lookup per id, but this is faster than sorting the whole table.

    Something like:

    SELECT a.id,
    (SELECT max(t.date) FROM table t WHERE t.id = a.id) AS lastdate
    FROM table2 a;
    

    The 'table2' you will use is not the table you mention in your query above, because here you need a list of distinct ids for good performance. Since your ids are probably FKs into another table, use that one.
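
    The query above returns only the latest date. If the whole latest row is needed, one variation on the same idea (a different technique from the scalar subquery, and it needs PostgreSQL 9.3 or later) is a LATERAL join. This is a sketch with made-up names: devices stands in for the table of distinct ids and readings for the large history table:

    ```sql
    -- Hypothetical schema: devices(id) has one row per id, readings(id, date, ...) holds the history.
    -- The index lets each per-id lookup stop at the first (newest) entry.
    CREATE INDEX readings_id_date ON readings (id, date DESC);

    SELECT latest.*
    FROM devices d
    CROSS JOIN LATERAL (
        SELECT r.*
        FROM readings r
        WHERE r.id = d.id
        ORDER BY r.date DESC
        LIMIT 1
    ) AS latest;
    ```

    Ids that have no rows in readings drop out of the result. Using LEFT JOIN LATERAL ... ON true instead would keep them with NULLs.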

  • 2020-12-25 10:31

    This could be more efficient. The difference: the subquery b is executed only once, whereas your correlated subquery is executed for every row:

    SELECT *
    FROM table a
    JOIN (SELECT ID, max(date) AS maxDate
            FROM table
          GROUP BY ID) b
    ON a.ID = b.ID AND a.date = b.maxDate
    WHERE a.ID IN $LIST
    
  • 2020-12-25 10:32

    What do you think about this?

    SELECT * FROM (
       SELECT a.*, row_number() OVER (PARTITION BY a.id ORDER BY a.date DESC) r
       FROM table a WHERE a.ID IN $LIST
    ) ranked
    WHERE r = 1


    I have used it a lot in the past.
