PostgreSQL - fetch the row which has the Max value for a column

感动是毒 2020-11-28 18:38

I'm dealing with a Postgres table (called "lives") that contains records with columns for time_stamp, usr_id, transaction_id, and lives_remaining. I need a query that wil

9 answers
  • 2020-11-28 19:13

    Another solution you might find useful.

    SELECT t.*
    FROM
        (SELECT
            *,
            ROW_NUMBER() OVER(PARTITION BY usr_id ORDER BY time_stamp DESC) as r
        FROM lives) as t
    WHERE t.r = 1
    
  • 2020-11-28 19:16

    I think you've got one major problem here: there's no monotonically increasing "counter" to guarantee that a given row has happened later in time than another. Take this example:

    timestamp   lives_remaining   user_id   trans_id
    10:00       4                 3         5
    10:00       5                 3         6
    10:00       3                 3         1
    10:00       2                 3         2
    

    You cannot determine from this data which is the most recent entry. Is it the second one or the last one? There is no sort or max() function you can apply to any of this data to give you the correct answer.

    Increasing the resolution of the timestamp would be a huge help. Since the database engine serializes requests, with sufficient resolution you can guarantee that no two timestamps will be the same.

    Alternatively, use a trans_id that won't roll over for a very, very long time. Having a trans_id that rolls over means you can't tell (for the same timestamp) whether trans_id 6 is more recent than trans_id 1 unless you do some complicated math.
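
    As a minimal sketch of the point above, the snippet below loads the four sample rows and shows that the timestamp alone leaves a four-way tie, while adding trans_id as a secondary sort key gives a single answer. It assumes trans_id never rolls over, and it uses an in-memory SQLite database as a stand-in for Postgres so that it runs without a server.

```python
import sqlite3

# Assumption: SQLite stands in for Postgres so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lives ("
    "time_stamp TEXT, lives_remaining INTEGER, usr_id INTEGER, trans_id INTEGER)"
)
# The four rows from the example table above.
rows = [("10:00", 4, 3, 5), ("10:00", 5, 3, 6), ("10:00", 3, 3, 1), ("10:00", 2, 3, 2)]
conn.executemany("INSERT INTO lives VALUES (?, ?, ?, ?)", rows)

# How many rows share the maximum timestamp for user 3?
tied = conn.execute(
    "SELECT COUNT(*) FROM lives WHERE usr_id = 3 AND time_stamp = "
    "(SELECT MAX(time_stamp) FROM lives WHERE usr_id = 3)"
).fetchone()[0]

# Breaking the tie with trans_id only makes sense if trans_id never rolls over.
latest = conn.execute(
    "SELECT lives_remaining, trans_id FROM lives WHERE usr_id = 3 "
    "ORDER BY time_stamp DESC, trans_id DESC LIMIT 1"
).fetchone()
```

    Here all four rows tie on the timestamp, and the secondary key picks the row with trans_id 6.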

  • 2020-11-28 19:21

    The figures below were measured on a table with 158k pseudo-random rows (usr_id uniformly distributed between 0 and 10k, trans_id uniformly distributed between 0 and 30).

    By "query cost" I mean the cost estimate of Postgres' cost-based optimizer (with Postgres' default xxx_cost values), which is a weighted estimate of the required I/O and CPU resources. You can obtain it by firing up pgAdminIII and running "Query/Explain (F7)" on the query with "Query/Explain options" set to "Analyze".

    • Quassnoy's query has a cost estimate of 745k (!), and completes in 1.3 seconds (given a compound index on (usr_id, trans_id, time_stamp))
    • Bill's query has a cost estimate of 93k, and completes in 2.9 seconds (given a compound index on (usr_id, trans_id))
    • Query #1 below has a cost estimate of 16k, and completes in 800ms (given a compound index on (usr_id, trans_id, time_stamp))
    • Query #2 below has a cost estimate of 14k, and completes in 800ms (given a compound function index on (usr_id, EXTRACT(EPOCH FROM time_stamp), trans_id))
      • this is Postgres-specific
    • Query #3 below (Postgres 8.4+) has a cost estimate and completion time comparable to (or better than) query #2 (given a compound index on (usr_id, time_stamp, trans_id)); it has the advantage of scanning the lives table only once and, should you temporarily increase (if needed) work_mem to accommodate the sort in memory, it will be by far the fastest of all queries.

    All times above include retrieval of the full 10k rows result-set.

    Your goal is a minimal cost estimate and a minimal query execution time, with an emphasis on the estimated cost. Query execution time can depend significantly on runtime conditions (e.g. whether the relevant rows are already fully cached in memory or not), whereas the cost estimate does not. On the other hand, keep in mind that the cost estimate is exactly that, an estimate.

    The best query execution time is obtained when running on a dedicated database without load (e.g. playing with pgAdminIII on a development PC.) Query time will vary in production based on actual machine load/data access spread. When one query appears slightly faster (<20%) than the other but has a much higher cost, it will generally be wiser to choose the one with higher execution time but lower cost.

    When you expect that there will be no competition for memory on your production machine at the time the query is run (e.g. the RDBMS cache and filesystem cache won't be thrashed by concurrent queries and/or filesystem activity) then the query time you obtained in standalone (e.g. pgAdminIII on a development PC) mode will be representative. If there is contention on the production system, query time will degrade proportionally to the estimated cost ratio, as the query with the lower cost does not rely as much on cache whereas the query with higher cost will revisit the same data over and over (triggering additional I/O in the absence of a stable cache), e.g.:

                  cost | time (dedicated machine) |     time (under load) |
    -------------------+--------------------------+-----------------------+
    some query A:   5k | (all data cached)  900ms | (less i/o)     1000ms |
    some query B:  50k | (all data cached)  900ms | (lots of i/o) 10000ms |
    

    Do not forget to run ANALYZE lives once after creating the necessary indices.
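
    A minimal sketch of that step: create the compound index, then refresh the planner statistics. The index name lives_usr_ts_trans_idx is hypothetical, and an in-memory SQLite database is used as a stand-in so the snippet runs anywhere; the CREATE INDEX and ANALYZE statements themselves are also valid Postgres syntax.

```python
import sqlite3

# Assumption: SQLite stands in for Postgres so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lives ("
    "time_stamp TEXT, lives_remaining INTEGER, usr_id INTEGER, trans_id INTEGER)"
)
conn.executemany(
    "INSERT INTO lives VALUES (?, ?, ?, ?)",
    [("10:00", 4, 3, 5), ("10:01", 3, 3, 6)],
)

# Compound index matching the column order used for query #3.
# The index name is hypothetical.
conn.execute(
    "CREATE INDEX lives_usr_ts_trans_idx ON lives (usr_id, time_stamp, trans_id)"
)

# Refresh statistics so the optimizer's cost estimates reflect the new index.
conn.execute("ANALYZE lives")

index_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
```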


    Query #1

    -- incrementally narrow down the result set via inner joins
    --  the CBO may elect to perform one full index scan combined
    --  with cascading index lookups, or as hash aggregates terminated
    --  by one nested index lookup into lives - on my machine
    --  the latter query plan was selected given my memory settings and
    --  histogram
    SELECT
      l1.*
     FROM
      lives AS l1
     INNER JOIN (
        SELECT
          usr_id,
          MAX(time_stamp) AS time_stamp_max
         FROM
          lives
         GROUP BY
          usr_id
      ) AS l2
     ON
      l1.usr_id     = l2.usr_id AND
      l1.time_stamp = l2.time_stamp_max
     INNER JOIN (
        SELECT
          usr_id,
          time_stamp,
          MAX(trans_id) AS trans_max
         FROM
          lives
         GROUP BY
          usr_id, time_stamp
      ) AS l3
     ON
      l1.usr_id     = l3.usr_id AND
      l1.time_stamp = l3.time_stamp AND
      l1.trans_id   = l3.trans_max
    

    Query #2

    -- cheat to obtain a max of the (time_stamp, trans_id) tuple in one pass
    -- this results in a single table scan and one nested index lookup into lives,
    --  by far the least I/O intensive operation even in case of great scarcity
    --  of memory (least reliant on cache for the best performance)
    SELECT
      l1.*
     FROM
      lives AS l1
     INNER JOIN (
       SELECT
         usr_id,
         MAX(ARRAY[EXTRACT(EPOCH FROM time_stamp),trans_id])
           AS compound_time_stamp
        FROM
         lives
        GROUP BY
         usr_id
      ) AS l2
    ON
      l1.usr_id = l2.usr_id AND
      EXTRACT(EPOCH FROM l1.time_stamp) = l2.compound_time_stamp[1] AND
      l1.trans_id = l2.compound_time_stamp[2]
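
    The idea behind Query #2, taking the maximum of a (time_stamp, trans_id) pair per user in a single pass, can be sketched outside the database. This is only an illustration of the comparison logic, not of how Postgres executes the query; the sample rows and the function name latest_per_user are made up for the example.

```python
from datetime import datetime

# Hypothetical sample data: (time_stamp, lives_remaining, usr_id, trans_id)
rows = [
    (datetime(2020, 1, 1, 10, 0), 4, 3, 5),
    (datetime(2020, 1, 1, 10, 0), 5, 3, 6),
    (datetime(2020, 1, 1, 9, 0), 9, 3, 7),
    (datetime(2020, 1, 1, 8, 0), 1, 4, 2),
]


def latest_per_user(rows):
    """Return {usr_id: row} keeping the row with the largest (time_stamp, trans_id)."""
    best = {}
    for row in rows:
        time_stamp, _, usr_id, trans_id = row
        # Tuples compare element by element, like the ARRAY[...] in Query #2:
        # time_stamp decides first, trans_id only breaks ties.
        key = (time_stamp, trans_id)
        if usr_id not in best or key > best[usr_id][0]:
            best[usr_id] = (key, row)
    return {usr_id: row for usr_id, (_, row) in best.items()}


result = latest_per_user(rows)
```

    For user 3 the two 10:00 rows tie on time_stamp, so the row with trans_id 6 wins even though the 09:00 row has a larger trans_id.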
    

    2013/01/29 update

    Finally, as of version 8.4, Postgres supports window functions, meaning you can write something as simple and efficient as:

    Query #3

    -- use Window Functions
    -- performs a SINGLE scan of the table
    SELECT DISTINCT ON (usr_id)
      last_value(time_stamp) OVER wnd,
      last_value(lives_remaining) OVER wnd,
      usr_id,
      last_value(trans_id) OVER wnd
     FROM lives
     WINDOW wnd AS (
       PARTITION BY usr_id ORDER BY time_stamp, trans_id
       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
     );
    
  • 2020-11-28 19:27

    PostgreSQL has a Postgres-specific option called DISTINCT ON. It is not new in 9.5; it has been available in much older releases as well.

    SELECT DISTINCT ON (location) location, time, report
        FROM weather_reports
        ORDER BY location, time DESC;
    

    It eliminates duplicate rows and keeps only the first row of each group, as defined by the ORDER BY clause.

    See the official documentation.
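
    To make the "first row of each group" behaviour concrete, here is a small sketch that mimics DISTINCT ON (location) with ORDER BY location, time DESC in plain Python. The sample reports are invented, and the integer time values are an assumption made to keep the sort key simple.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample data: (location, time, report)
reports = [
    ("Oslo", 2, "rain"),
    ("Oslo", 5, "snow"),
    ("Lima", 3, "sun"),
    ("Lima", 1, "fog"),
]

# ORDER BY location, time DESC
ordered = sorted(reports, key=lambda r: (r[0], -r[1]))

# DISTINCT ON (location): keep only the first row of each location group.
latest = [next(group) for _, group in groupby(ordered, key=itemgetter(0))]
```

    The ordering is what decides which row survives: without the time DESC part, the row kept for each location would be arbitrary.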

  • 2020-11-28 19:29

    I would propose a clean version based on DISTINCT ON (see docs):

    SELECT DISTINCT ON (usr_id)
        time_stamp,
        lives_remaining,
        usr_id,
        trans_id
    FROM lives
    ORDER BY usr_id, time_stamp DESC, trans_id DESC;
    
  • 2020-11-28 19:29

    SELECT  l.*
    FROM    (
            SELECT DISTINCT usr_id
            FROM   lives
            ) lo, lives l
    WHERE   l.ctid = (
            SELECT ctid
            FROM   lives li
            WHERE  li.usr_id = lo.usr_id
            ORDER BY
              time_stamp DESC, trans_id DESC
            LIMIT 1
            )
    

    Creating an index on (usr_id, time_stamp, trans_id) will greatly improve this query.

    You should always, always have some kind of PRIMARY KEY in your tables.
