Best performance in sampling repeated value from a grouped column

一向 2021-02-13 02:02

This question is about the functionality of first_value(), using another function or workaround.

It is also about "little gain in performance" in big tables. To use eg

2 Answers
  •  后悔当初
    2021-02-13 02:32

    If you really don't care which member of the set is picked, and if you don't need to compute additional aggregates (like count), there is a fast and simple alternative with DISTINCT ON (x) without ORDER BY:

    SELECT DISTINCT ON (x) x, y, z FROM t;
    

    x, y and z are from the same row, but the row is an arbitrary pick from each set of rows with the same x.
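    To make the difference concrete, here is a minimal sketch. The table definition and sample rows are assumptions for illustration, not taken from the question. Without ORDER BY the pick is arbitrary; adding ORDER BY makes it deterministic, at the cost of a sort or a matching index.

    ```sql
    -- Hypothetical sample table; column types are assumed for illustration.
    CREATE TEMP TABLE t (x int, y text, z int);
    INSERT INTO t VALUES (1, 'a', 10), (1, 'b', 20), (2, 'c', 30);

    -- Arbitrary pick: one row per x, whichever row the plan reaches first.
    SELECT DISTINCT ON (x) x, y, z FROM t;

    -- Deterministic pick: the row with the smallest z per x.
    -- The leading ORDER BY expressions must match the DISTINCT ON expressions.
    SELECT DISTINCT ON (x) x, y, z FROM t ORDER BY x, z;
    ```

    If any row per group will do, the first form avoids the extra sort step.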

    If you need a count anyway, your options with regard to performance are limited since the whole table has to be read in either case. Still, you can combine it with window functions in the same SELECT:

    SELECT DISTINCT ON (x) x, y, z, count(*) OVER (PARTITION BY x) AS x_count FROM t;
    

    Consider the sequence of events in a SELECT query:

    • Best way to get result count before LIMIT was applied
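    The point about the sequence of events can be checked with a small sketch: window functions are evaluated before DISTINCT ON is applied, so the count still covers every row of the partition even though only one row per x survives. The inline sample data is hypothetical.

    ```sql
    -- Inline sample data (hypothetical), so the query runs without any table.
    WITH t(x, y, z) AS (
       VALUES (1, 'a', 10), (1, 'b', 20), (2, 'c', 30)
    )
    SELECT DISTINCT ON (x)
           x, y, z
         , count(*) OVER (PARTITION BY x) AS x_count  -- computed over all rows of the partition
    FROM   t;
    -- x_count is 2 for x = 1 and 1 for x = 2, whichever row DISTINCT ON keeps.
    ```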

    Depending on requirements, there may be faster ways to get counts:

    • Fast way to discover the row count of a table in PostgreSQL

    In combination with GROUP BY the only realistic option I see to gain some performance is the first_last_agg extension. But don't expect much.
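    A rough sketch of what that could look like, assuming the third-party first_last_agg extension is installed on the server and that it exposes an aggregate named first(); treat the function name as something to verify against the extension's own documentation. Table t is the one used in the queries above.

    ```sql
    -- Assumes the first_last_agg extension is available on this server.
    CREATE EXTENSION IF NOT EXISTS first_last_agg;

    SELECT x
         , first(y) AS y        -- first value encountered per group (assumed aggregate name)
         , count(*) AS x_count
    FROM   t
    GROUP  BY x;
    ```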

    For other use cases without a count (including the simple case at the top), there are faster solutions, depending on your exact use case. In particular, to get the "first" or "last" value of each set, you can emulate a loose index scan (as @Mihai commented):

    • Optimize GROUP BY query to retrieve latest record per user
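    A loose index scan is commonly emulated in PostgreSQL with a recursive CTE that jumps from one distinct x to the next through an index. The following is a sketch of that general pattern, not necessarily the exact query from the linked answer. It assumes table t(x, y, z) from the queries above, that x is NOT NULL, and an index on x (the index name is hypothetical). It tends to pay off only when there are few distinct values of x relative to the number of rows.

    ```sql
    -- An index on x is what makes each step cheap; the index name is hypothetical.
    CREATE INDEX t_x_idx ON t (x);

    WITH RECURSIVE cte AS (
       (  -- Anchor: the row with the smallest x.
       SELECT x, y, z
       FROM   t
       ORDER  BY x
       LIMIT  1
       )
       UNION ALL
       -- Step: jump to the next greater x; stops when no such row exists.
       SELECT l.*
       FROM   cte c
       CROSS  JOIN LATERAL (
          SELECT x, y, z
          FROM   t
          WHERE  t.x > c.x
          ORDER  BY x
          LIMIT  1
          ) l
       )
    SELECT x, y, z FROM cte;
    ```

    As with DISTINCT ON without a full ORDER BY, which row represents each x is arbitrary among rows sharing that x; extend the index and both ORDER BY clauses (for example to x, z) to pick a specific one.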
