Best performance in sampling a repeated value from a grouped column

一向 2021-02-13 02:02

This question is about getting the functionality of first_value() by using another function or a workaround.

It is also about "little gain in performance" in big tables. To use eg

2 Answers
  • 2021-02-13 02:27

    Not an official source, but some thoughts on a question that I perceive as rather generic:

    In general, aggregators need to process all matching rows. From your question text, you seem to target aggregators that try to identify specific values (max, min, first, last, n-th, etc.). Those could benefit from data structures that maintain the proper values for one specific aggregator of that kind. Then "selecting" that value can be sped up drastically.
    E.g. some databases keep track of the max and min values of columns.
    You can view this support as highly specialised internal indexes that are maintained by the system itself and are not under (direct) control of a user.

    Now, PostgreSQL focuses more on support that helps to improve queries in general, not just special cases. So it avoids adding effort for speeding up special cases that do not obviously benefit a broad range of use cases.

    Back to speeding up sample value aggregators.

    In the general case, aggregators have to process all rows, and there is no general strategy that lets aggregators identifying specific values (the sample kind of aggregators, for now) short-circuit that requirement. So any reformulation of a query that does not reduce the set of rows to be processed will take a similar time to complete.

    For speeding up such queries beyond processing all rows, you will need a supporting data structure. With databases, this is usually provided in the form of an index.

    You also could benefit from special execution operations that allow reducing the number of rows to be read.

    With PostgreSQL you have the capability of providing your own index implementation. So you could add an implementation that best supports the special kind of aggregator you are interested in. (At least for cases where you need to run such queries often.)

    Also, execution operations like index-only scans or lazy evaluation with recursive queries may allow writing a specific query in a way that speeds it up compared to "straight" coding.
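
    A minimal sketch of that idea, assuming a table t(x, y) and a hypothetical small table xs that holds one row per distinct x (neither xs nor the index name t_x_y_idx comes from the question). With a multicolumn index, PostgreSQL can fetch one value per group with a single index probe each instead of aggregating every row:

    ```sql
    -- Assumed schema: t(x, y); xs(x) is a hypothetical table with one row per distinct x.
    CREATE INDEX t_x_y_idx ON t (x, y);

    -- One index probe per group: the smallest y for each x, read from the index
    -- (possibly as an index-only scan if the visibility map is up to date).
    SELECT xs.x, s.y
    FROM   xs
    CROSS  JOIN LATERAL (
       SELECT t.y
       FROM   t
       WHERE  t.x = xs.x
       ORDER  BY t.y
       LIMIT  1
       ) s;
    ```

    Whether this beats a plain GROUP BY depends mostly on the number of rows per x, so compare both forms with EXPLAIN (ANALYZE, BUFFERS) on real data.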

    If your question is aimed more at general approaches, you might be better off consulting researchers on such topics, as that is beyond anything SO is intended to provide.

    If you have a specific query (or set of queries) that needs to be improved, asking explicit questions about those might allow the community to help identify potential optimizations. Trying to optimize without a good base of measurements leads nowhere, as what yields a perfect result in one case might kill performance in another.

  • 2021-02-13 02:32

    If you really don't care which member of the set is picked, and if you don't need to compute additional aggregates (like count), there is a fast and simple alternative with DISTINCT ON (x) without ORDER BY:

    SELECT DISTINCT ON (x) x, y, z FROM t;
    

    x, y and z are from the same row, but the row is an arbitrary pick from each set of rows with the same x.
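
    For contrast, a sketch of the deterministic variant, assuming "first" means the row with the smallest y per x. Adding ORDER BY makes the pick well-defined, at the cost of a sort or a matching index; the leading ORDER BY expressions have to match the DISTINCT ON expressions:

    ```sql
    -- Picks the row with the smallest y for each x (ties are still broken arbitrarily).
    SELECT DISTINCT ON (x) x, y, z
    FROM   t
    ORDER  BY x, y;
    ```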

    If you need a count anyway, your options with regard to performance are limited since the whole table has to be read in either case. Still, you can combine it with window functions in the same SELECT:

    SELECT DISTINCT ON (x) x, y, z, count(*) OVER (PARTITION BY x) AS x_count FROM t;
    

    Consider the sequence of events in a SELECT query:

    • Best way to get result count before LIMIT was applied

    Depending on requirements, there may be faster ways to get counts:

    • Fast way to discover the row count of a table in PostgreSQL

    In combination with GROUP BY the only realistic option I see to gain some performance is the first_last_agg extension. But don't expect much.
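
    To illustrate the kind of query this enables, here is a hedged sketch that uses a pure-SQL stand-in rather than the extension itself (the extension is implemented in C). The names first_sketch_sfunc and first_sketch are hypothetical and chosen only for this example:

    ```sql
    -- Hypothetical pure-SQL stand-in for a "first" aggregate (not the extension's code).
    -- The transition function keeps the state it already has and ignores later inputs.
    CREATE FUNCTION first_sketch_sfunc(anyelement, anyelement)
      RETURNS anyelement
      LANGUAGE sql IMMUTABLE STRICT AS
    'SELECT $1';

    CREATE AGGREGATE first_sketch(anyelement) (
       SFUNC = first_sketch_sfunc
     , STYPE = anyelement
    );

    -- Because the transition function is STRICT, NULL inputs are skipped: each column
    -- returns its first non-NULL value, so y and z may come from different rows
    -- when NULLs are present.
    SELECT x, first_sketch(y) AS y, first_sketch(z) AS z, count(*) AS x_count
    FROM   t
    GROUP  BY x;
    ```

    Without an ORDER BY inside the aggregate call, "first" just means whichever row the executor happens to feed in first for each group.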

    For other use cases without count (including the simple case at the top), there are faster solutions, depending on your exact use case. In particular, to get the "first" or "last" value of each set, emulate a loose index scan (like @Mihai commented):

    • Optimize GROUP BY query to retrieve latest record per user
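
    A minimal sketch of emulating a loose index scan with a recursive CTE, assuming an index on t (x) (the index name t_x_idx is made up for illustration). Each step jumps to the next distinct x through the index instead of reading every row:

    ```sql
    CREATE INDEX t_x_idx ON t (x);

    WITH RECURSIVE cte AS (
       (  -- parentheses are required to use ORDER BY / LIMIT in a UNION branch
       SELECT x, y, z
       FROM   t
       ORDER  BY x
       LIMIT  1
       )
       UNION ALL
       SELECT l.x, l.y, l.z
       FROM   cte c
       CROSS  JOIN LATERAL (
          -- next distinct x after the one found in the previous step
          SELECT t.x, t.y, t.z
          FROM   t
          WHERE  t.x > c.x
          ORDER  BY t.x
          LIMIT  1
          ) l
       )
    SELECT x, y, z
    FROM   cte;
    ```

    This pays off when there are few distinct values of x and many rows per value. Rows where x is NULL are not visited by this sketch, and which row represents each x is still an arbitrary pick unless the ORDER BY clauses (and the index) are extended, e.g. to (x, y).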