Best performance in sampling repeated value from a grouped column

一向 2021-02-13 02:02

This question is about the functionality of first_value(), using another function or workaround.

It is also about "little gain in performance" in big tables. To use eg

2 Answers
  •  被撕碎了的回忆
    2021-02-13 02:27

    Not an official source, but some thoughts on a question perceived as rather generic:

    In general, aggregators need to process all matching rows. From your question text, you might be targeting aggregators that try to identify specific values (max, min, first, last, n-th, etc.). Those could benefit from data structures that maintain the proper value for one specific aggregator of that kind. "Selecting" that value can then be sped up drastically.
    E.g. some databases keep track of the max and min values of columns.
    You can view this support as highly specialised internal indexes that are maintained by the system itself and are not under the (direct) control of a user.
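
    To make the idea of a system-maintained summary concrete, here is a minimal sketch in plain Python. It is not how any particular database implements this; the class `GroupedColumn` and its methods are hypothetical names invented for illustration. The point is only that a summary updated on every insert turns "scan all matching rows" into a single lookup.

```python
class GroupedColumn:
    """Toy table of (group, value) rows with per-group summaries.

    Hypothetical illustration only; not modelled on any real database engine.
    """

    def __init__(self):
        self.rows = []      # every row, in insertion order
        self.first = {}     # group -> first value inserted
        self.minimum = {}   # group -> smallest value inserted
        self.maximum = {}   # group -> largest value inserted

    def insert(self, group, value):
        self.rows.append((group, value))
        # Maintain the summaries at write time.
        self.first.setdefault(group, value)
        if group not in self.minimum or value < self.minimum[group]:
            self.minimum[group] = value
        if group not in self.maximum or value > self.maximum[group]:
            self.maximum[group] = value

    def max_by_scan(self, group):
        # What a generic aggregator does: visit every matching row.
        return max(v for g, v in self.rows if g == group)

    def max_by_summary(self, group):
        # What a maintained summary allows: one dictionary lookup.
        return self.maximum[group]


table = GroupedColumn()
for g, v in [("a", 3), ("a", 7), ("b", 5), ("a", 1)]:
    table.insert(g, v)
```

    The trade-off is that every write pays for the bookkeeping, and this sketch does not handle deletes or updates at all, which is where maintaining such summaries gets expensive.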

    PostgreSQL, by contrast, focuses on support that helps improve queries in general, not just special cases. Its developers therefore tend to avoid spending effort on special cases that do not obviously benefit a broad range of use cases.

    Back to speeding up sample value aggregators.

    In the general case aggregators have to process all rows, and there is no general strategy for short-circuiting that requirement for aggregators that try to identify specific values (call them sample-kind aggregators for now). It follows that any reformulation of a query that does not reduce the set of rows to be processed will take a similar time to complete.

    To speed up such queries beyond processing all rows, you will need a supporting data structure. In databases this is usually provided in the form of an index.
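
    As a small, hedged illustration of an index acting as that supporting data structure, the sketch below uses SQLite through Python's `sqlite3` module purely as a runnable stand-in; PostgreSQL's planner and its `EXPLAIN` output differ. The table `t`, its columns `grp` and `val`, and the index name `t_grp_val` are assumptions made up for the example.

```python
import sqlite3


def plan(conn, sql, params=()):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp INTEGER NOT NULL, val INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(g, v) for g in range(5) for v in range(200)],
)

query = "SELECT min(val) FROM t WHERE grp = ?"

# Without an index the engine has to look at the whole table.
plan_without_index = plan(conn, query, (3,))
result_without_index = conn.execute(query, (3,)).fetchone()[0]

# With a (grp, val) index the value can be found from the index alone.
conn.execute("CREATE INDEX t_grp_val ON t (grp, val)")
plan_with_index = plan(conn, query, (3,))
result_with_index = conn.execute(query, (3,)).fetchone()[0]

print(plan_without_index)
print(plan_with_index)
```

    Both runs return the same value; only the access path changes, which is the whole point of the index.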

    You could also benefit from special execution operations that reduce the number of rows to be read.

    With PostgreSQL you can provide your own index implementation. So you could add an implementation that best supports the particular kind of aggregator you are interested in. (At least for cases where you need to run such queries often.)

    Also, execution operations like index-only scans, or lazy evaluation with recursive queries, may allow you to write a specific query in a way that runs faster than the "straight" formulation.
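
    One way to read the remark about recursive queries is the pattern often called a "loose index scan": instead of aggregating over every row, a recursive query hops from one group key to the next through an index and fetches a single value per group. The sketch below is an assumption about what is meant, not something stated in the question, and it again runs on SQLite via Python's `sqlite3` as a stand-in for PostgreSQL. Table, column, and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp INTEGER NOT NULL, val TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(g, f"v{g}-{i}") for g in range(1, 4) for i in range(1000)],
)
conn.execute("CREATE INDEX t_grp_val ON t (grp, val)")

# "Straight" formulation: aggregate over all rows of every group.
straight = conn.execute(
    "SELECT grp, min(val) FROM t GROUP BY grp ORDER BY grp"
).fetchall()

# Recursive formulation: one index probe per group key, then one probe
# per group for its smallest value.
loose = conn.execute(
    """
    WITH RECURSIVE g(grp) AS (
        SELECT (SELECT grp FROM t ORDER BY grp LIMIT 1)
        UNION ALL
        SELECT (SELECT grp FROM t WHERE grp > g.grp ORDER BY grp LIMIT 1)
        FROM g
        WHERE g.grp IS NOT NULL
    )
    SELECT g.grp,
           (SELECT val FROM t WHERE t.grp = g.grp ORDER BY val LIMIT 1)
    FROM g
    WHERE g.grp IS NOT NULL
    ORDER BY g.grp
    """
).fetchall()

print(straight == loose)
```

    Whether this actually wins depends on the number of distinct groups relative to the number of rows; with many small groups the per-group probes can cost more than a single pass, so it needs to be measured on the real data.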

    If your question is aimed more at general approaches, you may be better off consulting researchers on such topics, since that goes beyond what SO is intended to provide.

    If you have a specific query (or set of queries) that needs to be improved, asking explicit questions about those might allow the community to help identify potential optimizations. Trying to optimize without a solid baseline of measurements leads nowhere, as what yields a perfect result in one case might kill performance in another.
