Slow LEFT JOIN on CTE with time intervals

予麋鹿 2021-01-21 17:32

I am trying to debug a query in PostgreSQL that I've built to bucket market data into arbitrary time intervals. Here is my table definition:

1 Answer
  •  梦毁少年i
    2021-01-21 18:05

    Correctness first: I suspect a bug in your query:

     LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
                                     AND ohlcv.time_close < g.end_time
    

    Unlike my referenced answer, you join on a time interval: (time_open, time_close]. The way you do it excludes rows in the table where the interval crosses bucket borders. Only intervals fully contained in a single bucket count. I don't think that's intended?

    A simple fix would be to decide bucket membership based on time_open (or time_close) alone. If you want to keep working with both, you have to define exactly how to deal with intervals overlapping with multiple buckets.
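    For example, bucket membership could be decided on time_open alone. A sketch of the adjusted join, assuming g is the generate_series() bucket CTE from the original query:

    ```sql
    -- Sketch: bucket by time_open alone, so every row lands in exactly one bucket,
    -- even when its (time_open, time_close) interval crosses a bucket border.
    LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
                                    AND ohlcv.time_open <  g.end_time
    ```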

    Also, you are looking for max(high) per bucket, which is different in nature from count(*) in my referenced answer.

    And your buckets are simple intervals per hour?

    Then we can radically simplify. Working with just time_open:

    SELECT date_trunc('hour', time_open) AS hour, max(high) AS max_high
    FROM   historical_ohlcv
    WHERE  exchange_symbol = 'BINANCE'
    AND    symbol_id = 'ETHBTC'
    AND    time_open >= now() - interval '5 months'  -- frame_start
    AND    time_open <  now()                        -- frame_end
    GROUP  BY 1
    ORDER  BY 1;
    

    Related:

    • Resample on time series data

    It's hard to talk about further performance optimization while basics are unclear. And we'd need more information.

    • Are WHERE conditions variable?
    • How many distinct values in exchange_symbol and symbol_id?
    • Avg. row size? What do you get for:

    SELECT avg(pg_column_size(t)) FROM historical_ohlcv t TABLESAMPLE SYSTEM (0.1);
    

    Is the table read-only?

    Assuming you always filter on exchange_symbol and symbol_id, that values are variable, and that your table is read-only or autovacuum can keep up with the write load (so we can hope for index-only scans), you would best have a multicolumn index on (exchange_symbol, symbol_id, time_open, high DESC) to support this query. Index columns in this order. Related:

    • Multicolumn index and performance
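    A sketch of that index, with table and column names taken from the query above (the index name is made up):

    ```sql
    CREATE INDEX historical_ohlcv_special_idx
    ON historical_ohlcv (exchange_symbol, symbol_id, time_open, high DESC);
    ```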

    Depending on data distribution and other details a LEFT JOIN LATERAL solution might be another option. Related:

    • How to find an average of values for time intervals in postgres
    • Optimize GROUP BY query to retrieve latest record per user
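    A sketch of such a LATERAL query, assuming hourly buckets built with generate_series() and the constants from the query above:

    ```sql
    -- Sketch: one correlated subquery per bucket; with the multicolumn index
    -- above, each subquery can be resolved with a cheap index (only) scan.
    SELECT g.start_time, o.max_high
    FROM   generate_series(now() - interval '5 months'
                         , now() - interval '1 hour'
                         , interval '1 hour') AS g(start_time)
    LEFT   JOIN LATERAL (
       SELECT max(high) AS max_high
       FROM   historical_ohlcv
       WHERE  exchange_symbol = 'BINANCE'
       AND    symbol_id = 'ETHBTC'
       AND    time_open >= g.start_time
       AND    time_open <  g.start_time + interval '1 hour'
       ) o ON true
    ORDER  BY 1;
    ```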

    Aside from all that, your EXPLAIN plan exhibits some very bad estimates:

    • https://explain.depesz.com/s/E5yI

    Are you using a current version of Postgres? You may have to work on your server configuration - or at least set higher statistics targets on relevant columns and more aggressive autovacuum settings for the big table. Related:

    • Keep PostgreSQL from sometimes choosing a bad query plan
    • Aggressive Autovacuum on PostgreSQL
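    For illustration, raising statistics targets and per-table autovacuum settings might look like this. The numbers are example values only, to be tuned to your workload:

    ```sql
    -- Higher statistics target for a column with a skewed distribution
    -- (default is 100); takes effect at the next ANALYZE:
    ALTER TABLE historical_ohlcv ALTER COLUMN time_open SET STATISTICS 1000;
    ANALYZE historical_ohlcv;

    -- More aggressive autovacuum for this one big table:
    ALTER TABLE historical_ohlcv SET (autovacuum_vacuum_scale_factor  = 0.01
                                    , autovacuum_analyze_scale_factor = 0.005);
    ```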
