> I am trying to debug a query in PostgreSQL that I've built to bucket market data in time buckets in arbitrary time intervals. Here is my table definition:
Correctness first: I suspect a bug in your query:

```sql
LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
                                AND ohlcv.time_close < g.end_time
```

Unlike my referenced answer, you join on a time interval: `(time_open, time_close]`. The way you do it excludes rows where the interval crosses bucket borders; only intervals fully contained in a single bucket count. I don't think that's intended?
A simple fix would be to decide bucket membership based on `time_open` (or `time_close`) alone, as in the sketch below. If you want to keep working with both, you have to define exactly how to deal with intervals that overlap multiple buckets.
Also, you are looking for `max(high)` per bucket, which is different in nature from `count(*)` in my referenced answer.
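For illustration, a minimal sketch of that fix, deciding bucket membership by `time_open` alone and taking `max(high)` per bucket. The bucket generation with `generate_series()` and the hourly step are my assumptions, since that part of your query isn't shown; the symbol filters sit in the `ON` clause so empty buckets survive the `LEFT JOIN`:

```sql
-- Sketch: bucket membership by time_open alone; hourly buckets assumed.
SELECT g.start_time, max(ohlcv.high) AS max_high
FROM   generate_series(now() - interval '5 months'
                     , now()
                     , interval '1 hour') AS g(start_time)
LEFT   JOIN historical_ohlcv ohlcv
          ON ohlcv.time_open >= g.start_time
         AND ohlcv.time_open <  g.start_time + interval '1 hour'  -- was: time_close < g.end_time
         AND ohlcv.exchange_symbol = 'BINANCE'
         AND ohlcv.symbol_id = 'ETHBTC'
GROUP  BY g.start_time
ORDER  BY g.start_time;
```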
And your buckets are simple intervals per hour? Then we can radically simplify. Working with just `time_open`:

```sql
SELECT date_trunc('hour', time_open) AS hour, max(high) AS max_high
FROM   historical_ohlcv
WHERE  exchange_symbol = 'BINANCE'
AND    symbol_id = 'ETHBTC'
AND    time_open >= now() - interval '5 months'  -- frame_start
AND    time_open <  now()                        -- frame_end
GROUP  BY 1
ORDER  BY 1;
```
It's hard to talk about further performance optimization while the basics are unclear, and we'd need more information:
- Are `WHERE` conditions variable?
- How many distinct values in `exchange_symbol` and `symbol_id`? (A quick check is sketched below the list.)
- Avg. row size? What do you get for:

  ```sql
  SELECT avg(pg_column_size(t)) FROM historical_ohlcv t TABLESAMPLE SYSTEM (0.1);
  ```

- Is the table read-only?
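A minimal sketch for the distinct-value check mentioned above, using the column names from your query:

```sql
-- Distinct counts for the two filter columns (can be slow on a big table).
SELECT count(DISTINCT exchange_symbol) AS distinct_exchanges
     , count(DISTINCT symbol_id)       AS distinct_symbols
FROM   historical_ohlcv;
```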
Assuming you always filter on `exchange_symbol` and `symbol_id`, values are variable, and your table is read-only or autovacuum can keep up with the write load (so we can hope for index-only scans), you would best have a multicolumn index on `(exchange_symbol, symbol_id, time_open, high DESC)` to support this query. Index columns in this order.
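In SQL, that would be something like this (the index name is my placeholder):

```sql
-- Multicolumn index to allow index-only scans for the query above.
CREATE INDEX historical_ohlcv_query_idx
ON historical_ohlcv (exchange_symbol, symbol_id, time_open, high DESC);
```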
Depending on data distribution and other details, a `LEFT JOIN LATERAL` solution might be another option.
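A minimal sketch of what that might look like, assuming the same hourly buckets, frame, and filters as above; the lateral subquery fetches the top `high` per bucket with `ORDER BY ... LIMIT 1`:

```sql
-- Sketch: one index-assisted lookup per bucket via LATERAL.
SELECT g.hour, o.high AS max_high
FROM   generate_series(date_trunc('hour', now()) - interval '5 months'
                     , date_trunc('hour', now())
                     , interval '1 hour') AS g(hour)
LEFT   JOIN LATERAL (
   SELECT high
   FROM   historical_ohlcv
   WHERE  exchange_symbol = 'BINANCE'
   AND    symbol_id = 'ETHBTC'
   AND    time_open >= g.hour
   AND    time_open <  g.hour + interval '1 hour'
   ORDER  BY high DESC NULLS LAST
   LIMIT  1
   ) o ON true   -- keeps empty buckets with NULL max_high
ORDER  BY g.hour;
```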
Aside from all that, your `EXPLAIN` plan exhibits some very bad estimates.
Are you using a current version of Postgres? You may have to work on your server configuration, or at least set higher statistics targets on relevant columns and more aggressive autovacuum settings for the big table, as sketched below.
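For illustration, both knobs in SQL; the numbers are placeholders to tune for your data, not recommendations:

```sql
-- Raise the statistics target for a column with bad estimates (default 100).
ALTER TABLE historical_ohlcv ALTER COLUMN time_open SET STATISTICS 1000;

-- More aggressive autovacuum / autoanalyze for the big table.
ALTER TABLE historical_ohlcv SET (
   autovacuum_vacuum_scale_factor  = 0.02
 , autovacuum_analyze_scale_factor = 0.01
);

ANALYZE historical_ohlcv;  -- refresh statistics afterwards
```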