Slow LEFT JOIN on CTE with time intervals

予麋鹿 2021-01-21 17:32

I am trying to debug a query in PostgreSQL that I've built to bucket market data into arbitrary time intervals. Here is my table definition:

1 Answer
  •  梦毁少年i
    2021-01-21 18:05

    Correctness first: I suspect a bug in your query:

     LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
                                     AND ohlcv.time_close < g.end_time
    

    Unlike my referenced answer, you join on a time interval: (time_open, time_close]. The way you do it excludes rows in the table where the interval crosses bucket borders. Only intervals fully contained in a single bucket count. I don't think that's intended?

    A simple fix would be to decide bucket membership based on time_open (or time_close) alone. If you want to keep working with both, you have to define exactly how to deal with intervals overlapping with multiple buckets.
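    For example, bucket membership could be decided on time_open alone. A sketch of the adjusted join, assuming g is the generate_series() bucket CTE from the original query:

    ```sql
    -- Sketch: bucket by time_open alone, so every row lands in exactly one bucket,
    -- even when its (time_open, time_close) interval crosses a bucket border.
    LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
                                    AND ohlcv.time_open <  g.end_time
    ```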

    Also, you are looking for max(high) per bucket, which is different in nature from count(*) in my referenced answer.

    And your buckets are simple intervals per hour?

    Then we can radically simplify. Working with just time_open:

    SELECT date_trunc('hour', time_open) AS hour, max(high) AS max_high
    FROM   historical_ohlcv
    WHERE  exchange_symbol = 'BINANCE'
    AND    symbol_id = 'ETHBTC'
    AND    time_open >= now() - interval '5 months'  -- frame_start
    AND    time_open <  now()                        -- frame_end
    GROUP  BY 1
    ORDER  BY 1;
    

    Related:

    • Resample on time series data

    It's hard to talk about further performance optimization while basics are unclear. And we'd need more information.

    • Are WHERE conditions variable?
    • How many distinct values in exchange_symbol and symbol_id?
    • Avg. row size? What do you get for:

    SELECT avg(pg_column_size(t)) FROM historical_ohlcv t TABLESAMPLE SYSTEM (0.1);
    

    Is the table read-only?

    Assuming you always filter on exchange_symbol and symbol_id, that values are variable, and that your table is read-only or autovacuum can keep up with the write load (so we can hope for index-only scans), you would best have a multicolumn index on (exchange_symbol, symbol_id, time_open, high DESC) to support this query. Index columns in this order. Related:

    • Multicolumn index and performance
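    A sketch of that index, with table and column names taken from the query above (the index name is made up):

    ```sql
    CREATE INDEX historical_ohlcv_special_idx
    ON historical_ohlcv (exchange_symbol, symbol_id, time_open, high DESC);
    ```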

    Depending on data distribution and other details a LEFT JOIN LATERAL solution might be another option. Related:

    • How to find an average of values for time intervals in postgres
    • Optimize GROUP BY query to retrieve latest record per user
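    A sketch of such a LATERAL query, assuming hourly buckets built with generate_series() and the constants from the query above:

    ```sql
    -- Sketch: one correlated subquery per bucket; with the multicolumn index
    -- above, each subquery can be resolved with a cheap index (only) scan.
    SELECT g.start_time, o.max_high
    FROM   generate_series(now() - interval '5 months'
                         , now() - interval '1 hour'
                         , interval '1 hour') AS g(start_time)
    LEFT   JOIN LATERAL (
       SELECT max(high) AS max_high
       FROM   historical_ohlcv
       WHERE  exchange_symbol = 'BINANCE'
       AND    symbol_id = 'ETHBTC'
       AND    time_open >= g.start_time
       AND    time_open <  g.start_time + interval '1 hour'
       ) o ON true
    ORDER  BY 1;
    ```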

    Aside from all that, your EXPLAIN plan exhibits some very bad estimates:

    • https://explain.depesz.com/s/E5yI

    Are you using a current version of Postgres? You may have to work on your server configuration - or at least set higher statistics targets on relevant columns and more aggressive autovacuum settings for the big table. Related:

    • Keep PostgreSQL from sometimes choosing a bad query plan
    • Aggressive Autovacuum on PostgreSQL
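    For illustration, raising statistics targets and per-table autovacuum settings might look like this. The numbers are example values only, to be tuned to your workload:

    ```sql
    -- Higher statistics target for a column with a skewed distribution
    -- (default is 100); takes effect at the next ANALYZE:
    ALTER TABLE historical_ohlcv ALTER COLUMN time_open SET STATISTICS 1000;
    ANALYZE historical_ohlcv;

    -- More aggressive autovacuum for this one big table:
    ALTER TABLE historical_ohlcv SET (autovacuum_vacuum_scale_factor  = 0.01
                                    , autovacuum_analyze_scale_factor = 0.005);
    ```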
