How do I improve date-based query performance on a large table?


Question


This is related to two other questions I posted (it sounded like I should post this as a new question) - the feedback helped, but I think the same issue will come back the next time I need to insert data. Things were still running slowly, which forced me to temporarily remove some of the older data so that only two months' worth remained in the table I'm querying.

  • Indexing strategy for different combinations of WHERE clauses incl. text patterns
  • How to get date_part query to hit index?

Giving further detail this time - hopefully it will help pinpoint the issue:

  • PG version 10.7 (running on Heroku)
  • Total DB size: 18.4 GB (this contains two months' worth of data, and it will grow at approximately the same rate each month)
  • 15GB RAM
  • Total available storage: 512GB
  • The largest table (the one the slowest query acts on) is 9.6 GB - the largest chunk of the total DB - with about 10 million records

Schema of the largest table:

-- Table Definition ----------------------------------------------

CREATE TABLE reportimpression (
    datelocal timestamp without time zone,
    devicename text,
    network text,
    sitecode text,
    advertisername text,
    mediafilename text,
    gender text,
    agegroup text,
    views integer,
    impressions integer,
    dwelltime numeric
);

-- Indices -------------------------------------------------------

CREATE INDEX reportimpression_feb2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-02-01 00:00:00'::timestamp without time zone AND datelocal < '2019-03-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_mar2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-03-01 00:00:00'::timestamp without time zone AND datelocal < '2019-04-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_jan2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-01-01 00:00:00'::timestamp without time zone AND datelocal < '2019-02-01 00:00:00'::timestamp without time zone;

Slow query:

SELECT
    date_part('hour', datelocal) AS hour,
    SUM(CASE WHEN gender = 'male' THEN views ELSE 0 END) AS male,
    SUM(CASE WHEN gender = 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
WHERE
    datelocal >= '2019-03-01' AND
    datelocal < '2019-04-01'
GROUP BY date_part('hour', datelocal)
ORDER BY date_part('hour', datelocal);

The date range in this query will generally span an entire month (it accepts user input from a web-based report) - as you can see, I tried creating an index for each month's worth of data. That helped, but as far as I can tell, unless the query has been run recently (putting the results into the cache), it can still take up to a minute to run.

EXPLAIN ANALYZE results:

Finalize GroupAggregate  (cost=1035890.38..1035897.86 rows=1361 width=24) (actual time=3536.089..3536.108 rows=24 loops=1)
  Group Key: (date_part('hour'::text, datelocal))
  ->  Sort  (cost=1035890.38..1035891.06 rows=1361 width=24) (actual time=3536.083..3536.087 rows=48 loops=1)
        Sort Key: (date_part('hour'::text, datelocal))
        Sort Method: quicksort  Memory: 28kB
        ->  Gather  (cost=1035735.34..1035876.21 rows=1361 width=24) (actual time=3535.926..3579.818 rows=48 loops=1)
              Workers Planned: 1
              Workers Launched: 1
              ->  Partial HashAggregate  (cost=1034735.34..1034740.11 rows=1361 width=24) (actual time=3532.917..3532.933 rows=24 loops=2)
                    Group Key: date_part('hour'::text, datelocal)
                    ->  Parallel Index Scan using reportimpression_mar2019_index on reportimpression  (cost=0.09..1026482.42 rows=3301168 width=17) (actual time=0.045..2132.174 rows=2801158 loops=2)
Planning time: 0.517 ms
Execution time: 3579.965 ms

I wouldn't think 10 million records would be too much to handle, especially given that I recently upgraded the PG plan I'm on to throw more resources at it, so I assume the issue is still either my indexes or my queries not being very efficient.


Answer 1:


A materialized view is the way to go for what you outlined. Queries on past months of read-only data then work without refreshing the view. You may want to special-case the current month if you need to cover it, too.
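For example, a minimal sketch of the pre-aggregation, assuming the report always groups by hour and gender (the view name is illustrative):

CREATE MATERIALIZED VIEW reportimpression_hourly AS
SELECT date_trunc('month', datelocal) AS month
     , date_part('hour', datelocal)   AS hour
     , gender
     , sum(views) AS views
FROM   reportimpression
GROUP  BY 1, 2, 3;

-- after each monthly bulk insert:
REFRESH MATERIALIZED VIEW reportimpression_hourly;

The report query then scans a few hundred pre-aggregated rows per month instead of millions of raw rows.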

The underlying query can still benefit from an index, and there are two directions you might take:

First off, partial indexes like you have now won't buy you much in your scenario; they're not worth it. If you collect many more months of data and mostly query (and add / drop rows) by month, table partitioning might be an idea - then your indexes are partitioned automatically, too. I'd consider Postgres 11 or even the upcoming Postgres 12 for this, though.
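To illustrate, a declarative partitioning setup in Postgres 11 might look like this (table and partition names are illustrative):

CREATE TABLE reportimpression_part (LIKE reportimpression)
    PARTITION BY RANGE (datelocal);

CREATE TABLE reportimpression_2019_03 PARTITION OF reportimpression_part
    FOR VALUES FROM ('2019-03-01') TO ('2019-04-01');

-- Postgres 11 creates a matching index on every partition automatically:
CREATE INDEX ON reportimpression_part (datelocal);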

If your rows are wide, create an index that allows index-only scans. Like:

CREATE INDEX reportimpression_covering_idx ON reportimpression(datelocal, views, gender);

Related:

  • How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?

Or INCLUDE additional columns in Postgres 11 or later:

CREATE INDEX reportimpression_covering_idx ON reportimpression(datelocal) INCLUDE (views, gender);

Else, if your rows are physically sorted by datelocal, consider a BRIN index. It's extremely small and probably about as fast as a B-tree index for your case. (But being so small, it stays cached more easily and doesn't push other data out as much.)

CREATE INDEX reportimpression_brin_idx ON reportimpression USING BRIN (datelocal);

You may be interested in CLUSTER or pg_repack to physically sort table rows. pg_repack can do it without an exclusive lock on the table and even without the B-tree index that CLUSTER requires. But pg_repack is an additional module not shipped with the standard distribution of Postgres.
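For example, with CLUSTER (the index name is illustrative; CLUSTER needs a non-partial B-tree index):

CREATE INDEX reportimpression_datelocal_idx ON reportimpression (datelocal);
-- rewrites the whole table in datelocal order, holding an exclusive lock:
CLUSTER reportimpression USING reportimpression_datelocal_idx;
-- update planner statistics after the rewrite:
ANALYZE reportimpression;

Note that CLUSTER is a one-time operation: new rows are not kept in order, so you would repeat it after each monthly bulk insert.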

Related:

  • Optimize Postgres deletion of orphaned records
  • How to reclaim disk space after delete without rebuilding table?



Answer 2:


Your execution plan seems to be doing the right thing.

Things you can do to improve, in descending order of effectiveness:

  • Use a materialized view that pre-aggregates the data

  • Don't use a hosted database; use your own iron with good local storage and lots of RAM.

  • Use only one index instead of several partial ones. This is not primarily performance advice (the query will probably not be measurably slower unless you have a lot of indexes), but it will ease the management burden; see the sketch after this list.
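A minimal sketch of that consolidation (the replacement is the same plain B-tree index sketched in the other answer):

DROP INDEX reportimpression_jan2019_index,
           reportimpression_feb2019_index,
           reportimpression_mar2019_index;
CREATE INDEX reportimpression_datelocal_idx ON reportimpression (datelocal);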



Source: https://stackoverflow.com/questions/55540605/how-do-i-improve-date-based-query-performance-on-a-large-table
