PostgreSQL query runs faster with index scan, but engine chooses hash join

Asked by 逝去的感伤 on 2020-12-08 05:35

The query:

SELECT "replays_game".*
FROM "replays_game"
INNER JOIN "replays_playeringame"
    ON "replays_game"."id" = "replays_playeringame"."game_id"
WHERE "replays_playeringame"."player_id" = 999999;

4 Answers
  • 2020-12-08 05:43

    You might get a better execution plan using a multicolumn (player_id, game_id) index on the replays_playeringame table. This avoids the random page seeks otherwise needed to look up the game id(s) for a given player id.
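
    A minimal sketch of that suggestion (the index name is illustrative, not from the original post):

    -- with player_id leading, the filter on player_id can walk one index
    -- range and read the matching game_id values directly from the index
    CREATE INDEX replays_playeringame_player_game_idx
        ON replays_playeringame (player_id, game_id);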

  • 2020-12-08 05:47

    My guess is that you are using the default random_page_cost = 4, which is way too high, making index scans look too costly to the planner.
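
    To check what your instance is currently using (a standard command):

    SHOW random_page_cost;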

    I tried to reconstruct the two tables with this script:

    CREATE TABLE replays_game (
        id integer NOT NULL,
        PRIMARY KEY (id)
    );
    
    CREATE TABLE replays_playeringame (
        player_id integer NOT NULL,
        game_id integer NOT NULL,
        PRIMARY KEY (player_id, game_id),
        CONSTRAINT replays_playeringame_game_fkey
            FOREIGN KEY (game_id) REFERENCES replays_game (id)
    );
    
    CREATE INDEX ix_replays_playeringame_game_id
        ON replays_playeringame (game_id);
    
    -- 150k games
    INSERT INTO replays_game
    SELECT generate_series(1, 150000);
    
    -- ~150k players, ~2 games each
    INSERT INTO replays_playeringame
    SELECT trunc(random() * 149999 + 1), generate_series(1, 150000);
    
    INSERT INTO replays_playeringame
    SELECT *
    FROM
        (
            SELECT
                trunc(random() * 149999 + 1) as player_id,
                generate_series(1, 150000) as game_id
        ) AS t
    WHERE
        NOT EXISTS (
            SELECT 1
            FROM replays_playeringame
            WHERE
                t.player_id = replays_playeringame.player_id
                AND t.game_id = replays_playeringame.game_id
        )
    ;
    
    -- the heavy player with 3000 games
    INSERT INTO replays_playeringame
    SELECT 999999, generate_series(1, 3000);
    

    With the default value of 4:

    game=# set random_page_cost = 4;
    SET
    game=# explain analyse SELECT "replays_game".*
    FROM "replays_game"
    INNER JOIN "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id"
    WHERE "replays_playeringame"."player_id" = 999999;
                                                                         QUERY PLAN                                                                      
    -----------------------------------------------------------------------------------------------------------------------------------------------------
     Hash Join  (cost=1483.54..4802.54 rows=3000 width=4) (actual time=3.640..110.212 rows=3000 loops=1)
       Hash Cond: (replays_game.id = replays_playeringame.game_id)
       ->  Seq Scan on replays_game  (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.012..34.261 rows=150000 loops=1)
       ->  Hash  (cost=1446.04..1446.04 rows=3000 width=4) (actual time=3.598..3.598 rows=3000 loops=1)
             Buckets: 1024  Batches: 1  Memory Usage: 106kB
             ->  Bitmap Heap Scan on replays_playeringame  (cost=67.54..1446.04 rows=3000 width=4) (actual time=0.586..2.041 rows=3000 loops=1)
                   Recheck Cond: (player_id = 999999)
                   ->  Bitmap Index Scan on replays_playeringame_pkey  (cost=0.00..66.79 rows=3000 width=0) (actual time=0.560..0.560 rows=3000 loops=1)
                         Index Cond: (player_id = 999999)
     Total runtime: 110.621 ms
    

    After lowering it to 2:

    game=# set random_page_cost = 2;
    SET
    game=# explain analyse SELECT "replays_game".*
    FROM "replays_game"
    INNER JOIN "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id"
    WHERE "replays_playeringame"."player_id" = 999999;
                                                                      QUERY PLAN                                                                   
    -----------------------------------------------------------------------------------------------------------------------------------------------
     Nested Loop  (cost=45.52..4444.86 rows=3000 width=4) (actual time=0.418..27.741 rows=3000 loops=1)
       ->  Bitmap Heap Scan on replays_playeringame  (cost=45.52..1424.02 rows=3000 width=4) (actual time=0.406..1.502 rows=3000 loops=1)
             Recheck Cond: (player_id = 999999)
             ->  Bitmap Index Scan on replays_playeringame_pkey  (cost=0.00..44.77 rows=3000 width=0) (actual time=0.388..0.388 rows=3000 loops=1)
                   Index Cond: (player_id = 999999)
       ->  Index Scan using replays_game_pkey on replays_game  (cost=0.00..0.99 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=3000)
             Index Cond: (id = replays_playeringame.game_id)
     Total runtime: 28.542 ms
    (8 rows)
    

    If using an SSD, I would lower it further, to 1.1.
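
    A sketch of how the change can be applied beyond the current session (standard PostgreSQL statements; the tablespace name is hypothetical):

    -- cluster-wide, persisted to postgresql.auto.conf (requires superuser)
    ALTER SYSTEM SET random_page_cost = 1.1;
    SELECT pg_reload_conf();

    -- or scoped to a tablespace that lives on an SSD (name is illustrative)
    ALTER TABLESPACE ssd_space SET (random_page_cost = 1.1);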

    As for your last question, I really think you should stick with PostgreSQL. I have experience with both PostgreSQL and MSSQL, and I had to put in triple the effort into the latter for it to perform half as well as the former.

  • 2020-12-08 05:52

    I ran sayap's testbed code (thanks!), with the following modifications:

    • The code is run four times with random_page_cost set to 8, 4, 2, 1, in that order (the rpc=8 run is only intended to prime the disk buffer cache).
    • The test is repeated with reduced fractions (1/2, 1/4, 1/8) of the hard-hitters (respectively: 3000, 1500, 750, and 375 hard-hitters); the rest of the records is kept unchanged.
    • These 4*4 tests are repeated with a lower setting (64 KB, the minimum) for work_mem; one cell of the grid is sketched below.
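
    A minimal sketch of one cell of that grid, assuming the query from the question (the timing and plan shape are read off the EXPLAIN ANALYZE output):

    SET random_page_cost = 4;
    SET work_mem = '64kB';   -- the minimum; the other series uses 16MB

    EXPLAIN ANALYZE
    SELECT "replays_game".*
    FROM "replays_game"
    INNER JOIN "replays_playeringame"
        ON "replays_game"."id" = "replays_playeringame"."game_id"
    WHERE "replays_playeringame"."player_id" = 999999;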

    After this run, I did the same run, but scaled up tenfold: with 1M5 records (30K hard-hitters).

    Currently, I am running the same test with a hundred-fold scale-up, but the initialisation is rather slow...

    Results: the entries in the cells are the total time in msec, plus a string that denotes the chosen query plan (only a handful of plans occur):

    Original 3K / 150K  work_mem=16M
    
    rpc     |       3K      |       1K5     |       750     |       375
    --------+---------------+---------------+---------------+------------
    8*      | 50.8  H.BBi.HS| 44.3  H.BBi.HS| 38.5  H.BBi.HS| 41.0  H.BBi.HS
    4       | 43.6  H.BBi.HS| 48.6  H.BBi.HS| 4.34  NBBi    | 1.33  NBBi
    2       | 6.92  NBBi    | 3.51  NBBi    | 4.61  NBBi    | 1.24  NBBi
    1       | 6.43  NII     | 3.49  NII     | 4.19  NII     | 1.18  NII
    
    
    Original 3K / 150K work_mem=64K
    
    rpc     |       3K      |       1K5     |       750     |       375
    --------+---------------+---------------+---------------+------------
    8*      | 74.2  H.BBi.HS| 69.6  NBBi    | 62.4  H.BBi.HS| 66.9  H.BBi.HS
    4       | 6.67  NBBi    | 8.53  NBBi    | 1.91  NBBi    | 2.32  NBBi
    2       | 6.66  NBBi    | 3.6   NBBi    | 1.77  NBBi    | 0.93  NBBi
    1       | 7.81  NII     | 3.26  NII     | 1.67  NII     | 0.86  NII
    
    
    Scaled 10*: 30K / 1M5  work_mem=16M
    
    rpc     |       30K     |       15K     |       7k5     |       3k75
    --------+---------------+---------------+---------------+------------
    8*      | 623   H.BBi.HS| 556   H.BBi.HS| 531   H.BBi.HS| 14.9  NBBi
    4       | 56.4  M.I.sBBi| 54.3  NBBi    | 27.1  NBBi    | 19.1  NBBi
    2       | 71.0  NBBi    | 18.9  NBBi    | 9.7   NBBi    | 9.7   NBBi
    1       | 79.0  NII     | 35.7  NII     | 17.7  NII     | 9.3   NII
    
    
    Scaled 10*: 30K / 1M5  work_mem=64K
    
    rpc     |       30K     |       15K     |       7k5     |       3k75
    --------+---------------+---------------+---------------+------------
    8*      | 729   H.BBi.HS| 722   H.BBi.HS| 723   H.BBi.HS| 19.6  NBBi
    4       | 55.5  M.I.sBBi| 41.5  NBBi    | 19.3  NBBi    | 13.3  NBBi
    2       | 70.5  NBBi    | 41.0  NBBi    | 26.3  NBBi    | 10.7  NBBi
    1       | 69.7  NII     | 38.5  NII     | 20.0  NII     | 9.0   NII
    
    Scaled 100*: 300K / 15M  work_mem=16M
    
    rpc     |       300k    |       150K    |       75k     |       37k5
    --------+---------------+---------------+---------------+---------------
    8*      |7314   H.BBi.HS|9422   H.BBi.HS|6175   H.BBi.HS| 122   N.BBi.I
    4       | 569   M.I.sBBi| 199   M.I.sBBi| 142   M.I.sBBi| 105   N.BBi.I
    2       | 527   M.I.sBBi| 372   N.BBi.I | 198   N.BBi.I | 110   N.BBi.I
    1       | 694   NII     | 362   NII     | 190   NII     | 107   NII
    
    Scaled 100*: 300K / 15M  work_mem=64K
    
    rpc     |       300k    |       150k    |       75k     |       37k5
    --------+---------------+---------------+---------------+------------
    8*      |22800 H.BBi.HS |21920 H.BBi.HS | 20630 N.BBi.I |19669  H.BBi.HS
    4       |22095 H.BBi.HS |  284 M.I.msBBi| 205   B.BBi.I |  116  N.BBi.I
    2       |  528 M.I.msBBi|  399  N.BBi.I | 211   N.BBi.I |  110  N.BBi.I
    1       |  718 NII      |  364  NII     | 200   NII     |  105  NII
    
    [8*] Note: the random_page_cost=8 runs were only intended as a pre-run to prime the disk buffer cache; their results should be ignored.
    
    Legend for node types:
    N := Nested loop
    M := Merge join
    H := Hash (or Hash join)
    B := Bitmap heap scan
    Bi := Bitmap index scan
    S := Seq scan
    s := sort
    m := materialise
    

    Preliminary conclusion:

    • "The working set" for the original query is too small: all of it fits in core, so the cost of page fetches is grossly overestimated. Setting RPC to 2 (or 1) "solves" this problem, but once the query is scaled up, the page costs become dominant, and RPC=4 becomes comparable or even better.

    • Setting work_mem to a lower value is another way to make the optimiser shift to index scans (instead of hash + bitmap scans). The differences I found are smaller than what sayap reported; maybe I have a larger effective_cache_size, or he forgot to prime the cache?

    • The optimiser is known to have problems with "skewed" distributions (and "skewed" or "peaked" multidimensional distributions). The test runs with 1/4 and 1/8 of the initial 3K/150K hard-hitters show that this effect vanishes once the "peak" flattens out.
    • Something happens at the 2% boundary: the 3000/150000 case generates different (worse) plans than those with <2% hard-hitters. Could this be the granularity of the histograms? (See the sketch below.)
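
    One way to probe that guess, using standard catalog views (default_statistics_target defaults to 100, so the MCV list and histogram each hold at most 100 entries):

    -- inspect the sampled distribution of player_id; a hard-hitter around
    -- the ~1-2% frequency mark may or may not make the most-common-values list
    SELECT most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'replays_playeringame' AND attname = 'player_id';

    -- collect finer-grained statistics for that column, then re-analyse
    ALTER TABLE replays_playeringame ALTER COLUMN player_id SET STATISTICS 1000;
    ANALYZE replays_playeringame;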
  • 2020-12-08 05:56

    This is an old post, but quite helpful, as I just encountered a similar issue.

    Here is my finding so far. Given that there are 151208 rows in replays_game, the average cost of fetching one row by id is about log(151208) ≈ 12. Since there are 3395 matching records in replays_playeringame after filtering, the total cost is about 12 * 3395, which is rather high. The planner also overestimates the page cost: it assumes all rows are randomly distributed, which they are not; if that assumption were true, a seq scan would indeed be much better. So basically, the query plan is trying to avoid the worst-case scenario.
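
    The same back-of-the-envelope arithmetic, reproduced in SQL (the natural log is used here only as a rough proxy for per-lookup cost, matching the estimate above):

    SELECT round(ln(151208)::numeric, 1)       AS cost_per_lookup,   -- ≈ 11.9
           round((3395 * ln(151208))::numeric) AS total_lookup_cost; -- ≈ 40490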

    @dsjoerg's problem is that there is no index on replays_playeringame(game_id). An index scan would always be used if there were an index on replays_playeringame(game_id): the cost of scanning the index would become 3395 + 12 (or something close to that).

    @Neil suggested an index on (player_id, game_id), which is close but not exact. The right index to have is either (game_id) or (game_id, player_id); both variants are sketched below.
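
    A minimal sketch of the two variants (the index names are illustrative):

    -- either a plain index on the join column...
    CREATE INDEX replays_playeringame_game_idx
        ON replays_playeringame (game_id);

    -- ...or a composite one led by game_id, which serves the same join
    CREATE INDEX replays_playeringame_game_player_idx
        ON replays_playeringame (game_id, player_id);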
