Improve performance of first query

Submitted by 自闭症网瘾萝莉.ら on 2019-12-05 08:12:36
totten

Postgres lets you adjust some planner settings at run time, which influence how heavily it weighs I/O cost when it chooses a query plan.

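random_page_cost (floating point) is the setting that may help you. It is the planner's estimated cost of fetching a disk page non-sequentially, relative to a sequential page fetch, so it effectively sets your I/O-to-CPU cost ratio.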

A higher value tells the planner that random I/O is expensive (for example, a spinning disk that favors sequential reads); a lower value tells it that random access is cheap (for example, an SSD, or data that is mostly cached).

The default value is 4.0; you may want to increase it and test whether your query takes less time.
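A minimal sketch of how such a test could look, assuming the table and column names mentioned elsewhere in this thread (foo3_beleg, foo3_text, beleg_id); the value 8.0 is only an illustrative choice, and the query is a stand-in for your own:

```sql
-- Change the setting for the current session only (illustrative value, not a recommendation).
SET random_page_cost = 8.0;

-- Re-run the slow query and compare the plan and timing.
-- BUFFERS reports how many pages came from cache versus disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT b.id
FROM foo3_beleg b
WHERE EXISTS (SELECT 1 FROM foo3_text t WHERE t.beleg_id = b.id);

-- Go back to the configured default.
RESET random_page_cost;
```

Because SET only affects the current session, you can try several values without touching postgresql.conf.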

Keep in mind that your I/O priority will depend on your column count and row count.

A big BUT: since your indexes are btree, the CPU share of the work shrinks much faster than the I/O share grows as the table gets larger. You can roughly map the complexities to priorities:

CPU Priority = O(log(x))
I/O Priority = O(x)

All in all, this means that if Postgres's default of 4.0 is right for 100k entries, you should set it to approximately (4.0 * log(100k) * 10M) / (log(10M) * 100k) for 10M entries.
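The expression above can be evaluated directly in psql. This is only the arithmetic from the formula with 100k and 10M written out; the column alias is made up for illustration:

```sql
-- (4.0 * log(100k) * 10M) / (log(10M) * 100k)
-- The base of the logarithm cancels out in the ratio, so ln() is fine here.
SELECT (4.0 * ln(100000.0) * 10000000) / (ln(10000000.0) * 100000)
       AS scaled_random_page_cost;
```

This comes out to about 285.7. Treat it as a starting point for testing rather than a tuned value.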

I agree with Julius, but if you only need columns from foo3_beleg, try EXISTS instead (and it would help if you had pasted your SQL too, not just your explain plan).

select ...
from foo3_beleg b
where exists
(select 1 from foo3_text t where t.beleg_id = b.id)
....

However, I suspect your "wake up" on the first pass is just your database loading the IN subquery rows into memory. That will likely happen regardless, though an EXISTS is often faster than an IN (INs are rarely needed unless they contain hardcoded lists, and they are a yellow flag when I review SQL).

Ludovic Feltz

The first time you execute the query, Postgres loads the data from disk, which is slow even with a good hard drive. The second time you run the query, it reads the already-loaded data from RAM, which is much faster.

The solution to this problem would be to load relation data into either the operating system buffer cache or the PostgreSQL buffer cache with:

int8 pg_prewarm(regclass, mode text default 'buffer', fork text default 'main', first_block int8 default null, last_block int8 default null)

The first argument is the relation to be prewarmed. The second argument is the prewarming method to be used, as further discussed below; the third is the relation fork to be prewarmed, usually main. The fourth argument is the first block number to prewarm (NULL is accepted as a synonym for zero). The fifth argument is the last block number to prewarm (NULL means prewarm through the last block in the relation). The return value is the number of blocks prewarmed.

There are three available prewarming methods. prefetch issues asynchronous prefetch requests to the operating system, if this is supported, or throws an error otherwise. read reads the requested range of blocks; unlike prefetch, this is synchronous and supported on all platforms and builds, but may be slower. buffer reads the requested range of blocks into the database buffer cache.

Note that with any of these methods, attempting to prewarm more blocks than can be cached (by the OS when using prefetch or read, or by PostgreSQL when using buffer) will likely result in lower-numbered blocks being evicted as higher-numbered blocks are read in. Prewarmed data also enjoys no special protection from cache evictions, so it is possible that other system activity may evict the newly prewarmed blocks shortly after they are read; conversely, prewarming may also evict other data from cache. For these reasons, prewarming is typically most useful at startup, when caches are largely empty.
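As a rough sketch of how this could be applied to the tables from the question: the table names are taken from this thread, while the index name is hypothetical and should be replaced with one of your own.

```sql
-- Install the extension once per database (it is part of the contrib modules).
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Load each table's main fork into the PostgreSQL buffer cache
-- (default mode 'buffer'); the result is the number of blocks prewarmed.
SELECT pg_prewarm('foo3_beleg');
SELECT pg_prewarm('foo3_text');

-- Indexes are separate relations and need their own call.
-- 'foo3_text_content_idx' is a hypothetical index name.
SELECT pg_prewarm('foo3_text_content_idx', 'read');
```

Running these right after a server restart matches the note above that prewarming is most useful while the caches are still empty.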

Source

Hope this helped!

Sometimes moving a "WHERE x IN" into a JOIN can improve performance significantly. Try this:

SELECT
  b.id, ...
FROM
  foo3_beleg b INNER JOIN
  foo3_text  t ON (t.beleg_id = b.id AND t.content @@ 'footown'::tsquery)
WHERE
  b.belegart_id IN ('...', ...);

Here's a repeatable experiment to support my claim.

I happen to have a big Postgres database handy (30 million rows) (http://juliusdavies.ca/2013/j.emse/bertillonage/), so I loaded that into postgres 9.4beta3.

The results are impressive. The sub-select approach is approximately 20 times slower:

time  psql myDb < using-in.sql
real    0m17.212s

time  psql myDb < using-join.sql
real    0m0.807s

For those interested in replicating, here are the raw SQL queries I used to test my theory.

This query uses a "SELECT IN" subquery, and it's 20 times slower (17 seconds on my laptop on the first execution):

  -- using-in.sql
  SELECT
    COUNT(DISTINCT sigsha1re) AS a_intersect_b, infilesha1
  FROM
    files INNER JOIN sigs  ON (files.filesha1 = sigs.filesha1)
  WHERE
    sigs.sigsha1re IN (
      SELECT sigsha1re FROM sigs WHERE sigs.sigsha1re like '0347%'
    )  
  GROUP BY
    infilesha1

This query moves the condition out of the subquery and into the joining criteria, and it's 20 times faster (0.8 seconds on my laptop on the first execution).

  -- using-join.sql
  SELECT
    COUNT(DISTINCT sigsha1re) AS a_intersect_b, infilesha1
  FROM
    files INNER JOIN sigs  ON (
      files.filesha1 = sigs.filesha1 AND sigs.sigsha1re like '0347%'
    )
  GROUP BY
    infilesha1

p.s. if you're curious what that database is for, you can use it to calculate how similar an arbitrary jar file is to all of the jar files in the maven repository circa 2011.

./query.sh lib/commons-codec-1.5.jar | psql myDb

 similarity |                      a = 39 = commons-codec-1.5.jar  (bin2bin)                       
------------+--------------------------------------------------------------------------------------
  1.000     | commons-codec-1.5.jar
  0.447     | commons-codec-1.4.jar
  0.174     | org.apache.sling.auth.form-1.0.2.jar
  0.170     | org.apache.sling.auth.form-1.0.0.jar
  0.142     | jbehave-core-3.0-beta-3.jar
  0.142     | jbehave-core-3.0-beta-4.jar
  0.141     | jbehave-core-3.0-beta-5.jar
  0.141     | jbehave-core-3.0-beta-6.jar
  0.140     | commons-codec-1.2.jar