Optimize slow aggregates in LATERAL join

夙愿已清 提交于 2019-12-23 23:30:16

问题


In my PostgreSQL 9.6.2 database, I've got a query that builds a table of calculated fields from some stock data. It calculates a moving average window of 1 through 10 years for each row in the table, and uses that in cyclical adjustments. CAPE, CAPB, CAPC, CAPS, and CAPD, specifically.

I.e., for each row, you calculate avg(earnings) for the past 1 through 10 years, then do the same for several other variables.

I'm currently using a lateral join to compute the aggregates for each row, but things are incredibly slow and I'm not entirely sure how to speed it up, whether it be indexing / rewriting the query, etc.

For instance when I stratify the query to only include ~ 25k rows, it takes 15 minutes to run which seems way too slow. (RDS on AWS Free Tier)

-- Initialize Cyclical Adjustments
-- This query populates the database with numerous peak/min and CAPE type calculations.
-- We do this by selecting each valid row within the table by security then laterally
-- selecting the calculations for each of those rows. 'Valid' rows are determined by
-- date calculations that make sure every field that has insufficient data behind it
-- (several queries want 5+ years of time series data) is filled with NULL to avoid
-- inaccuracies.

WITH earliest_point AS (
    SELECT
      security_id,
      min(date) as min_date
    FROM bloomberg.security_data
    GROUP BY security_id
)
SELECT
  rec.record_id,
  rec.security_id,
  date,
  -- Each of these cases decides if we have enough data in the database to populate the field. If there are at least
  -- x years in the database (where x = 1:10) we do the price / aggregate computation. Otherwise, we shortcut to NULL.
  -- NOTE: The NULLIF prevents us from dividing by zero.
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.max_earnings, 0) ELSE NULL END AS price_to_peak_earnings,
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.min_earnings, 0) ELSE NULL END AS price_to_minimum_earnings,
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.max_book, 0) ELSE NULL END AS price_to_peak_book,
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.min_book, 0) ELSE NULL END AS price_to_minimum_book,
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.max_sales, 0) ELSE NULL END AS price_to_peak_sales,
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.min_sales, 0) ELSE NULL END AS price_to_minimum_sales,
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.max_cashflow, 0) ELSE NULL END AS price_to_peak_cashflow,
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.min_cashflow, 0) ELSE NULL END AS price_to_minimum_cashflow,
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.max_dividends, 0) ELSE NULL END AS price_to_peak_dividends,
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.min_dividends, 0) ELSE NULL END AS price_to_minimum_dividends,
  CASE WHEN ep.min_date < rec.date - '1 years'::INTERVAL  THEN price / NULLIF(ru.cap1_avg_earnings, 0) ELSE NULL END AS cape1,
  CASE WHEN ep.min_date < rec.date - '2 years'::INTERVAL  THEN price / NULLIF(ru.cap2_avg_earnings, 0) ELSE NULL END AS cape2,
  CASE WHEN ep.min_date < rec.date - '2 years'::INTERVAL  THEN price / NULLIF(ru.cap2_avg_earnings, 0) ELSE NULL END AS cape2,
  CASE WHEN ep.min_date < rec.date - '3 years'::INTERVAL  THEN price / NULLIF(ru.cap3_avg_earnings, 0) ELSE NULL END AS cape3,
  CASE WHEN ep.min_date < rec.date - '4 years'::INTERVAL  THEN price / NULLIF(ru.cap4_avg_earnings, 0) ELSE NULL END AS cape4,
  CASE WHEN ep.min_date < rec.date - '5 years'::INTERVAL  THEN price / NULLIF(ru.cap5_avg_earnings, 0) ELSE NULL END AS cape5,
  CASE WHEN ep.min_date < rec.date - '6 years'::INTERVAL  THEN price / NULLIF(ru.cap6_avg_earnings, 0) ELSE NULL END AS cape6,
  CASE WHEN ep.min_date < rec.date - '7 years'::INTERVAL  THEN price / NULLIF(ru.cap7_avg_earnings, 0) ELSE NULL END AS cape7,
  CASE WHEN ep.min_date < rec.date - '8 years'::INTERVAL  THEN price / NULLIF(ru.cap8_avg_earnings, 0) ELSE NULL END AS cape8,
  CASE WHEN ep.min_date < rec.date - '9 years'::INTERVAL  THEN price / NULLIF(ru.cap9_avg_earnings, 0) ELSE NULL END AS cape9,
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.cap10_avg_earnings, 0) ELSE NULL END AS cape10,
  CASE WHEN ep.min_date < rec.date - '1 years'::INTERVAL  THEN price / NULLIF(ru.cap1_avg_book, 0) ELSE NULL END AS capb1,
  CASE WHEN ep.min_date < rec.date - '2 years'::INTERVAL  THEN price / NULLIF(ru.cap2_avg_book, 0) ELSE NULL END AS capb2,
  CASE WHEN ep.min_date < rec.date - '3 years'::INTERVAL  THEN price / NULLIF(ru.cap3_avg_book, 0) ELSE NULL END AS capb3,
  CASE WHEN ep.min_date < rec.date - '4 years'::INTERVAL  THEN price / NULLIF(ru.cap4_avg_book, 0) ELSE NULL END AS capb4,
  CASE WHEN ep.min_date < rec.date - '5 years'::INTERVAL  THEN price / NULLIF(ru.cap5_avg_book, 0) ELSE NULL END AS capb5,
  CASE WHEN ep.min_date < rec.date - '6 years'::INTERVAL  THEN price / NULLIF(ru.cap6_avg_book, 0) ELSE NULL END AS capb6,
  CASE WHEN ep.min_date < rec.date - '7 years'::INTERVAL  THEN price / NULLIF(ru.cap7_avg_book, 0) ELSE NULL END AS capb7,
  CASE WHEN ep.min_date < rec.date - '8 years'::INTERVAL  THEN price / NULLIF(ru.cap8_avg_book, 0) ELSE NULL END AS capb8,
  CASE WHEN ep.min_date < rec.date - '9 years'::INTERVAL  THEN price / NULLIF(ru.cap9_avg_book, 0) ELSE NULL END AS capb9,
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.cap10_avg_book, 0) ELSE NULL END AS capb10,
  CASE WHEN ep.min_date < rec.date - '1 years'::INTERVAL  THEN price / NULLIF(ru.cap1_avg_sales, 0) ELSE NULL END AS caps1,
  CASE WHEN ep.min_date < rec.date - '2 years'::INTERVAL  THEN price / NULLIF(ru.cap2_avg_sales, 0) ELSE NULL END AS caps2,
  CASE WHEN ep.min_date < rec.date - '3 years'::INTERVAL  THEN price / NULLIF(ru.cap3_avg_sales, 0) ELSE NULL END AS caps3,
  CASE WHEN ep.min_date < rec.date - '4 years'::INTERVAL  THEN price / NULLIF(ru.cap4_avg_sales, 0) ELSE NULL END AS caps4,
  CASE WHEN ep.min_date < rec.date - '5 years'::INTERVAL  THEN price / NULLIF(ru.cap5_avg_sales, 0) ELSE NULL END AS caps5,
  CASE WHEN ep.min_date < rec.date - '6 years'::INTERVAL  THEN price / NULLIF(ru.cap6_avg_sales, 0) ELSE NULL END AS caps6,
  CASE WHEN ep.min_date < rec.date - '7 years'::INTERVAL  THEN price / NULLIF(ru.cap7_avg_sales, 0) ELSE NULL END AS caps7,
  CASE WHEN ep.min_date < rec.date - '8 years'::INTERVAL  THEN price / NULLIF(ru.cap8_avg_sales, 0) ELSE NULL END AS caps8,
  CASE WHEN ep.min_date < rec.date - '9 years'::INTERVAL  THEN price / NULLIF(ru.cap9_avg_sales, 0) ELSE NULL END AS caps9,
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.cap10_avg_sales, 0) ELSE NULL END AS caps10,
  CASE WHEN ep.min_date < rec.date - '1 years'::INTERVAL  THEN price / NULLIF(ru.cap1_avg_cashflow, 0) ELSE NULL END AS capc1,
  CASE WHEN ep.min_date < rec.date - '2 years'::INTERVAL  THEN price / NULLIF(ru.cap2_avg_cashflow, 0) ELSE NULL END AS capc2,
  CASE WHEN ep.min_date < rec.date - '3 years'::INTERVAL  THEN price / NULLIF(ru.cap3_avg_cashflow, 0) ELSE NULL END AS capc3,
  CASE WHEN ep.min_date < rec.date - '4 years'::INTERVAL  THEN price / NULLIF(ru.cap4_avg_cashflow, 0) ELSE NULL END AS capc4,
  CASE WHEN ep.min_date < rec.date - '5 years'::INTERVAL  THEN price / NULLIF(ru.cap5_avg_cashflow, 0) ELSE NULL END AS capc5,
  CASE WHEN ep.min_date < rec.date - '6 years'::INTERVAL  THEN price / NULLIF(ru.cap6_avg_cashflow, 0) ELSE NULL END AS capc6,
  CASE WHEN ep.min_date < rec.date - '7 years'::INTERVAL  THEN price / NULLIF(ru.cap7_avg_cashflow, 0) ELSE NULL END AS capc7,
  CASE WHEN ep.min_date < rec.date - '8 years'::INTERVAL  THEN price / NULLIF(ru.cap8_avg_cashflow, 0) ELSE NULL END AS capc8,
  CASE WHEN ep.min_date < rec.date - '9 years'::INTERVAL  THEN price / NULLIF(ru.cap9_avg_cashflow, 0) ELSE NULL END AS capc9,
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.cap10_avg_cashflow, 0) ELSE NULL END AS capc10,
  CASE WHEN ep.min_date < rec.date - '1 years'::INTERVAL  THEN price / NULLIF(ru.cap1_avg_dividends, 0) ELSE NULL END AS capd1,
  CASE WHEN ep.min_date < rec.date - '2 years'::INTERVAL  THEN price / NULLIF(ru.cap2_avg_dividends, 0) ELSE NULL END AS capd2,
  CASE WHEN ep.min_date < rec.date - '3 years'::INTERVAL  THEN price / NULLIF(ru.cap3_avg_dividends, 0) ELSE NULL END AS capd3,
  CASE WHEN ep.min_date < rec.date - '4 years'::INTERVAL  THEN price / NULLIF(ru.cap4_avg_dividends, 0) ELSE NULL END AS capd4,
  CASE WHEN ep.min_date < rec.date - '5 years'::INTERVAL  THEN price / NULLIF(ru.cap5_avg_dividends, 0) ELSE NULL END AS capd5,
  CASE WHEN ep.min_date < rec.date - '6 years'::INTERVAL  THEN price / NULLIF(ru.cap6_avg_dividends, 0) ELSE NULL END AS capd6,
  CASE WHEN ep.min_date < rec.date - '7 years'::INTERVAL  THEN price / NULLIF(ru.cap7_avg_dividends, 0) ELSE NULL END AS capd7,
  CASE WHEN ep.min_date < rec.date - '8 years'::INTERVAL  THEN price / NULLIF(ru.cap8_avg_dividends, 0) ELSE NULL END AS capd8,
  CASE WHEN ep.min_date < rec.date - '9 years'::INTERVAL  THEN price / NULLIF(ru.cap9_avg_dividends, 0) ELSE NULL END AS capd9,
  CASE WHEN ep.min_date < rec.date - '10 years'::INTERVAL THEN price / NULLIF(ru.cap10_avg_dividends, 0) ELSE NULL END AS capd10
FROM bloomberg.security_data rec
  -- Include the earliest point we have for this security in the record
  JOIN earliest_point ep ON ep.security_id = rec.security_id,
  -- LATERAL SELECT is executed for each row in the above query, with the row (rec) as a parameter
  LATERAL
  (
  SELECT
    -- Price to Peak/Minimum <field> calculations
    max(earnings)  AS max_earnings,
    min(earnings)  AS min_earnings,
    max(book)      AS max_book,
    min(book)      AS min_book,
    max(sales)     AS max_sales,
    min(sales)     AS min_sales,
    max(cashflow)  AS max_cashflow,
    min(cashflow)  AS min_cashflow,
    max(dividends) AS max_dividends,
    min(dividends) AS min_dividends,

    -- Each of the following computes the aggregates for the
    -- CAPE/B/S/C/D cyclical adjustments.
    avg(earnings)  FILTER (WHERE date >= rec.date - '1 years'::interval) AS cap1_avg_earnings,
    avg(book)      FILTER (WHERE date >= rec.date - '1 years'::interval) AS cap1_avg_book,
    avg(sales)     FILTER (WHERE date >= rec.date - '1 years'::interval) AS cap1_avg_sales,
    avg(cashflow)  FILTER (WHERE date >= rec.date - '1 years'::interval) AS cap1_avg_cashflow,
    avg(dividends) FILTER (WHERE date >= rec.date - '1 years'::interval) AS cap1_avg_dividends,

    avg(earnings)  FILTER (WHERE date >= rec.date - '2 years'::interval) AS cap2_avg_earnings,
    avg(book)      FILTER (WHERE date >= rec.date - '2 years'::interval) AS cap2_avg_book,
    avg(sales)     FILTER (WHERE date >= rec.date - '2 years'::interval) AS cap2_avg_sales,
    avg(cashflow)  FILTER (WHERE date >= rec.date - '2 years'::interval) AS cap2_avg_cashflow,
    avg(dividends) FILTER (WHERE date >= rec.date - '2 years'::interval) AS cap2_avg_dividends,

    avg(earnings)  FILTER (WHERE date >= rec.date - '3 years'::interval) AS cap3_avg_earnings,
    avg(book)      FILTER (WHERE date >= rec.date - '3 years'::interval) AS cap3_avg_book,
    avg(sales)     FILTER (WHERE date >= rec.date - '3 years'::interval) AS cap3_avg_sales,
    avg(cashflow)  FILTER (WHERE date >= rec.date - '3 years'::interval) AS cap3_avg_cashflow,
    avg(dividends) FILTER (WHERE date >= rec.date - '3 years'::interval) AS cap3_avg_dividends,

    avg(earnings)  FILTER (WHERE date >= rec.date - '4 years'::interval) AS cap4_avg_earnings,
    avg(book)      FILTER (WHERE date >= rec.date - '4 years'::interval) AS cap4_avg_book,
    avg(sales)     FILTER (WHERE date >= rec.date - '4 years'::interval) AS cap4_avg_sales,
    avg(cashflow)  FILTER (WHERE date >= rec.date - '4 years'::interval) AS cap4_avg_cashflow,
    avg(dividends) FILTER (WHERE date >= rec.date - '4 years'::interval) AS cap4_avg_dividends,

    avg(earnings)  FILTER (WHERE date >= rec.date - '5 years'::interval) AS cap5_avg_earnings,
    avg(book)      FILTER (WHERE date >= rec.date - '5 years'::interval) AS cap5_avg_book,
    avg(sales)     FILTER (WHERE date >= rec.date - '5 years'::interval) AS cap5_avg_sales,
    avg(cashflow)  FILTER (WHERE date >= rec.date - '5 years'::interval) AS cap5_avg_cashflow,
    avg(dividends) FILTER (WHERE date >= rec.date - '5 years'::interval) AS cap5_avg_dividends,

    avg(earnings)  FILTER (WHERE date >= rec.date - '6 years'::interval) AS cap6_avg_earnings,
    avg(book)      FILTER (WHERE date >= rec.date - '6 years'::interval) AS cap6_avg_book,
    avg(sales)     FILTER (WHERE date >= rec.date - '6 years'::interval) AS cap6_avg_sales,
    avg(cashflow)  FILTER (WHERE date >= rec.date - '6 years'::interval) AS cap6_avg_cashflow,
    avg(dividends) FILTER (WHERE date >= rec.date - '6 years'::interval) AS cap6_avg_dividends,

    avg(earnings)  FILTER (WHERE date >= rec.date - '7 years'::interval) AS cap7_avg_earnings,
    avg(book)      FILTER (WHERE date >= rec.date - '7 years'::interval) AS cap7_avg_book,
    avg(sales)     FILTER (WHERE date >= rec.date - '7 years'::interval) AS cap7_avg_sales,
    avg(cashflow)  FILTER (WHERE date >= rec.date - '7 years'::interval) AS cap7_avg_cashflow,
    avg(dividends) FILTER (WHERE date >= rec.date - '7 years'::interval) AS cap7_avg_dividends,

    avg(earnings)  FILTER (WHERE date >= rec.date - '8 years'::interval) AS cap8_avg_earnings,
    avg(book)      FILTER (WHERE date >= rec.date - '8 years'::interval) AS cap8_avg_book,
    avg(sales)     FILTER (WHERE date >= rec.date - '8 years'::interval) AS cap8_avg_sales,
    avg(cashflow)  FILTER (WHERE date >= rec.date - '8 years'::interval) AS cap8_avg_cashflow,
    avg(dividends) FILTER (WHERE date >= rec.date - '8 years'::interval) AS cap8_avg_dividends,

    avg(earnings)  FILTER (WHERE date >= rec.date - '9 years'::interval) AS cap9_avg_earnings,
    avg(book)      FILTER (WHERE date >= rec.date - '9 years'::interval) AS cap9_avg_book,
    avg(sales)     FILTER (WHERE date >= rec.date - '9 years'::interval) AS cap9_avg_sales,
    avg(cashflow)  FILTER (WHERE date >= rec.date - '9 years'::interval) AS cap9_avg_cashflow,
    avg(dividends) FILTER (WHERE date >= rec.date - '9 years'::interval) AS cap9_avg_dividends,

    avg(earnings)  AS cap10_avg_earnings,
    avg(book)      AS cap10_avg_book,
    avg(sales)     AS cap10_avg_sales,
    avg(cashflow)  AS cap10_avg_cashflow,
    avg(dividends) AS cap10_avg_dividends

  FROM bloomberg.security_data DATA
  WHERE security_id = rec.security_id
    AND date >= rec.date - '10 years'::interval
    AND date <= rec.date
  ) ru;

Any ideas on how to make this faster would be greatly appreciated as I'm new to PostgreSQL.

Here is the database setup for reference:

CREATE SCHEMA bloomberg;

CREATE TABLE bloomberg.securities (
  security_id character varying(45) PRIMARY KEY,
  name_short character varying(45) NOT NULL,
  name character varying(45) NOT NULL,
  name_security character varying(45) NOT NULL
);

CREATE TABLE bloomberg.security_data (
  record_id bigserial PRIMARY KEY,
  date date NOT NULL,
  security_id character varying(45) NOT NULL,
  price double precision,
  total_return double precision,
  earnings double precision,
  book double precision,
  sales double precision,
  cashflow double precision,
  dividends double precision,
  CONSTRAINT security_id FOREIGN KEY (security_id)
  REFERENCES bloomberg.securities (security_id) MATCH SIMPLE
  ON UPDATE CASCADE
  ON DELETE CASCADE
);

CREATE INDEX security_data_data on bloomberg.security_data (date);
CREATE INDEX security_data_security_id on bloomberg.security_data (security_id);

回答1:


This should be a faster variant with LATERAL subqueries. Untested.

SELECT s.record_id, s.security_id, s.date
     , s.price / l.pmax   AS price_to_peak_earnings
     , s.price / l.pmin   AS price_to_minimum_earnings
  -- , ...
     , s.price / l.cape1  AS cape1
     , s.price / l.cape2  AS cape2
  -- , ...
     , s.price / l.cape10 AS cape10
     , s.price / l.capb1  AS capb1
     , s.price / l.capb2  AS capb2
  -- , ...
     , s.price / l.capb10 AS capb10
  -- , ...
FROM  (
   SELECT *
        , (date - interval  '1 y')::date AS date1
        , (date - interval  '2 y')::date AS date2
        -- ...
        , (date - interval '10 y')::date AS date10
   FROM  (
      SELECT *, min(date) OVER (PARTITION BY security_id) AS min_date
      FROM   security_data
      ) s1
   ) s
LEFT   JOIN LATERAL (
   SELECT CASE WHEN s.date10 >= s.min_date THEN NULLIF(max(earnings)                               , 0) END AS pmax
        , CASE WHEN s.date10 >= s.min_date THEN NULLIF(min(earnings)                               , 0) END AS pmin
        -- ...
        ,                                       NULLIF(avg(earnings) FILTER (WHERE date >= s.date1), 0)     AS cape1   -- no case
        , CASE WHEN s.date2  >= s.min_date THEN NULLIF(avg(earnings) FILTER (WHERE date >= s.date2), 0) END AS cape2
        -- ...
        , CASE WHEN s.date10 >= s.min_date THEN NULLIF(avg(earnings)                               , 0) END AS cape10  -- no filter

        ,                                       NULLIF(avg(book)     FILTER (WHERE date >= s.date1), 0)     AS capb1
        , CASE WHEN s.date2  >= s.min_date THEN NULLIF(avg(book)     FILTER (WHERE date >= s.date2), 0) END AS capb2
        -- ...
        , CASE WHEN s.date10 >= s.min_date THEN NULLIF(avg(book)                                   , 0) END AS capb10
        -- ...
   FROM   security_data 
   WHERE  security_id = s.security_id
   AND    date >= s.date10
   AND    date <  s.date
   ) l ON s.date1 >= s.min_date  -- no computations if < 1 year of trailing data
ORDER  BY s.security_id, s.date;

It's still not going to be blazingly fast, since every row needs multiple separate aggregations. The bottleneck here will be CPU.

Also see the follow up with an alternative approach (JOIN to generated calendar + window functions):

  • Window functions filter through current row


来源:https://stackoverflow.com/questions/48000013/optimize-slow-aggregates-in-lateral-join

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!