Sql: Optimizing BETWEEN clause

前端未结

关注

 9  750

孤街浪徒 2021-02-06 07:38

I wrote a statement that takes almost an hour to run so I am asking help so I can get to do this faster. So here we go:

I am making an inner join of two tables :

9条回答

不思量自难忘° (楼主)

2021-02-06 08:07

This is quite a common problem.

Plain B-Tree indexes are not good for the queries like this:

SELECT  measures.measure as measure,
        measures.time as time,
        intervals.entry_time as entry_time,
        intervals.exit_time as exit_time
FROM    intervals
JOIN    measures
ON      measures.time BETWEEN intervals.entry_time AND intervals.exit_time
ORDER BY
        time ASC

An index is good for searching the values within the given bounds, like this:

, but not for searching the bounds containing the given value, like this:

This article in my blog explains the problem in more detail:

Adjacency list vs. nested sets: MySQL

(the nested sets model deals with the similar type of predicate).

You can make the index on time, this way the intervals will be leading in the join, the ranged time will be used inside the nested loops. This will require sorting on time.

You can create a spatial index on intervals (available in MySQL using MyISAM storage) that would include start and end in one geometry column. This way, measures can lead in the join and no sorting will be needed.

The spatial indexes, however, are more slow, so this will only be efficient if you have few measures but many intervals.

Since you have few intervals but many measures, just make sure you have an index on measures.time:

CREATE INDEX ix_measures_time ON measures (time)

Update:

Here's a sample script to test:

BEGIN
        DBMS_RANDOM.seed(20091223);
END;
/

CREATE TABLE intervals (
        entry_time NOT NULL,
        exit_time NOT NULL
)
AS
SELECT  TO_DATE('23.12.2009', 'dd.mm.yyyy') - level,
        TO_DATE('23.12.2009', 'dd.mm.yyyy') - level + DBMS_RANDOM.value
FROM    dual
CONNECT BY
        level <= 1500
/

CREATE UNIQUE INDEX ux_intervals_entry ON intervals (entry_time)
/

CREATE TABLE measures (
        time NOT NULL,
        measure NOT NULL
)
AS
SELECT  TO_DATE('23.12.2009', 'dd.mm.yyyy') - level / 720,
        CAST(DBMS_RANDOM.value * 10000 AS NUMBER(18, 2))
FROM    dual
CONNECT BY
        level <= 1080000
/

ALTER TABLE measures ADD CONSTRAINT pk_measures_time PRIMARY KEY (time)
/

CREATE INDEX ix_measures_time_measure ON measures (time, measure)
/

This query:

SELECT  SUM(measure), AVG(time - TO_DATE('23.12.2009', 'dd.mm.yyyy'))
FROM    (
        SELECT  *
        FROM    (
                SELECT  /*+ ORDERED USE_NL(intervals measures) */
                        *
                FROM    intervals
                JOIN    measures
                ON      measures.time BETWEEN intervals.entry_time AND intervals.exit_time
                ORDER BY
                        time
                )
        WHERE   rownum <= 500000
        )

uses NESTED LOOPS and returns in 1.7 seconds.

This query:

SELECT  SUM(measure), AVG(time - TO_DATE('23.12.2009', 'dd.mm.yyyy'))
FROM    (
        SELECT  *
        FROM    (
                SELECT  /*+ ORDERED USE_MERGE(intervals measures) */
                        *
                FROM    intervals
                JOIN    measures
                ON      measures.time BETWEEN intervals.entry_time AND intervals.exit_time
                ORDER BY
                        time
                )
        WHERE   rownum <= 500000
        )

uses MERGE JOIN and I had to stop it after 5 minutes.

Update 2:

You will most probably need to force the engine to use the correct table order in the join using a hint like this:

SELECT  /*+ LEADING (intervals) USE_NL(intervals, measures) */
        measures.measure as measure,
        measures.time as time,
        intervals.entry_time as entry_time,
        intervals.exit_time as exit_time
FROM    intervals
JOIN    measures
ON      measures.time BETWEEN intervals.entry_time AND intervals.exit_time
ORDER BY
        time ASC

The Oracle's optimizer is not smart enough to see that the intervals do not intersect. That's why it will most probably use measures as a leading table (which would be a wise decision should the intervals intersect).

Update 3:

WITH    splits AS
        (
        SELECT  /*+ MATERIALIZE */
                entry_range, exit_range,
                exit_range - entry_range + 1 AS range_span,
                entry_time, exit_time
        FROM    (
                SELECT  TRUNC((entry_time - TO_DATE(1, 'J')) * 2) AS entry_range,
                        TRUNC((exit_time - TO_DATE(1, 'J')) * 2) AS exit_range,
                        entry_time,
                        exit_time
                FROM    intervals
                )
        ),
        upper AS
        (
        SELECT  /*+ MATERIALIZE */
                MAX(range_span) AS max_range
        FROM    splits
        ),
        ranges AS
        (
        SELECT  /*+ MATERIALIZE */
                level AS chunk
        FROM    upper
        CONNECT BY
                level <= max_range
        ),
        tiles AS
        (
        SELECT  /*+ MATERIALIZE USE_MERGE (r s) */
                entry_range + chunk - 1 AS tile,
                entry_time,
                exit_time
        FROM    ranges r
        JOIN    splits s
        ON      chunk <= range_span
        )
SELECT  /*+ LEADING(t) USE_HASH(m t) */
        SUM(LENGTH(stuffing))
FROM    tiles t
JOIN    measures m
ON      TRUNC((m.time - TO_DATE(1, 'J')) * 2) = tile
        AND m.time BETWEEN t.entry_time AND t.exit_time

This query splits the time axis into the ranges and uses a HASH JOIN to join the measures and timestamps on the range values, with fine filtering later.

See this article in my blog for more detailed explanations on how it works:

Oracle: joining timestamps and time intervals

0 讨论(0)

查看其它9个回答