Optimize SQL that uses between clause

故里飘歌 2021-01-11 18:03

Consider the following 2 tables:

Table A:
id
event_time

Table B:
id
start_time
end_time

Every record in table A is mapped to exactly 1 reco

19 answers
  • 2021-01-11 18:30

    MySQL doesn't let you use INDEX ORDER BY WITH RANGE in derived queries.

    That's why you'll need to create a user defined function.

    Note that if your ranges do overlap, the query will only select one (which started last).

    CREATE UNIQUE INDEX ux_b_start ON b (start_time);
    
    CREATE FUNCTION `fn_get_last_b`(event_date TIMESTAMP) RETURNS int(11)
    BEGIN
      DECLARE id INT;
      SELECT b.id
      INTO id
      FROM b
      FORCE INDEX (ux_b_start)
      WHERE b.start_time <= event_date
      ORDER BY
        b.start_time DESC
      LIMIT 1;
      RETURN id;
    END;
    
    SELECT COUNT(*) FROM a;
    
    1000
    
    
    SELECT COUNT(*) FROM b;
    
    200000
    
    SELECT *
    FROM (
      SELECT fn_get_last_b(a.event_time) AS bid,
             a.*
      FROM a
    ) ao, b FORCE INDEX (PRIMARY)
    WHERE b.id = ao.bid
      AND b.end_time >= ao.event_time
    
    1000 rows fetched in 0,0143s (0,1279s)
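
    Since the function silently picks the range that started last, it may be worth checking first whether any ranges in b overlap at all. The query below is only a sketch: it assumes MySQL 8.0 or later (window functions are not available in older versions) and the b(id, start_time, end_time) layout from the question. Any row it returns starts while an earlier range is still open.

```sql
-- Assumption: MySQL 8.0+ (window functions) and table b(id, start_time, end_time).
-- For each range, find the latest end_time among all ranges that start earlier.
SELECT id, start_time, end_time, prev_max_end
FROM (
  SELECT id, start_time, end_time,
         MAX(end_time) OVER (
           ORDER BY start_time
           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
         ) AS prev_max_end
  FROM b
) ranges_with_prev
-- BETWEEN is inclusive, so touching endpoints count as an overlap here.
WHERE prev_max_end >= start_time;
```

    An empty result suggests the ranges are disjoint, in which case the "started last" rule never has to choose between candidates.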
    
  • 2021-01-11 18:35

    I have made some tests for a similar problem - calculating a country based on an ip address (given as a number). Here are my data and results:

    • Table A (that contains users and IP addresses) contains about 20 records.
    • Table B (that contains the IP ranges for each country) contains about 100000 records.

    • The JOIN query using "between" takes about 10 seconds.
    • The SELECT inside a SELECT query, using "between", takes about 5.5 seconds.
    • The SELECT inside a SELECT query, using a spatial index, takes about 6.3 seconds.
    • The JOIN query using a spatial index takes 0 seconds!

  • 2021-01-11 18:37

    The only way to speed up this query is to make use of indexes.

    Put A.event_time in one index, and B.start_time and B.end_time together in another.

    If, as you said, this is the only condition that binds the two entities together, I think this is the only solution available.
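
    As a rough sketch of what that could look like (the index names are made up for illustration; the columns are the ones from the question), followed by an EXPLAIN to check whether the optimizer actually uses them, which depends on the MySQL version and the data:

```sql
-- Hypothetical index names; columns are those given in the question.
CREATE INDEX ix_a_event_time ON A (event_time);
CREATE INDEX ix_b_start_end  ON B (start_time, end_time);

-- Inspect the plan for the range join to see whether either index is chosen.
EXPLAIN
SELECT A.id, B.id
FROM A
JOIN B ON A.event_time BETWEEN B.start_time AND B.end_time;
```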

    Fede

  • 2021-01-11 18:39

    I can't think of a reason to have a table with 130,000 rows of time intervals. Anyway, there must be a good reason for such a design, and if so, you should avoid computing this join every time. So here's my suggestion: add a reference to B.id in table A (A.B_ID) and use triggers to maintain consistency. Whenever you add a new record (insert trigger) or the event_time column changes (update trigger), recompute the reference to the B row that the time falls in. Your select statement would then be reduced to a single select * from A.
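
    A minimal sketch of that idea follows. It assumes B.id is an INT, that ranges in B do not overlap, and that the statements are run from the mysql command-line client (DELIMITER is a client command). The trigger names are hypothetical.

```sql
-- Sketch only: assumes B.id is an INT and that ranges in B do not overlap.
ALTER TABLE A ADD COLUMN B_ID INT NULL;

-- One-off backfill for rows that already exist in A.
UPDATE A
SET B_ID = (SELECT B.id
            FROM B
            WHERE A.event_time BETWEEN B.start_time AND B.end_time
            LIMIT 1);

DELIMITER //

-- Hypothetical trigger names.
CREATE TRIGGER trg_a_set_b_id_ins BEFORE INSERT ON A
FOR EACH ROW
BEGIN
  SET NEW.B_ID = (SELECT B.id
                  FROM B
                  WHERE NEW.event_time BETWEEN B.start_time AND B.end_time
                  LIMIT 1);
END//

CREATE TRIGGER trg_a_set_b_id_upd BEFORE UPDATE ON A
FOR EACH ROW
BEGIN
  -- Recompute only when event_time actually changed (<=> is NULL-safe equality).
  IF NOT (NEW.event_time <=> OLD.event_time) THEN
    SET NEW.B_ID = (SELECT B.id
                    FROM B
                    WHERE NEW.event_time BETWEEN B.start_time AND B.end_time
                    LIMIT 1);
  END IF;
END//

DELIMITER ;
```

    Note that these triggers only react to changes in A. If rows in B are inserted, updated, or deleted, the stored references would need to be refreshed separately, for example by rerunning the backfill.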

  • 2021-01-11 18:39

    There are two caveats to my solution:

    1) You said that you can add indexes but not change the schema, so I'm not sure whether this will work for you: MySQL has no function-based indexes, so you would need to add an extra column to table B.

    2) Table B must use the MyISAM engine, because only MyISAM supports spatial indexes. If you cannot use MyISAM, this solution won't work.

    So, assuming that the above two aren't an issue for you, the following should work and give you good performance:

    This solution makes use of MySQL's support for Spatial Data (see documentation here). While spatial data types can be added to a variety of storage engines, only MyISAM is supported for Spatial R-Tree Indexes (see documentation here) which are needed in order to get the performance needed. One other limitation is that spatial data types only work with numerical data so you cannot use this technique with string based range queries.

    I won't go into the details of the theory behind how spatial types work and how the spatial index is useful, but you should look at Jeremy Cole's explanation here in regards to how to use spatial data types and indexes for GeoIP lookups. Also look at the comments, as they raise some useful points and alternatives if you need raw performance and can give up some accuracy.

    The basic premise is that we can take the start and end values and use them to create the four corners of a rectangle that straddles the x-axis of an xy grid (x running from start to end, y from -1 to 1), and then do a quick lookup in the spatial index to determine whether the point in time we care about, placed at y = 0, falls within the rectangle. As mentioned previously, see Jeremy Cole's explanation for a more thorough overview of how this works.
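
    To make the geometry concrete, here is a small standalone check using literal values instead of table columns. It is only an illustration: the numbers 100 and 200 stand in for a range's start and end, and it assumes a MySQL version that has ST_GeomFromText (5.6.1 and later; older versions spell it GeomFromText).

```sql
-- Illustrative values: a range from 100 to 200, drawn as a rectangle with y from -1 to 1.
-- Assumption: ST_GeomFromText is available (older MySQL versions use GeomFromText).
SELECT
  MBRContains(
    ST_GeomFromText('POLYGON((100 -1, 200 -1, 200 1, 100 1, 100 -1))'),
    ST_GeomFromText('POINT(150 0)')   -- x = 150 lies within [100, 200]
  ) AS inside_range,                  -- returns 1
  MBRContains(
    ST_GeomFromText('POLYGON((100 -1, 200 -1, 200 1, 100 1, 100 -1))'),
    ST_GeomFromText('POINT(250 0)')   -- x = 250 lies outside [100, 200]
  ) AS outside_range;                 -- returns 0
```

    Because MBRContains compares bounding rectangles only, the test reduces to start <= x <= end, which is the same condition the between clause expresses.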

    In your particular case we will need to do the following:

    1) Alter the table to be a MyISAM table (note you shouldn't do this unless you are fully aware of the consequences of such a change like the lack of transactions and the table locking behavior that are associated with MyISAM).

    alter table B engine = MyISAM;
    

    2) Next we add the new column that will hold the spatial data. We will use the polygon data type as we need to be able to hold a full rectangle.

    alter table B add column time_poly polygon NOT NULL;
    

    3) Next we populate the new column with the data (please keep in mind that any processes that update or insert into table B will need to get modified to make sure they are populating the new column also). Since the start and end ranges are times, we will need to convert them to numbers with the unix_timestamp function (see documentation here for how it works).

    -- Build a closed ring (first point repeated last) and wrap it in a polygon to match the column type.
    update B set time_poly := POLYGONFROMWKB(POLYGON(LINESTRING(
        POINT(unix_timestamp(start_time), -1),
        POINT(unix_timestamp(end_time), -1),
        POINT(unix_timestamp(end_time), 1),
        POINT(unix_timestamp(start_time), 1),
        POINT(unix_timestamp(start_time), -1)
      )));
    

    4) Next we add the spatial index to the table (as mentioned previously, this only works for a MyISAM table; on other engines it produces the error "ERROR 1464 (HY000): The used table type doesn't support SPATIAL indexes").

    alter table B add SPATIAL KEY `IXs_time_poly` (`time_poly`);
    

    5) Next you will need to use the following select in order to make use of the spatial index when querying the data.

    SELECT A.id, B.id 
    FROM A inner join B force index (IXs_time_poly)
    ON MBRCONTAINS(B.time_poly, POINTFROMWKB(POINT(unix_timestamp(A.event_time), 0)));
    

    The force index is there to make 100% sure that MySQL will use the index for the lookup. If everything went well running an explain on the above select should show something similar to the following:

    mysql> explain SELECT A.id, B.id
        -> FROM A inner join B force index (IXs_time_poly)
        -> on MBRCONTAINS(B.time_poly, POINTFROMWKB(POINT(unix_timestamp(A.event_time), 0)));
    +----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------------------------------+
    | id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra                                           |
    +----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------------------------------+
    |  1 | SIMPLE      | A     | ALL  | NULL          | NULL | NULL    | NULL |    1065 |                                                 | 
    |  1 | SIMPLE      | B     | ALL  | IXs_time_poly | NULL | NULL    | NULL | 7969897 | Range checked for each record (index map: 0x10) | 
    +----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------------------------------+
    2 rows in set (0.00 sec)
    

    Please refer to Jeremy Cole's analysis for details about the performance benefits of this method as compared with a between clause.

    Let me know if you have any questions.

    Thanks,

    -Dipin

  • 2021-01-11 18:40

    I'm not sure this can be optimized fully. I tried it on MySQL 5.1.30. I also added an index on {B.start_time, B.end_time} as suggested by other folks. Then I got a report from EXPLAIN, but the best I could get is a Range Access Method:

    EXPLAIN SELECT A.id, B.id FROM A JOIN B 
    ON A.event_time BETWEEN B.start_time AND B.end_time;
    
    +----+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
    | id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                                          |
    +----+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
    |  1 | SIMPLE      | A     | ALL  | event_time    | NULL | NULL    | NULL |    8 |                                                | 
    |  1 | SIMPLE      | B     | ALL  | start_time    | NULL | NULL    | NULL |   96 | Range checked for each record (index map: 0x4) | 
    +----+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
    

    See the note on the far right. The optimizer thinks it might be able to use the index on {B.start_time, B.end_time} but it ended up deciding not to use that index. Your results may vary, because your data distribution is more representative.

    Compare with the index usage if you compare A.event_time to a constant range:

    EXPLAIN SELECT A.id FROM A
    WHERE A.event_time BETWEEN '2009-02-17 09:00' and '2009-02-17 10:00';
    
    +----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
    | id | select_type | table | type  | possible_keys | key        | key_len | ref  | rows | Extra       |
    +----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
    |  1 | SIMPLE      | A     | range | event_time    | event_time | 8       | NULL |    1 | Using where | 
    +----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
    

    And compare with the dependent sub-query form given by @Luke and @Kibbee, which seems to make use of indexes more effectively:

    EXPLAIN SELECT A.id AS id_from_a,
        (
            SELECT B.id
            FROM B
            WHERE A.id BETWEEN B.start_time AND B.end_time
            LIMIT 0, 1
        ) AS id_from_b
    FROM A;
    
    +----+--------------------+-------+-------+---------------+---------+---------+------+------+-------------+
    | id | select_type        | table | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
    +----+--------------------+-------+-------+---------------+---------+---------+------+------+-------------+
    |  1 | PRIMARY            | A     | index | NULL          | PRIMARY | 8       | NULL |    8 | Using index | 
    |  2 | DEPENDENT SUBQUERY | B     | ALL   | start_time    | NULL    | NULL    | NULL |  384 | Using where | 
    +----+--------------------+-------+-------+---------------+---------+---------+------+------+-------------+
    

    Weirdly, EXPLAIN lists possible_keys as NULL (i.e. no indexes could be used) but then decides to use the primary key after all. Could be an idiosyncrasy of MySQL's EXPLAIN report?
