A SQL query searching for rows that satisfy Column1 <= X <= Column2 is very slow

盖世英雄少女心 2021-01-11 16:27

I am using a MySQL DB, and have the following table:

CREATE TABLE SomeTable (
  PrimaryKeyCol BIGINT(20) NOT NULL,
  A BIGINT(20) NOT NULL,
  FirstX INT(11) N         


        
12 answers
  • 2021-01-11 17:10

    Another approach is to precalculate the solutions, if that number isn't too big.

    CREATE TABLE SomeTableLookUp (
        X INT NOT NULL,
        PrimaryKeyCol BIGINT NOT NULL,
        PRIMARY KEY(X, PrimaryKeyCol)
    );
    

    And now you just pre-populate your constant table.

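    -- Interval columns are named FirstX/LastX, as in the question.
    -- SomeTable has no X column there, so point the subquery at whatever
    -- holds the X values you will search for.
    INSERT INTO SomeTableLookUp (X, PrimaryKeyCol)
    SELECT XS.X, SomeTable.PrimaryKeyCol
    FROM SomeTable
    JOIN (
       SELECT DISTINCT X FROM SomeTable
    ) XS
    ON XS.X BETWEEN SomeTable.FirstX AND SomeTable.LastX;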
    

    And now you can SELECT your answers directly.

    SELECT SomeTable.*
    FROM SomeTableLookUp
    JOIN SomeTable
    ON SomeTableLookUp.PrimaryKeyCol = SomeTable.PrimaryKeyCol
    WHERE SomeTableLookUp.X = ?
    LIMIT 10;
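The precalculation idea above can be tried end to end with a small sketch. It uses Python's built-in sqlite3 module only so that it runs without a server; MySQL is the engine in question and its planner differs. The `XValues` table, the sample rows, and the `rows_containing` helper are assumptions made for illustration, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SomeTable (
        PrimaryKeyCol INTEGER PRIMARY KEY,
        FirstX INTEGER NOT NULL,
        LastX  INTEGER NOT NULL
    );
    -- Hypothetical table holding the X values that will be searched for.
    CREATE TABLE XValues (X INTEGER PRIMARY KEY);
    CREATE TABLE SomeTableLookUp (
        X INTEGER NOT NULL,
        PrimaryKeyCol INTEGER NOT NULL,
        PRIMARY KEY (X, PrimaryKeyCol)
    );
""")
conn.executemany("INSERT INTO SomeTable VALUES (?, ?, ?)",
                 [(1, 0, 10), (2, 5, 15), (3, 20, 30)])
conn.executemany("INSERT INTO XValues VALUES (?)", [(7,), (12,), (25,), (40,)])

# Pre-populate: one lookup row per (X, interval containing X) pair.
conn.execute("""
    INSERT INTO SomeTableLookUp (X, PrimaryKeyCol)
    SELECT XS.X, T.PrimaryKeyCol
    FROM SomeTable AS T
    JOIN XValues AS XS ON XS.X BETWEEN T.FirstX AND T.LastX
""")


def rows_containing(x, limit=10):
    """Answer the range question with an equality lookup on X."""
    cur = conn.execute("""
        SELECT T.PrimaryKeyCol
        FROM SomeTableLookUp AS L
        JOIN SomeTable AS T ON T.PrimaryKeyCol = L.PrimaryKeyCol
        WHERE L.X = ?
        ORDER BY T.PrimaryKeyCol
        LIMIT ?
    """, (x, limit))
    return [row[0] for row in cur]
```

The trade-off is storage: the lookup table holds one row for every (X, matching row) pair, which is why the approach only suits cases where that number stays small.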
    
  • 2021-01-11 17:11

    I found a solution that relies on properties of the data in the table. I would rather have a more general solution that doesn't depend on the current data, but for the time being that's the best I have.

    The problem with the original query:

    SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND LastX >= ? LIMIT 10;
    

    is that the execution may require scanning a large percentage of the entries in the FirstX,LastX,P index when the first condition FirstX <= ? is satisfied by a large percentage of the rows.

    What I did to reduce the execution time is observe that LastX-FirstX is relatively small.

    I ran the query:

    SELECT MAX(LastX-FirstX) FROM SomeTable;
    

    and got 4200000.

    This means that FirstX >= LastX - 4200000 for all the rows in the table.

    So in order to satisfy LastX >= ?, we must also satisfy FirstX >= ? - 4200000.

    So we can add a condition to the query as follows:

    SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND FirstX >= ? - 4200000 AND LastX >= ? LIMIT 10;
    

    In the example I tested in the question, the number of index entries processed was reduced from 2104820 to 18 and the running time was reduced from 0.563 seconds to 0.0003 seconds.

    I tested the new query with the same 120000 values of X. The output was identical to the old query. The time went down from over 10 hours to 5.5 minutes, which is over 100 times faster.
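The equivalence of the two queries can be checked on synthetic data. This is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for MySQL and random intervals at most 499 wide; the index name and the helper functions are hypothetical. LIMIT is left out so the two result sets can be compared exactly.

```python
import random
import sqlite3

random.seed(0)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SomeTable ("
    "P INTEGER PRIMARY KEY, FirstX INTEGER NOT NULL, LastX INTEGER NOT NULL)"
)

# Synthetic intervals: random start, width between 0 and 499.
rows = []
for p in range(2000):
    first = random.randrange(0, 1_000_000)
    rows.append((p, first, first + random.randrange(0, 500)))
conn.executemany("INSERT INTO SomeTable VALUES (?, ?, ?)", rows)
conn.execute("CREATE INDEX idx_first_last ON SomeTable (FirstX, LastX, P)")

# The data-dependent bound; it must be recomputed whenever the data changes.
max_width = conn.execute(
    "SELECT MAX(LastX - FirstX) FROM SomeTable"
).fetchone()[0]

ORIGINAL = ("SELECT P FROM SomeTable "
            "WHERE FirstX <= ? AND LastX >= ? ORDER BY P")
BOUNDED = ("SELECT P FROM SomeTable "
           "WHERE FirstX <= ? AND FirstX >= ? - ? AND LastX >= ? ORDER BY P")


def original(x):
    """Rows whose interval contains x, using the original predicate."""
    return [r[0] for r in conn.execute(ORIGINAL, (x, x))]


def bounded(x):
    """Same rows, with the extra lower bound on FirstX."""
    return [r[0] for r in conn.execute(BOUNDED, (x, x, max_width, x))]
```

The extra condition is only safe while `max_width` really is the maximum of LastX - FirstX; a wider row inserted later would be silently missed until the bound is refreshed.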

  • 2021-01-11 17:12

    Edit: Idea #2

    Do you have control over the Java app? Because, honestly, 0.3 seconds for an index scan is not bad. Your problem is that you're trying to get a query, run 120,000 times, to have a reasonable end time.

    If you do have control over the Java app, you could have it submit all the X values at once, so that SQL does not have to do an index scan 120k times. Or you could even program the logic on the Java side, since it would be relatively easy to optimize.
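Doing the matching in the application could look roughly like the sketch below. It is written in Python rather than Java for brevity, assumes all intervals fit in memory, and borrows the bounded-width observation from another answer on this page to keep each scan short. `build_index` and `find_containing` are hypothetical names.

```python
from bisect import bisect_left, bisect_right


def build_index(intervals):
    """Sort (first_x, last_x, key) tuples by first_x and record the max width."""
    ordered = sorted(intervals)
    firsts = [first for first, _, _ in ordered]
    max_width = max((last - first for first, last, _ in ordered), default=0)
    return ordered, firsts, max_width


def find_containing(index, x, limit=10):
    """Return up to `limit` keys whose interval satisfies first_x <= x <= last_x."""
    ordered, firsts, max_width = index
    # Only intervals with x - max_width <= first_x <= x can contain x.
    lo = bisect_left(firsts, x - max_width)
    hi = bisect_right(firsts, x)
    hits = []
    for i in range(lo, hi):
        _, last, key = ordered[i]
        if last >= x:
            hits.append(key)
            if len(hits) == limit:
                break
    return hits
```

Building the index is a one-off sort; each lookup is then two binary searches plus a scan over the candidates in the window, which stays small as long as the widest interval is narrow relative to the whole range.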

    Original Idea:

    Have you tried creating a Multiple-Column index?

    The problem with having multiple indexes is that each index is only going to narrow it down to ~50% of the records - it has to then match those ~2 million rows of Index A against ~2 million rows of Index B.

    Instead, if you get both columns in the same index, the SQL engine can first do a Seek operation to get to the start of the records, and then do a single Index Scan to get the list of records it needs. No matching one index against another.

    I'd suggest not making this the Clustered Index, though. The reason for that? You're not expecting many results, so matching the Index Scan's results against the table isn't going to be time consuming. Instead, you want to make the Index as small as possible, so that the Index Scan goes as fast as possible. Clustered Indexes are the table - so a Clustered Index is going to have the same Scan speed as the table itself. Along the same lines, you probably don't want any other fields other than FirstX and LastX in your index - make that Index as tiny as you can, so that the scan flies along.

    Finally, like you're doing now, you're going to need to clue the engine in that you're not expecting a large set of data back from the search - you want to make sure it's using that compact Index for its scan (instead of it saying, "Eh, I'd be better off just doing a full table scan.")

  • 2021-01-11 17:12

    Suppose you got the execution time down to 0.1 seconds. Would the resulting 3 hours, twenty minutes be acceptable?

    The simple fact is that thousands of calls to the same query is incredibly inefficient. Quite aside from what the database has to endure, there is network traffic to think of, disk seek times and all kinds of processing overhead.

    Supposing that you don't already have the 120,000 values for x in a table, that's where I would start. I would insert them into a table in batches of 500 or so at a time:

    insert into xvalues (x)
    select 14 union all
    select 18 union all
    select 42 /* and so on */
    

    Then, change your query to join to xvalues.

    I reckon that optimisation alone will get your run-time down to minutes or seconds instead of hours (based on many such optimisations I have done through the years).
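A rough sketch of the batch-and-join idea follows. It assumes SQLite through Python's sqlite3 module instead of MySQL, and uses `executemany` in place of the `union all` insert shown above. The sample rows and the helpers `load_x_values` and `match_all` are hypothetical. Note that a plain join returns every match for every x, so it does not reproduce a per-x `LIMIT 10`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SomeTable (
        P INTEGER PRIMARY KEY,
        FirstX INTEGER NOT NULL,
        LastX  INTEGER NOT NULL
    );
    CREATE TABLE xvalues (x INTEGER NOT NULL);
""")
conn.executemany("INSERT INTO SomeTable VALUES (?, ?, ?)",
                 [(1, 0, 10), (2, 5, 15), (3, 20, 30)])


def load_x_values(values, batch_size=500):
    """Insert the x values in batches instead of one statement per value."""
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        conn.executemany("INSERT INTO xvalues (x) VALUES (?)",
                         [(v,) for v in batch])


def match_all():
    """One set-based query that replaces the per-x round trips."""
    cur = conn.execute("""
        SELECT v.x, t.P
        FROM xvalues AS v
        JOIN SomeTable AS t ON t.FirstX <= v.x AND t.LastX >= v.x
        ORDER BY v.x, t.P
    """)
    return cur.fetchall()
```

Calling `load_x_values([...])` once and then `match_all()` gives the (x, P) pairs for all values in a single query.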

    It also opens up the door for further optimisations. If the x values are likely to have at least some duplicates (say, at least 20% of values occur more than once) it may be worth investigating a solution where you only run the query for unique values and do the insert into SomeTable for every x with the matching value.

    As a rule: anything you can do in bulk is likely to far outperform anything you do row by row.

    PS:

    You referred to a query, but a stored procedure can also work with an input table. In some RDBMSs you can pass a table as a parameter. I don't think that works in MySQL, but you can create a temporary table that the calling code fills in and the stored procedure joins to, or a permanent table used in the same way. The major drawback of a permanent table is that you may need to concern yourself with session management or discarding stale data. Only you will know if that is applicable to your case.

  • 2021-01-11 17:16

    To optimize this query:

    SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND LastX >= ? LIMIT 10;

    Here are two resources you can use:

    • descending indexes
    • spatial indexes

    Descending indexes:

    One option is to use an index that is descending on FirstX and ascending on LastX.

    https://dev.mysql.com/doc/refman/8.0/en/descending-indexes.html

    Something like:

    CREATE INDEX SomeIndex on SomeTable (FirstX DESC, LastX);

    Alternatively, you could instead create the index (LastX, FirstX DESC).

    Spatial indexes:

    Another option is to use a SPATIAL INDEX with (FirstX, LastX). If you think of FirstX and LastX as 2D spatial coordinates, then your search selects the points in a contiguous area delimited by the lines FirstX <= X and LastX >= X.

    Here's a link on spatial indexes (not specific to MySQL, but with drawings):

    https://docs.microsoft.com/en-us/sql/relational-databases/spatial/spatial-indexes-overview

  • 2021-01-11 17:18

    One way might be to partition the table by ranges and then query only the partitions that fit the range, which makes the amount of data to check much smaller. This might not work, since the Java side may be slower, but it might put less stress on the database. There might also be a way to avoid querying the database so many times by using a more inclusive SQL statement (you might be able to send a list of values and have the SQL send the results to a different table).
