Why is the same SQLite query being 30 times slower when fetching only twice as many results?

前端 未结 3 2040
清酒与你
清酒与你 2021-02-01 15:16

I have been working on speeding up a query I\'m using for about a week now and asked several questions about it here ( How can I speed up fetching the results after running an s

相关标签:
3条回答
  • 2021-02-01 15:57

    The execution time geometrically proportional to the number of rows in each table rather than arithmetically e.g.

    3 tables with 10 rows each => 1,000 comparision
    
    3 tables with 10, 10 and 40 rows => 4,000 comparisons
    
    3 tables with 20 rows each => 8,000 comparisons
    

    You could probably re-factor the query to avoid some of the joins/cursors - when do you need an answer?

    Could you do something like this:

    SELECT precursor_id, feature_table_id 
    FROM MSMS_precursor
    INNER JOIN 
    
        (
            SELECT mzMin, mzMax, rtMin, rtMax, spectrum_id, feature_table_id, msrun_msrun_id
    
            FROM spectrum
            INNER JOIN 
    
               (select feature_table_id, mzMin, mzMax, rtMin, rtMax, msrun_msrun_id
                from feature
                where feature.msrun_msrun_id = 'value'
               ) subquery 
    
            ON subquery.msrun_msrun_id = spectrum.msrun_msrun_id
            WHERE 
                spectrum.scan_start_time BETWEEN subquery.rtMin AND subquery.rtMax 
        ) subquery
    
        ON subquery.spectrum_id = MSMS_precursor.spectrum_spectrum_id 
    
    WHERE 
        MSMS_precursor.ion_mz BETWEEN subquery.mzMin AND subquery.mzMax 
    

    Using a subquery enables you to reduce the number of comparisons between the tables - you can quickly filter out the unwanted features, then the un-related spectra before searching for suitable precursors.

    I don't use SQLLite - but the principle should still apply.

    UPDATED : fixed bug in SQL

    Notes:

    You don't have to worry about the ANDs, you'll only get:

    • features where feature.msrun_msrun_id = 'value'
    • spectra for those features and where spectrum.scan_start_time BETWEEN subquery.rtMin AND subquery.rtMax
    • precursors for those spectrs and where MSMS_precursor.ion_mz BETWEEN subquery.mzMin AND subquery.mzMax

    UPDATE 18/May:

    It's the indexing!!! you have indexes on the search fields, but not on the fields participating in the joins - foreign key indices really boost performance:

    CREATE INDEX `fk_msrun_msrun_id_feature` ON `feature` (`msrun_msrun_id` ASC); 
    CREATE INDEX `fk_spectrum_spectrum_id_feature` ON `feature` (`msrun_msrun_id` ASC); 
    CREATE INDEX `fk_spectrum_spectrum_id_MSMS_precursor` ON `MSMS_precursor` (`spectrum_spectrum_id` ASC); 
    
    0 讨论(0)
  • 2021-02-01 16:08

    Consider using covering indices on the tables involved in your query.

    You are indeed fetching a limited amount of columns in your select statement and the corresponding inner join and where clauses. By using a covering index with the columns well ordered in it, you should get a very fast query, that is you will remove the scan table in favor of an search table using a covering index.

    Try to use those indices on your tables:

    CREATE INDEX `fk_covering_feature` ON `feature` (`msrun_msrun_id`,`mzMin`,`mzMax`,`rtMin`,`rtMax`,`feature_table_id`);
    CREATE INDEX `fk_covering_spectrum` ON  `spectrum` (`msrun_msrun_id`,`scan_start_time`,`spectrum_id`);
    CREATE INDEX `fk_covering_MSMS_precursor` ON  `MSMS_precursor` (`spectrum_spectrum_id`,`ion_mz`,`precursor_id`);
    

    When going for speed, you should also hint the query planner to understand msrun_msrun_id is a constant to check for both feature and spectrum tables. Add the constant test in your query by putting this additional test at the end of the query (and pass spectrumFeature_InputValues twice):

    "AND spectrum.msrun_msrun_id = ?"
    
    0 讨论(0)
  • 2021-02-01 16:09

    I suggest you try using an R*Tree index, They're designed for efficient range queries.


    I haven't actually used R*Tree much, just read the documentation, but I think you might be using it incorrectly. You may want to try changing your query to use

    WHERE convexhull_edges.rtMin <= spectrum.scan_start_time AND convexhull_edges.rtMax >= spectrum.scan_start_time AND
    convexhull_edges.mzMin <= MSMS_precursor.ion_mz AND convexhull_edges.mzMax >= MSMS_precursor.ion_mz
    

    which should be equivalent to your current query, but I think should be faster (You should be picking a range out of the R*Tree, rather than comparing the point to the range)

    0 讨论(0)
提交回复
热议问题