I have been working on speeding up a query I\'m using for about a week now and asked several questions about it here ( How can I speed up fetching the results after running an s
The execution time geometrically proportional to the number of rows in each table rather than arithmetically e.g.
3 tables with 10 rows each => 1,000 comparision
3 tables with 10, 10 and 40 rows => 4,000 comparisons
3 tables with 20 rows each => 8,000 comparisons
You could probably re-factor the query to avoid some of the joins/cursors - when do you need an answer?
Could you do something like this:
SELECT precursor_id, feature_table_id
FROM MSMS_precursor
INNER JOIN
(
SELECT mzMin, mzMax, rtMin, rtMax, spectrum_id, feature_table_id, msrun_msrun_id
FROM spectrum
INNER JOIN
(select feature_table_id, mzMin, mzMax, rtMin, rtMax, msrun_msrun_id
from feature
where feature.msrun_msrun_id = 'value'
) subquery
ON subquery.msrun_msrun_id = spectrum.msrun_msrun_id
WHERE
spectrum.scan_start_time BETWEEN subquery.rtMin AND subquery.rtMax
) subquery
ON subquery.spectrum_id = MSMS_precursor.spectrum_spectrum_id
WHERE
MSMS_precursor.ion_mz BETWEEN subquery.mzMin AND subquery.mzMax
Using a subquery enables you to reduce the number of comparisons between the tables - you can quickly filter out the unwanted features, then the un-related spectra before searching for suitable precursors.
I don't use SQLLite - but the principle should still apply.
UPDATED : fixed bug in SQL
Notes:
You don't have to worry about the ANDs, you'll only get:
UPDATE 18/May:
It's the indexing!!! you have indexes on the search fields, but not on the fields participating in the joins - foreign key indices really boost performance:
CREATE INDEX `fk_msrun_msrun_id_feature` ON `feature` (`msrun_msrun_id` ASC);
CREATE INDEX `fk_spectrum_spectrum_id_feature` ON `feature` (`msrun_msrun_id` ASC);
CREATE INDEX `fk_spectrum_spectrum_id_MSMS_precursor` ON `MSMS_precursor` (`spectrum_spectrum_id` ASC);
Consider using covering indices on the tables involved in your query.
You are indeed fetching a limited amount of columns in your select
statement and the corresponding inner join
and where
clauses. By using a covering index with the columns well ordered in it, you should get a very fast query, that is you will remove the scan table
in favor of an search table using a covering index.
Try to use those indices on your tables:
CREATE INDEX `fk_covering_feature` ON `feature` (`msrun_msrun_id`,`mzMin`,`mzMax`,`rtMin`,`rtMax`,`feature_table_id`);
CREATE INDEX `fk_covering_spectrum` ON `spectrum` (`msrun_msrun_id`,`scan_start_time`,`spectrum_id`);
CREATE INDEX `fk_covering_MSMS_precursor` ON `MSMS_precursor` (`spectrum_spectrum_id`,`ion_mz`,`precursor_id`);
When going for speed, you should also hint the query planner to understand msrun_msrun_id is a constant to check for both feature
and spectrum
tables. Add the constant test in your query by putting this additional test at the end of the query (and pass spectrumFeature_InputValues
twice):
"AND spectrum.msrun_msrun_id = ?"
I suggest you try using an R*Tree index, They're designed for efficient range queries.
I haven't actually used R*Tree much, just read the documentation, but I think you might be using it incorrectly. You may want to try changing your query to use
WHERE convexhull_edges.rtMin <= spectrum.scan_start_time AND convexhull_edges.rtMax >= spectrum.scan_start_time AND
convexhull_edges.mzMin <= MSMS_precursor.ion_mz AND convexhull_edges.mzMax >= MSMS_precursor.ion_mz
which should be equivalent to your current query, but I think should be faster (You should be picking a range out of the R*Tree, rather than comparing the point to the range)