问题
In Derby and/or MySQL, I am wondering about the performance of:
select c0.documentid from contentblob as c0 where c0.documentid > ?
and c0.blobid = ?
and exists (select * from contentblob c1 where c1.documentid = c0.documentid
and c1.blobid = ?)
order by c0.documentid
Note that DIFFERENT values will be supplied for the two blobid comparisons. What this query does is return the id's that occur in the table (at least) twice: once with blobid1 and once with blobid2.
compared to the equivalent self-join,
assuming, of course, indices on the documentid and blobid columns of the contentblob table.
In case it isn't clear, the relevant schema here is a table with two varchar columns, documentid and blobid. They are composed to form the primary key.
If there's nothing for it but to inquire of the query analyzer in the db, I'll go do that.
回答1:
Generally, correlated sub-queries are more expensive than joins, even outer self-joins, though in this case all you need is a regular inner self-join:
SELECT c0.documentid
FROM contentblob AS c0
JOIN contentblob AS c1 ON c1.documentid = c0.documentid
WHERE c0.documentid > ?
AND c0.blobid = ?
AND c1.blobid = ?
AND c0.blobid < c1.blobid
ORDER BY c0.documentid
The last condition avoids seeing two entries for a pair of rows with the same document ID and the relevant blob ID values; without it, you'd see the document ID twice, once for c0.blobid = v1 and once for c0.blobid = v2.
You might want to move the c1.blobid condition into the ON clause, but the optimizer should do that for you anyway. Also check that your optimizer is using an appropriate index. It isn't entirely clear what that index would be (but I think an index with blobid as the leading column would probably perform best), but if the tables are large and the optimizer is using a sequential scan of the table, you have problems.
回答2:
I think the left join would be clearer
select c0.documentid from contentblob as c0 where c0.documentid > ?
and c0.blobid = ?
left join contentblob c1 on c0.documentid = c1.documentid
and c1.blobid = ?
order by c0.documentid
Mileage will vary for different database engines, and, will also vary with table sizes, index cardinality etc. but my gut feeling is the join would perform better.
来源:https://stackoverflow.com/questions/4457132/joins-versus-correlated-exists-subqueries