Here's the query:
SELECT top 100 a.LocationId, b.SearchQuery, b.SearchRank
FROM dbo.Locations a
INNER JOIN dbo.LocationCache b ON a.LocationId = b.LocationId
WHERE a.CountryId = 2
AND a.[Type] = 7
I did a quick test and came up with the following:
CREATE TABLE #Locations
(LocationID INT NOT NULL ,
CountryID INT NOT NULL ,
[Type] INT NOT NULL
CONSTRAINT PK_Locations
PRIMARY KEY CLUSTERED ( LocationID ASC )
)
CREATE NONCLUSTERED INDEX [LocationsIndex01] ON #Locations
(
CountryID ASC,
[Type] ASC
)
CREATE TABLE #LocationCache
(LocationID INT NOT NULL ,
SearchQuery VARCHAR(50) NULL ,
SearchRank INT NOT NULL
CONSTRAINT PK_LocationCache
PRIMARY KEY CLUSTERED ( LocationID ASC )
)
CREATE NONCLUSTERED INDEX [LocationCacheIndex01] ON #LocationCache
(
LocationID ASC,
SearchQuery ASC,
SearchRank ASC
)
INSERT INTO #Locations
SELECT 1,1,1 UNION
SELECT 2,1,4 UNION
SELECT 3,2,7 UNION
SELECT 4,2,7 UNION
SELECT 5,1,1 UNION
SELECT 6,1,4 UNION
SELECT 7,2,7 UNION
SELECT 8,2,7 --UNION
INSERT INTO #LocationCache
SELECT 4,'BlahA',10 UNION
SELECT 3,'BlahB',9 UNION
SELECT 2,'BlahC',8 UNION
SELECT 1,'BlahD',7 UNION
SELECT 8,'BlahE',6 UNION
SELECT 7,'BlahF',5 UNION
SELECT 6,'BlahG',4 UNION
SELECT 5,'BlahH',3 --UNION
SELECT * FROM #Locations
SELECT * FROM #LocationCache
SELECT top 3 a.LocationId, b.SearchQuery, b.SearchRank
FROM #Locations a
INNER JOIN #LocationCache b ON a.LocationId = b.LocationId
WHERE a.CountryId = 2
AND a.[Type] = 7
DROP TABLE #Locations
DROP TABLE #LocationCache
For me, the query plan shows two seeks with a nested loop inner join. If you run this, do you get both seeks? If you do, then do a test on your system: create copies of your Locations and LocationCache tables, call them say Locations2 and LocationCache2, recreate all the indexes, and copy your data into them. Then try your query against the new tables. A rough sketch of that copy step is below.
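A minimal sketch of that copy, assuming the table and column names from the question (the Locations2/LocationCache2 names and the index definitions here are only illustrative; mirror whatever your real tables actually have):
-- copy the data into new tables
SELECT * INTO dbo.Locations2 FROM dbo.Locations
SELECT * INTO dbo.LocationCache2 FROM dbo.LocationCache
-- recreate the clustered primary keys on the copies
ALTER TABLE dbo.Locations2 ADD CONSTRAINT PK_Locations2 PRIMARY KEY CLUSTERED (LocationId)
ALTER TABLE dbo.LocationCache2 ADD CONSTRAINT PK_LocationCache2 PRIMARY KEY CLUSTERED (LocationId)
-- recreate the nonclustered indexes (column lists assumed to mirror the originals)
CREATE NONCLUSTERED INDEX IX_Locations2_CountryType ON dbo.Locations2 (CountryId, [Type])
CREATE NONCLUSTERED INDEX IX_LocationCache2_Covering ON dbo.LocationCache2 (LocationId, SearchQuery, SearchRank)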
In short: you do not have a filter on LocationCache, so the whole table's content has to be returned. You have a fully covering index, so a single index SCAN is the cheapest operation, and the query optimizer picks it.
To optimize:
You are joining the whole tables and only afterwards taking the top 100 results. I don't know how big they are, but try filtering the [Locations] table on CountryId and [Type] in a subquery first, and then join just that result with [LocationCache]; see the sketch below. It will be much faster if you have more than 1,000 rows there.
Also, try adding more restrictive filters before the joins if possible.
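A rough sketch of that shape, assuming the column names from the question (whether it actually beats the original plan depends on your row counts and statistics):
SELECT TOP 100 f.LocationId, b.SearchQuery, b.SearchRank
FROM (
    -- restrict Locations first, so the join only ever sees qualifying rows
    SELECT LocationId
    FROM dbo.Locations
    WHERE CountryId = 2
      AND [Type] = 7
) f
INNER JOIN dbo.LocationCache b ON f.LocationId = b.LocationId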
Index Scan: Since a scan touches every row in the table whether or not it qualifies, the cost is proportional to the total number of rows in the table. Thus, a scan is an efficient strategy if the table is small or if most of the rows qualify for the predicate.
Index Seek: Since a seek only touches rows that qualify and pages that contain these qualifying rows, the cost is proportional to the number of qualifying rows and pages rather than to the total number of rows in the table.
If there is an index on a table and the query touches a large amount of data (meaning it retrieves more than roughly 50 to 90 percent of the rows), then the optimizer will simply scan all the data pages to retrieve the data rows.
source
Whilst bearing in mind that it will result in a query that may perform badly as and when additional changes are made to it, using an INNER LOOP JOIN should force the covering index on dbo.LocationCache to be used.
SELECT top 100 a.LocationId, b.SearchQuery, b.SearchRank
FROM dbo.Locations a
INNER LOOP JOIN dbo.LocationCache b ON a.LocationId = b.LocationId
WHERE a.CountryId = 2
AND a.Type = 7
It is using an Index Scan primarily because it is also using a Merge Join. The Merge Join operator requires two input streams that are both sorted in an order that is compatible with the Join conditions.
And it is using the Merge Join operator to implement your INNER JOIN because it believes that will be faster than the more typical Nested Loop Join operator. And it is probably right (it usually is): by using the two indexes it has chosen, it has input streams that are both pre-sorted according to your join condition (LocationID). When the input streams are pre-sorted like this, Merge Joins are almost always faster than the other two (Loop and Hash Joins).
The downside is what you have noticed: it appears to be scanning in the whole index, so how can that be faster if it is reading so many records that may never be used? The answer is that Scans (because of their sequential nature) can read anywhere from 10 to 100 times as many records/second as Seeks.
Now Seeks usually win because they are selective: they only get the rows that you ask for, whereas Scans are non-selective: they must return every row in the range. But because Scans have a much higher read rate, they can frequently beat Seeks as long as the ratio of Discarded Rows to Matching Rows is lower than the ratio of Scan rows/sec VS. Seek rows/sec.
Questions?
OK, I have been asked to explain the last sentence more:
A "Discarded Row" is one that the Scan reads (because it has to read everything in the index) but that will be rejected by the Merge Join operator, because it does not have a match on the other side, possibly because the WHERE clause condition has already excluded it.
"Matching Rows" are the ones that it read that are actually matched to something in the Merge Join. These are the same rows that would have been read by a Seek if the Scan were replaced by a Seek.
You can figure out what these are by looking at the statistics in the Query Plan. See that huge fat arrow to the left of the Index Scan? That represents how many rows the optimizer thinks it will read with the Scan. The statistics box of the Index Scan that you posted shows that the Actual Rows returned is about 5.4M (5,394,402). This is equal to:
TotalScanRows = (MatchingRows + DiscardedRows)
(In my terms, anyway). To get the Matching Rows, look at the "Actual Rows" reported by the Merge Join operator (you may have to take off the TOP 100 to get this accurately). Once you know this, you can get the Discarded rows by:
DiscardedRows = (TotalScanRows - MatchingRows)
And now you can calculate the ratio.
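For instance (purely illustrative numbers, not taken from your plan): if the Scan reads 5,400,000 rows and the Merge Join reports 450,000 Actual Rows, then DiscardedRows = 5,400,000 - 450,000 = 4,950,000, which is a Discarded-to-Matching ratio of 11:1. If the Scan can read, say, 50 times as many rows/sec as a Seek, that ratio is well under 50:1, so the Scan is still the cheaper choice despite all the discarded rows.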
Have you tried to update your statistics?
UPDATE STATISTICS dbo.LocationCache
Here is a good reference on what that does and why the query optimizer will choose a scan over a seek:
http://social.msdn.microsoft.com/Forums/en-CA/sqldatabaseengine/thread/82f49db8-0c77-4bce-b26c-1ad0a4af693b
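If a plain UPDATE STATISTICS does not change the plan, a fuller sample is worth trying. A minimal sketch, assuming the dbo.LocationCache table from the question (the statistics/index name passed to DBCC SHOW_STATISTICS is an assumption; substitute whatever yours is actually called):
-- rebuild the statistics from every row instead of a sample
UPDATE STATISTICS dbo.LocationCache WITH FULLSCAN
-- then inspect how fresh and how detailed the statistics are
DBCC SHOW_STATISTICS ('dbo.LocationCache', PK_LocationCache)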
Summary
There are several things to take into consideration here. Firstly, when SQL decides upon the best (good enough) plan to use, it looks at the query, and then also looks at the statistics that it stores about the tables involved.
It then decides whether it is more efficient to seek down the index or to scan the whole leaf level of the index (in this case, that means touching every page in the table, because it is a clustered index). It does this by looking at a number of things. Firstly, it estimates how many rows/pages it would need to read; the threshold at which a scan becomes cheaper than repeated seeks is called the tipping point, and it is a lower percentage than you may think. See this great Kimberly Tripp blog: http://www.sqlskills.com/BLOGS/KIMBERLY/category/The-Tipping-Point.aspx
If you are within the limits of the tipping point but are still getting a scan, it may be because your statistics are out of date or your index is heavily fragmented.
It is possible to force SQL to seek an index by using the FORCESEEK query hint, but please use this with caution: generally, provided you keep everything well maintained, SQL is pretty good at deciding what the most efficient plan will be! An example of how that hint fits into the query from the question is below.
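A minimal sketch, assuming the tables and filter from the original query (FORCESEEK is a table hint, available from SQL Server 2008 onwards, so it goes in a WITH clause on the table you want the seek on):
SELECT TOP 100 a.LocationId, b.SearchQuery, b.SearchRank
FROM dbo.Locations a
INNER JOIN dbo.LocationCache b WITH (FORCESEEK) ON a.LocationId = b.LocationId
WHERE a.CountryId = 2
AND a.[Type] = 7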