Efficient SQL 2000 Query for Selecting Preferred Candy

Submitted by 南楼画角 on 2019-12-23 21:19:36

Question


(I wish I could have come up with a more descriptive title... suggest one or edit this post if you can name the type of query I'm asking about)

Database: SQL Server 2000

Sample Data (assume 500,000 rows):

Name   Candy        PreferenceFactor
Jim    Chocolate    1.0
Brad   Lemon Drop    .9
Brad   Chocolate     .1
Chris  Chocolate     .5
Chris  Candy Cane    .5
499,995 more rows...

Note that the number of rows with a given 'Name' is unbounded.

Desired Query Results:

Jim    Chocolate    1.0
Brad   Lemon Drop    .9
Chris  Chocolate     .5
~250,000 more rows...

(Since Chris has equal preference for Candy Cane and Chocolate, a consistent result is adequate).

Question: How do I select Name and Candy from the data so that each resulting row contains a unique Name and the selected Candy has the highest PreferenceFactor for that Name? (Fast, efficient answers preferred.)

What indexes are required on the table? Does it make a difference if Name and Candy are integer indexes into another table (aside from requiring some joins)?


Answer 1:


select c.Name, max(c.Candy) as Candy, max(c.PreferenceFactor) as PreferenceFactor
from Candy c
inner join (
    select Name, max(PreferenceFactor) as MaxPreferenceFactor
    from Candy
    group by Name
) cm on c.Name = cm.Name and c.PreferenceFactor = cm.MaxPreferenceFactor
group by c.Name
order by PreferenceFactor desc, Name



Answer 2:


You will find that the following query outperforms every other answer given, as it works with a single scan. This simulates MS Access's First and Last aggregate functions, which is basically what you are doing.

Of course, you'll probably have foreign keys instead of names in your CandyPreference table. To answer your question, it is in fact much better if Candy and Name are foreign keys into another table.

If there are other columns in the CandyPreferences table, then having a covering index that includes the involved columns will yield even better performance. Making the columns as small as possible will increase the rows per page and again increase performance. If you are most often doing the query with a WHERE condition to restrict rows, then an index that covers the WHERE conditions becomes important.
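As a rough sketch of the covering-index idea above: SQL Server 2000 has no INCLUDE clause, so every column the query touches goes into the index key. The permanent table name CandyPreference and the index name are assumptions for illustration, with the column types copied from the temp table used further down.

```sql
-- Hypothetical permanent table, mirroring the temp table in this answer.
CREATE TABLE CandyPreference (
   [Name] varchar(20),
   Candy varchar(30),
   PreferenceFactor decimal(11, 10)
)

-- Composite index whose key holds every column the query reads,
-- so the GROUP BY [Name] query can be answered from the index alone.
-- The index name is an assumption.
CREATE INDEX IX_CandyPreference_Name_Factor_Candy
   ON CandyPreference ([Name], PreferenceFactor, Candy)
```

Whether the optimizer actually uses the index depends on the data and the rest of the table, so check the query plan.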

Peter was on the right track for this, but had some unneeded complexity.

CREATE TABLE #CandyPreference (
   [Name] varchar(20),
   Candy varchar(30),
   PreferenceFactor decimal(11, 10)
)
INSERT #CandyPreference VALUES ('Jim', 'Chocolate', 1.0)
INSERT #CandyPreference VALUES ('Brad', 'Lemon Drop', .9)
INSERT #CandyPreference VALUES ('Brad', 'Chocolate', .1)
INSERT #CandyPreference VALUES ('Chris', 'Chocolate', .5)
INSERT #CandyPreference VALUES ('Chris', 'Candy Cane', .5)

SELECT
   [Name],
   Candy = Substring(PackedData, 13, 30),
   PreferenceFactor = Convert(decimal(11,10), Left(PackedData, 12))
FROM (
   SELECT
      [Name],
      PackedData = Max(Convert(char(12), PreferenceFactor) + Candy)
   FROM #CandyPreference
   GROUP BY [Name]
) X

DROP TABLE #CandyPreference

I actually don't recommend this method unless performance is critical. The "canonical" way to do it is OrbMan's standard Max/GROUP BY derived table, joined back to the table to get the selected row. That method becomes difficult, though, when several columns participate in selecting the Max and the final combination of selectors can be duplicated, that is, when no column provides arbitrary uniqueness. Here the candy name plays that role when the PreferenceFactor is the same.

Edit: It's probably best to give some more usage notes to help improve clarity and to help people avoid problems.

  • As a general rule of thumb, when trying to improve query performance, you can do a LOT of extra math if it will save you I/O. Saving an entire table seek or scan speeds up the query substantially, even with all the converts and substrings and so on.
  • Due to precision and sorting issues, use of a floating point data type is probably a bad idea with this method. Though unless you are dealing with extremely large or small numbers, you shouldn't be using float in your database anyway.
  • The best data types are those that are not packed and sort in the same order after conversion to binary or char. Datetime, smalldatetime, bigint, int, smallint, and tinyint all convert directly to binary and sort correctly because they are not packed. With binary, avoid left() and right(), use substring() to get the values reliably returned to their originals.
  • I took advantage of Preference having only one digit in front of the decimal point in this query, allowing conversion straight to char since there is always at least a 0 before the decimal point. If more digits are possible, you would have to decimal-align the converted number so things sort correctly. Easiest might be to multiply your Preference rating so there is no decimal portion, convert to bigint, and then convert to binary(8). In general, conversion between numbers is faster than conversion between char and another data type, especially with date math.
  • Watch out for nulls. If there are any, you must convert them to something and then back.
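The note on nulls could be sketched roughly as follows, reusing the packed query. Mapping a NULL PreferenceFactor to 0 and a NULL Candy to an empty string is an assumption made for illustration, as are the table name and the 'Pat' rows; choose replacement values that cannot collide with real data.

```sql
-- Hypothetical sample table containing NULLs.
CREATE TABLE #CandyPreferenceNulls (
   [Name] varchar(20),
   Candy varchar(30) NULL,
   PreferenceFactor decimal(11, 10) NULL
)
INSERT #CandyPreferenceNulls VALUES ('Jim', 'Chocolate', 1.0)
INSERT #CandyPreferenceNulls VALUES ('Pat', NULL, .3)
INSERT #CandyPreferenceNulls VALUES ('Pat', 'Lemon Drop', NULL)

SELECT
   [Name],
   -- An empty candy string is turned back into NULL.
   Candy = NullIf(Substring(PackedData, 13, 30), ''),
   -- A NULL factor comes back as 0 here; restoring NULL would need
   -- a replacement value that real data never uses.
   PreferenceFactor = Convert(decimal(11,10), Left(PackedData, 12))
FROM (
   SELECT
      [Name],
      -- Without IsNull, concatenating a NULL yields NULL and Max skips the row.
      PackedData = Max(Convert(char(12), IsNull(PreferenceFactor, 0)) + IsNull(Candy, ''))
   FROM #CandyPreferenceNulls
   GROUP BY [Name]
) X

DROP TABLE #CandyPreferenceNulls
```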



Answer 3:


I tried:

SELECT X.PersonName,
    (
        SELECT TOP 1 Candy
        FROM CandyPreferences
        WHERE PersonName=X.PersonName AND PreferenceFactor=x.HighestPreference
    ) AS TopCandy
FROM 
(
    SELECT PersonName, MAX(PreferenceFactor) AS HighestPreference
    FROM CandyPreferences
    GROUP BY PersonName
) AS X

This seems to work, though I can't speak to efficiency without real data and a realistic load.

I did create a primary key over PersonName and Candy, though. On SQL Server 2008 with no additional indexes, the plan shows two clustered index scans, so it could be worse.


I played with this a bit more because I needed an excuse to play with the Data Generation Plan capability of "datadude". First, I refactored the one table to have separate tables for candy names and person names. I did this mostly because it allowed me to use the test data generation without having to read the documentation. The schema became:

CREATE TABLE [Candies](
    [CandyID] [int] IDENTITY(1,1) NOT NULL,
    [Candy] [nvarchar](50) NOT NULL,
 CONSTRAINT [PK_Candies] PRIMARY KEY CLUSTERED 
(
    [CandyID] ASC
),
 CONSTRAINT [UC_Candies] UNIQUE NONCLUSTERED 
(
    [Candy] ASC
)
)
GO

CREATE TABLE [Persons](
    [PersonID] [int] IDENTITY(1,1) NOT NULL,
    [PersonName] [nvarchar](100) NOT NULL,
 CONSTRAINT [PK_Preferences.Persons] PRIMARY KEY CLUSTERED 
(
    [PersonID] ASC
)
)
GO

CREATE TABLE [CandyPreferences](
    [PersonID] [int] NOT NULL,
    [CandyID] [int] NOT NULL,
    [PrefernceFactor] [real] NOT NULL,
 CONSTRAINT [PK_CandyPreferences] PRIMARY KEY CLUSTERED 
(
    [PersonID] ASC,
    [CandyID] ASC
)
)
GO

ALTER TABLE [CandyPreferences]  
WITH CHECK ADD  CONSTRAINT [FK_CandyPreferences_Candies] FOREIGN KEY([CandyID])
REFERENCES [Candies] ([CandyID])
GO

ALTER TABLE [CandyPreferences] 
CHECK CONSTRAINT [FK_CandyPreferences_Candies]
GO

ALTER TABLE [CandyPreferences]  
WITH CHECK ADD  CONSTRAINT [FK_CandyPreferences_Persons] FOREIGN KEY([PersonID])
REFERENCES [Persons] ([PersonID])
GO

ALTER TABLE [CandyPreferences] 
CHECK CONSTRAINT [FK_CandyPreferences_Persons]
GO

The query became:

SELECT P.PersonName, C.Candy
FROM (
    SELECT X.PersonID,
        (
            SELECT TOP 1 CandyID
            FROM CandyPreferences
            WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
        ) AS TopCandy
    FROM 
    (
        SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
        FROM CandyPreferences
        GROUP BY PersonID
    ) AS X
) AS Y
INNER JOIN Persons P ON Y.PersonID = P.PersonID
INNER JOIN Candies C ON Y.TopCandy = C.CandyID

With 150,000 candies, 200,000 persons, and 500,000 CandyPreferences, the query took about 12 seconds and produced 200,000 rows.


The following result surprised me. I changed the query to remove the final "pretty" joins:

SELECT X.PersonID,
    (
        SELECT TOP 1 CandyID
        FROM CandyPreferences
        WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
    ) AS TopCandy
FROM 
(
    SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
    FROM CandyPreferences
    GROUP BY PersonID
) AS X

This now takes two or three seconds for 200,000 rows.

Now, to be clear, nothing I've done here has been meant to improve the performance of this query: I considered 12 seconds to be a success. It now says it spends 90% of its time in a clustered index seek.




Answer 4:


A comment on Emtucifor's solution (posted as an answer because I can't make regular comments).

I like this solution, but have some comments on how it could be improved (in this specific case).

Not much can be done if everything is in one table, but having several tables, as in John Saunders' solution, changes things a bit.

As we are dealing with numbers in the [CandyPreferences] table, we can use a math operation instead of concatenation to get the max value.

I suggest making PreferenceFactor decimal instead of real, as I believe we don't need the size of the real data type here. Going further, I would suggest decimal(n,n) where n<10, so that only the decimal part is stored, in 5 bytes. Assuming decimal(3,3) is enough (1000 levels of preference factor), we can simply do:

PackedData = Max(PreferenceFactor + CandyID)

Further, if we know we have less than 1,000,000 CandyIDs we can add cast as:

PackedData = Max(Cast(PreferenceFactor + CandyID as decimal(9,3)))

allowing SQL Server to use 5 bytes in the temporary table.

Unpacking is easy and fast using floor function.
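One caution on the packing order, shown as a worked sketch: in `PreferenceFactor + CandyID` the CandyID forms the integer part, so Max compares CandyID before PreferenceFactor. For a person with the rows (CandyID 7, 0.900) and (CandyID 42, 0.100), the packed values are 7.900 and 42.100, and Max picks 42.100. The variant below is an assumption, not one of the tested queries: it puts the preference level in the integer part and CandyID in six fractional digits, which still fits decimal(9,6) under the same limits (PreferenceFactor below 1 with three decimals, CandyID below 1,000,000). The table variable and its rows are hypothetical.

```sql
-- Hypothetical sample rows: one person, two candies.
DECLARE @Prefs TABLE (PersonID int, CandyID int, PreferenceFactor decimal(3,3))
INSERT @Prefs VALUES (1, 7, 0.900)
INSERT @Prefs VALUES (1, 42, 0.100)

SELECT
   PersonID,
   -- The fractional part carries CandyID scaled down by 1,000,000.
   CandyID = Convert(int, (PackedData - Floor(PackedData)) * 1000000),
   -- The integer part carries the preference level (0 to 999).
   PreferenceFactor = Convert(decimal(3,3), Floor(PackedData) / 1000)
FROM (
   SELECT
      PersonID,
      -- Preference in the high-order digits, so Max orders by it first;
      -- CandyID only breaks ties.
      PackedData = Max(Cast(PreferenceFactor * 1000 + CandyID / 1000000.0 as decimal(9,6)))
   FROM @Prefs
   GROUP BY PersonID
) X
```

With these two rows the packed values are 900.000007 and 100.000042, so the row for CandyID 7 is the one selected.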

Niikola

-- ADDED LATER ---

I tested both solutions, John's and Emtucifor's (modified to use John's structure and using my suggestions). I tested also with and without joins.

Emtucifor's solution clearly wins, but margins are not huge. It could be different if SQL server had to perform some Physical reads, but they were 0 in all cases.

Here are the queries:

SELECT
   [PersonID],
   CandyID = Floor(PackedData),
   PreferenceFactor = Cast(PackedData-Floor(PackedData) as decimal(3,3))
FROM (
   SELECT
      [PersonID],
      PackedData = Max(Cast([PrefernceFactor] + [CandyID] as decimal(9,3)))
   FROM [z5CandyPreferences] With (NoLock)
   GROUP BY [PersonID]
) X

SELECT X.PersonID,
        (
                SELECT TOP 1 CandyID
                FROM z5CandyPreferences
                WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
        ) AS TopCandy,
                    HighestPreference as PreferenceFactor
FROM 
(
        SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
        FROM z5CandyPreferences
        GROUP BY PersonID
) AS X


Select p.PersonName,
       c.Candy,
       y.PreferenceFactor
  From z5Persons p
 Inner Join (SELECT [PersonID],
                    CandyID = Floor(PackedData),
                    PreferenceFactor = Cast(PackedData-Floor(PackedData) as decimal(3,3))
                    FROM ( SELECT [PersonID],
                                  PackedData = Max(Cast([PrefernceFactor] + [CandyID] as decimal(9,3)))
                             FROM [z5CandyPreferences] With (NoLock)
                            GROUP BY [PersonID]
                         ) X
            ) Y on p.PersonId = Y.PersonId
 Inner Join z5Candies c on c.CandyId=Y.CandyId

Select p.PersonName,
       c.Candy,
       y.PreferenceFactor
  From z5Persons p
 Inner Join (SELECT X.PersonID,
                    ( SELECT TOP 1 cp.CandyId
                        FROM z5CandyPreferences cp
                       WHERE PersonID=X.PersonID AND cp.[PrefernceFactor]=X.HighestPreference
                    ) CandyId,
                    HighestPreference as PreferenceFactor
               FROM ( SELECT PersonID, 
                             MAX(PrefernceFactor) AS HighestPreference
                        FROM z5CandyPreferences
                       GROUP BY PersonID
                    ) AS X
            ) AS Y on p.PersonId = Y.PersonId
 Inner Join z5Candies as c on c.CandyID=Y.CandyId

And the results:

 TableName          nRows
 ------------------ -------
 z5Persons          200,000
 z5Candies          150,000
 z5CandyPreferences 497,445


Query                       Rows Affected CPU time Elapsed time
--------------------------- ------------- -------- ------------
Emtucifor     (no joins)          183,289   531 ms     3,122 ms
John Saunders (no joins)          183,289 1,266 ms     2,918 ms
Emtucifor     (with joins)        183,289 1,031 ms     3,990 ms
John Saunders (with joins)        183,289 2,406 ms     4,343 ms


Emtucifor (no joins)
--------------------------------------------
Table               Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences           1         2,022 


John Saunders (no joins)
--------------------------------------------
Table               Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences     183,290       587,677

Emtucifor (with joins)
--------------------------------------------
Table               Scan count logical reads
------------------- ---------- -------------
Worktable                    0             0
z5Candies                    1           526
z5CandyPreferences           1         2,022
z5Persons                    1           733

John Saunders (with joins) 
--------------------------------------------
Table               Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences     183,292       587,912
z5Persons                    3           802
Worktable                    0             0
z5Candies                    3           559
Worktable                    0             0



Answer 5:


You could use the following select statement:

select Name,Candy,PreferenceFactor
from candyTable ct 
where PreferenceFactor = 
    (select max(PreferenceFactor) 
     from candyTable where ct.Name = Name)

But with this select you will get "Chris" twice in your result set.

If you want the most preferred candy for a single user, use:

select top 1 Name,Candy,PreferenceFactor
from candyTable ct
where name = @name
and PreferenceFactor= 
    (select max([PreferenceFactor]) 
     from candyTable where name = @name )

I think changing name and candy to integer types might help you improve performance. You should also create indexes on both columns.

[Edit] changed ! to @




Answer 6:


SELECT Name, Candy, PreferenceFactor
  FROM [table] AS a
 WHERE NOT EXISTS(SELECT * FROM [table] AS b
                   WHERE b.Name = a.Name
                     AND (b.PreferenceFactor > a.PreferenceFactor OR (b.PreferenceFactor = a.PreferenceFactor AND b.Candy > a.Candy)))



Answer 7:


select name, candy, max(PreferenceFactor) as preference
from tablename
where candy=@candy
group by name, candy
order by name, candy

Usually, indexing is required on columns that are frequently included in the WHERE clause. In this case I would say indexes on the name and candy columns would be the highest priority.

Whether to use lookup tables usually depends on the number of repeating values within a column. If only 50 values repeat across 250,000 rows, you really should have an integer reference (foreign key) there. In this case, candy should be a reference, and whether name should be one depends on the number of distinct people in the database.




Answer 8:


I changed your column Name to PersonName to avoid any common reserved word conflicts.

SELECT     PersonName, MAX(Candy) AS PreferredCandy, MAX(PreferenceFactor) AS Factor
FROM         CandyPreference
GROUP BY PersonName
ORDER BY Factor DESC



Answer 9:


SELECT d.Name, a.Candy, d.MaxPref
FROM myTable a, (SELECT Name, MAX(PreferenceFactor) AS MaxPref FROM myTable GROUP BY Name) as D
WHERE a.Name = d.Name AND a.PreferenceFactor = d.MaxPref

This should give you the rows whose PreferenceFactor matches the maximum for a given Name, including ties (e.g. if John has a top preference of 1 for both Lemon and Chocolate).

Pardon my answer as I am writing it without SQL Query Analyzer.




Answer 10:


Something like this would work:

select name
, candy  = substring(preference,7,len(preference))
  -- convert back to float/numeric
, factor = convert(float,substring(preference,1,5))/10
from (
  select name, 
    preference = (
      select top 1 
           -- convert from float/numeric to zero-padded fixed-width string
           right('00000'+convert(varchar,convert(decimal(5,0),preferencefactor*10)),5)
         + ';' + candy
       from candyTable b
       where a.name = b.name
       order by 
         preferencefactor desc
       , candy
       )
  from (select distinct name from candyTable) a
  ) a

Performance should be decent with this method. Check your query plan.

TOP 1 ... ORDER BY in a correlated subquery allows us to specify arbitrary rules for which row we want returned per row in the outer query. In this case, we want the highest preference factor per name, with candy for tie-breaks.
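Stripped down to a single returned value, the TOP 1 ... ORDER BY pattern might look like the sketch below. The temp table #candyTable and its decimal(3,1) column type are assumptions for illustration; the rows are the sample data from the question.

```sql
-- Hypothetical temp table holding the question's sample rows.
CREATE TABLE #candyTable (
   name varchar(20),
   candy varchar(30),
   preferencefactor decimal(3,1)
)
INSERT #candyTable VALUES ('Jim', 'Chocolate', 1.0)
INSERT #candyTable VALUES ('Brad', 'Lemon Drop', .9)
INSERT #candyTable VALUES ('Brad', 'Chocolate', .1)
INSERT #candyTable VALUES ('Chris', 'Chocolate', .5)
INSERT #candyTable VALUES ('Chris', 'Candy Cane', .5)

SELECT a.name,
       candy = (
          -- One candy per outer row: highest factor first,
          -- then candy name to break ties.
          SELECT TOP 1 b.candy
          FROM #candyTable b
          WHERE b.name = a.name
          ORDER BY b.preferencefactor DESC, b.candy
       )
FROM (SELECT DISTINCT name FROM #candyTable) a

DROP TABLE #candyTable
```

For Chris, whose two candies tie at .5, the candy tie-break in the ORDER BY selects Candy Cane.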

Subqueries can only return one value, so we must combine candy and preference factor into one field. The semicolon is just for readability here, but in other cases, you might use it to parse the combined field with CHARINDEX in the outer query.

If you wanted full precision in the output, you could use this instead (assuming preferencefactor is a float):

convert(varchar,preferencefactor) + ';' + candy

And then parse it back with:

factor = convert(float,substring(preference,1,charindex(';',preference)-1))
candy = substring(preference,charindex(';',preference)+1,len(preference))



Answer 11:


I also tested a ROW_NUMBER() version (which requires SQL Server 2005 or later) and added an additional index:

Create index IX_z5CandyPreferences On z5CandyPreferences(PersonId,PrefernceFactor,CandyID)

The difference in response times between Emtucifor's version and the ROW_NUMBER() version (with the index in place) is marginal, if any. The test should be repeated a number of times and averaged, but I would not expect a significant difference.

Here is the query:

Select p.PersonName,
       c.Candy,
       y.PrefernceFactor
  From z5Persons p
 Inner Join (Select * from (Select cp.PersonId,
       cp.CandyId,
       cp.PrefernceFactor,
       ROW_NUMBER() over (Partition by cp.PersonId Order by cp.PrefernceFactor Desc, cp.CandyId ) as hp
  From z5CandyPreferences cp) X
   Where hp=1) Y on p.PersonId = Y.PersonId
 Inner Join z5Candies c on c.CandyId=Y.CandyId

And the results with and without the new index:

                           |     Without index    |      With Index
                           ----------------------------------------------
Query (Aff.Rows 183,290)   |CPU time Elapsed time | CPU time Elapsed time
-------------------------- |-------- ------------ | -------- ------------
Emtucifor     (with joins) |1,031 ms     3,990 ms |   890 ms     3,758 ms
John Saunders (with joins) |2,406 ms     4,343 ms | 1,735 ms     3,414 ms
ROW_NUMBER()  (with joins) |2,094 ms     4,888 ms |   953 ms     3,900 ms


Emtucifor (with joins)         Without index |              With Index
-----------------------------------------------------------------------
Table              |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
Worktable          |         0             0 |          0             0
z5Candies          |         1           526 |          1           526
z5CandyPreferences |         1         2,022 |          1           990
z5Persons          |         1           733 |          1           733

John Saunders (with joins)     Without index |              With Index
-----------------------------------------------------------------------
Table              |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
z5CandyPreferences |   183,292       587,912 |    183,290       585,570
z5Persons          |         3           802 |          1           733
Worktable          |         0             0 |          0             0
z5Candies          |         3           559 |          1           526
Worktable          |         0             0 |          -             -


ROW_NUMBER() (with joins)      Without index |              With Index 
-----------------------------------------------------------------------
Table              |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
z5CandyPreferences |         3         2,233 |          1           990
z5Persons          |         3           802 |          1           733
z5Candies          |         3           559 |          1           526
Worktable          |         0             0 |          0             0


Source: https://stackoverflow.com/questions/1055274/efficient-sql-2000-query-for-selecting-preferred-candy
