Question
I'm having some trouble dealing with a very big table in my database. Before talking about the problem, let me explain what I want to achieve.
I have two source tables and one target table:
- Source 1: SALES_MAN (ID_SMAN, SM_LATITUDE, SM_LONGITUDE)
- Source 2: CLIENT (ID_CLIENT, CLATITUDE, CLONGITUDE)
- Target: DISTANCE (ID_SMAN, ID_CLIENT, SM_LATITUDE, SM_LONGITUDE, CLATITUDE, CLONGITUDE, DISTANCE)
The idea is to find the top N nearest SALES_MAN for every client, using a ROW_NUMBER over the target table.
What I'm currently doing is calculating the distance between every client and every salesman:
INSERT INTO DISTANCE ([ID_SMAN], [ID_CLIENT], [DISTANCE],
                      [SM_LATITUDE], [SM_LONGITUDE], [CLATITUDE], [CLONGITUDE])
SELECT
    [ID_SMAN], [ID_CLIENT],
    -- distance in kilometres; WKT POINT takes longitude first, then latitude
    geography::STGeomFromText('POINT(' + [CLONGITUDE] + ' ' + [CLATITUDE] + ')', 4326).STDistance(geography::STGeomFromText('POINT(' + [SM_LONGITUDE] + ' ' + [SM_LATITUDE] + ')', 4326)) / 1000 AS distance,
    [SM_LATITUDE], [SM_LONGITUDE], [CLATITUDE], [CLONGITUDE]
FROM
    [dbo].[SALES_MAN] CROSS JOIN [dbo].[CLIENT]
The DISTANCE table contains approximately 1 billion rows.
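That is roughly what the cross join implies: |CLIENT| × |SALES_MAN| rows, so for instance 100,000 clients × 10,000 salesmen would already give 10^9 rows (the example counts are only illustrative).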
The second step, to get my 5 nearest salesmen per client, is to run this query:
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER(PARTITION BY ID_CLIENT ORDER BY DISTANCE) rang
    FROM DISTANCE
) TAB
WHERE rang < 6
The last query is really expensive. So, to avoid the SORT operator, I tried to create a sorted nonclustered index on DISTANCE and ID_CLIENT, but it did not work. I also tried to include all the needed columns in both indexes.
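For reference, the kind of sorted nonclustered index I tried looks roughly like this (the name and the INCLUDE list are illustrative):

CREATE NONCLUSTERED INDEX IX_DISTANCE_CLIENT_DISTANCE
    ON dbo.DISTANCE (ID_CLIENT, DISTANCE)
    INCLUDE (ID_SMAN, SM_LATITUDE, SM_LONGITUDE, CLATITUDE, CLONGITUDE);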
But when I created a clustered index on the DISTANCE column and kept the nonclustered sorted index on ID_CLIENT, things went better.
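A sketch of that combination (the index names are just illustrative):

CREATE CLUSTERED INDEX CIX_DISTANCE_DISTANCE ON dbo.DISTANCE (DISTANCE);
CREATE NONCLUSTERED INDEX IX_DISTANCE_ID_CLIENT ON dbo.DISTANCE (ID_CLIENT);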
So why is a sorted nonclustered index not working in this case?
But when I use the clustered index, I have another problem when loading data, and I'm more or less forced to drop it before starting the loading process.
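The load pattern I end up with is roughly this (same illustrative index name as above):

-- drop the clustered index before the bulk load, then rebuild it afterwards
DROP INDEX CIX_DISTANCE_DISTANCE ON dbo.DISTANCE;
-- ... run the INSERT INTO DISTANCE ... load here ...
CREATE CLUSTERED INDEX CIX_DISTANCE_DISTANCE ON dbo.DISTANCE (DISTANCE);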
So what do you think? And how can we deal with this kind of table so that we can select, insert, or update data without performance issues?
Many thanks
Answer 1:
Too long for a comment, but consider the following points.
Item 1) Consider adding a Geography field to each of your source tables. This will eliminate the redundant GEOGRAPHY::Point() function calls:
Update YourTable Set GeoPoint = GEOGRAPHY::Point([Lat], [Lng], 4326)
So then the calculation for distance would simply be
,InMeters = C.GeoPoint.STDistance(S.GeoPoint)
,InMiles = C.GeoPoint.STDistance(S.GeoPoint) / 1609.344
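Applied to the tables in the question, that might look like this (GeoPoint is a new column; if the latitude/longitude columns are stored as text, convert them first):

ALTER TABLE dbo.SALES_MAN ADD GeoPoint geography;
ALTER TABLE dbo.CLIENT ADD GeoPoint geography;
GO
-- geography::Point takes (latitude, longitude, SRID)
UPDATE dbo.SALES_MAN SET GeoPoint = geography::Point(SM_LATITUDE, SM_LONGITUDE, 4326);
UPDATE dbo.CLIENT SET GeoPoint = geography::Point(CLATITUDE, CLONGITUDE, 4326);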
Item 2) Rather than generating EVERY possible combination, consider adding a condition to the JOIN. Keep in mind that one degree of latitude is approximately 69 miles (a degree of longitude is at most that, shrinking toward the poles), so you can reduce the search area. For example:
From CLIENT C
Join SALES_MAN S
on S.Lat between C.Lat-1 and C.Lat+1
and S.Lng between C.Lng-1 and C.Lng+1
This +/- 1 could be any reasonable value (i.e. 0.5 or even 2.0).
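Putting Item 1 and Item 2 together against the question's tables, a sketch of the pre-filtered distance calculation could look like this (it assumes the GeoPoint columns from Item 1 and numeric latitude/longitude columns):

SELECT  S.ID_SMAN,
        C.ID_CLIENT,
        C.GeoPoint.STDistance(S.GeoPoint) / 1000.0 AS DISTANCE   -- kilometres
FROM    dbo.CLIENT C
JOIN    dbo.SALES_MAN S
    ON  S.SM_LATITUDE  BETWEEN C.CLATITUDE  - 1 AND C.CLATITUDE  + 1
    AND S.SM_LONGITUDE BETWEEN C.CLONGITUDE - 1 AND C.CLONGITUDE + 1;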
Answer 2:
ROW_NUMBER is a window function that has to see all of the rows covered by its ORDER BY column, so it is better to filter your result before applying ROW_NUMBER. You have to change the following code:
SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY ID_CLIENT ORDER BY DISTANCE) rang
    FROM DISTANCE
) TAB
WHERE rang < 6
into this:
WITH DISTANCE_CLIENT_IDS (ID_CLIENT) AS
(
    SELECT DISTINCT ID_CLIENT
    FROM DISTANCE
)
SELECT Dx.*
FROM DISTANCE_CLIENT_IDS D1
CROSS APPLY
(
    -- take the 5 smallest distances for this client, then number them
    SELECT Dt.*, ROW_NUMBER() OVER (ORDER BY Dt.DISTANCE) AS rang
    FROM (
        SELECT TOP(5) *
        FROM DISTANCE D2
        WHERE D2.ID_CLIENT = D1.ID_CLIENT
        ORDER BY D2.DISTANCE
    ) Dt
) Dx
Also make sure you have added indexes on both the ID_CLIENT and DISTANCE columns.
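As a design note: CROSS APPLY is used here (rather than a plain comma join) so that the inner query can reference D1.ID_CLIENT, and the TOP(5) needs its ORDER BY DISTANCE so that the five nearest rows, not five arbitrary ones, are returned.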
Source: https://stackoverflow.com/questions/45869613/handling-very-big-table-in-sql-server-performance