Backstory: I have a database that has data points of drivers in trucks which also contain the. While in a truck, the driver can have a 'driverstatus'. What I'd like to do is group these statuses by driver, truck.
As of now, I've tried using LAG/LEAD to help. The reason for this is so I can tell when a driverstatus change occurs, and then I can mark that row as having the last datetime of that status.
That in itself is insufficient, because I need to group the statuses by their status and date. For this, I've got something such as DENSE_RANK, but I can't manage to get that right concerning the ORDER BY clause.
Here is my test data, and here is one attempt of many of me floundering with ranking.
/****** Script for SelectTopNRows command from SSMS ******/
DECLARE @SomeTable TABLE
(
loginId VARCHAR(255),
tractorId VARCHAR(255),
messageTime DATETIME,
driverStatus VARCHAR(2)
);
INSERT INTO @SomeTable (loginId, tractorId, messageTime, driverStatus)
VALUES('driver35','23533','2018-08-10 8:33 AM','2'),
('driver35','23533','2018-08-10 8:37 AM','2'),
('driver35','23533','2018-08-10 8:56 AM','2'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 9:07 AM','1'),
('driver35','23533','2018-08-10 9:04 AM','1'),
('driver35','23533','2018-08-12 8:07 AM','3'),
('driver35','23533','2018-08-12 8:37 AM','3'),
('driver35','23533','2018-08-12 9:07 AM','3'),
('driver35','23533','2018-06-12 8:07 AM','2'),
('driver35','23533','2018-06-12 8:37 AM','2'),
('driver35','23533','2018-06-12 9:07 AM','2')
;
SELECT *, DENSE_RANK() OVER(PARTITION BY
loginId, tractorId, driverStatus
ORDER BY messageTime ) FROM @SomeTable
;
My end result would ideally look something like this:
loginId   tractorId  startTime           endTime             driverStatus
driver35  23533      2018-08-10 8:33 AM  2018-08-10 8:56 AM  2
driver35  23533      2018-08-10 8:57 AM  2018-08-10 9:07 AM  1
driver35  23533      2018-08-12 8:07 AM  2018-08-12 9:07 AM  3
Any help on this is greatly appreciated.
WITH drivers_data AS
(
SELECT *,
row_num = ROW_NUMBER()
OVER (PARTITION BY loginId,
tractorId,
CAST(messageTime AS date),
driverStatus
ORDER BY messageTime),
row_num_all = ROW_NUMBER()
OVER (PARTITION BY loginId,
tractorId
ORDER BY messageTime),
first_date = FIRST_VALUE (messageTime)
OVER (PARTITION BY loginId,
tractorId,
CAST(messageTime AS date),
driverStatus
ORDER BY messageTime),
last_date = LAST_VALUE (messageTime)
OVER (PARTITION BY loginId,
tractorId,
CAST(messageTime AS date),
driverStatus
ORDER BY messageTime
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM @SomeTable
)
SELECT loginId, tractorId, first_date, last_date, driverStatus
FROM drivers_data
WHERE row_num = 1
ORDER BY row_num_all;
OUTPUT:
+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:57:00 | 2018-10-08 09:07:00 | 1            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-12-06 08:07:00 | 2018-12-06 09:07:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-12-08 08:07:00 | 2018-12-08 09:07:00 | 3            |
+----------+-----------+---------------------+---------------------+--------------+
I will try to explain what's going on here:
- row_num: Numbers the rows within each combination of driver, truck, date, and status. The cast is needed because we want the date part without the time.
- row_num_all: This is the key attribute, since it lets us sort the final rows by order of occurrence. This window is not restricted by status because we need numbering across the driver's whole data.
- first_date: FIRST_VALUE is a handy function for our purpose. It simply retrieves the first datetime in the window.
- last_date: It's correct to assume that for the last date we need the LAST_VALUE window function, but using it is tricky and requires more explanation. As you can see, I explicitly use the frame ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. But why? Let's take the part of the output for date 10/8/2018 and status 2 with the default framing. We get the following results:
+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2            |
+----------+-----------+---------------------+---------------------+--------------+
As you can see, the last date is incorrect! This happens because LAST_VALUE uses the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which means the last row of the frame is always the current row (or, with RANGE, the last row that ties with it on the ORDER BY value). Here's what happens under the hood: each of the three rows gets its own window, and LAST_VALUE retrieves the last row from that window:
Window for the 1st row
+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
+----------+-----------+---------------------+---------------------+--------------+
Window for the 2nd row
+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2            |
+----------+-----------+---------------------+---------------------+--------------+
Window for the 3rd row
+==========+===========+=====================+=====================+==============+
| loginId  | tractorId | first_date          | last_date           | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2            |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533     | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2            |
+----------+-----------+---------------------+---------------------+--------------+
So the solution is to change the frame: instead of running from the start of the partition to the current row, it must run from the current row to the end of the partition. UNBOUNDED FOLLOWING means exactly that - the last row of the current partition.
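To check the frame behaviour without SQL Server at hand, here is a minimal sketch using Python's built-in sqlite3 module. It assumes a SQLite build with window-function support (3.25 or later), whose default frame is the same RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW; the msgs table and the column aliases are made up for illustration, and the three timestamps are the status 2 rows from the output above.

```python
import sqlite3

# The three status-2 rows discussed above (timestamps as shown in the output).
rows = [
    ("2018-10-08 08:33:00",),
    ("2018-10-08 08:37:00",),
    ("2018-10-08 08:56:00",),
]

conn = sqlite3.connect(":memory:")
# "msgs" is a hypothetical table name used only for this sketch.
conn.execute("CREATE TABLE msgs (messageTime TEXT)")
conn.executemany("INSERT INTO msgs VALUES (?)", rows)

query = """
SELECT messageTime,
       -- no frame given: the default frame ends at the current row
       LAST_VALUE(messageTime) OVER (ORDER BY messageTime) AS default_frame,
       -- explicit frame: from the current row to the end of the partition
       LAST_VALUE(messageTime) OVER (
           ORDER BY messageTime
           ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
       ) AS explicit_frame
FROM msgs
ORDER BY messageTime
"""
result = conn.execute(query).fetchall()

for message_time, default_frame, explicit_frame in result:
    print(message_time, default_frame, explicit_frame)
```

With the default frame each row simply gets its own timestamp back, while the explicit frame returns the last timestamp of the partition for every row.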
Next is WHERE row_num = 1. This is simple: since all rows in a group carry the same first date and last date, we only need the first row of each group. The final part is ORDER BY row_num_all. This is where you get the correct ordering.
P.S.
Your desired output in the question is incorrect. For date 8/10/18 8:57 AM and status 1, the last date must be 10/8/2018 9:07 AM - not 10/8/2018 9:04 AM, as you mentioned. Also, the output for date 12/6/2018 and status 2 is missing.
UPDATE:
Here are illustrations of how FIRST_VALUE and LAST_VALUE work.
All three figures have the following parts:
- Query data: the result of the query.
- Original query: the original source data.
- Windows: the intermediate steps of the calculation.
- Frame: states which frame is used.
- Green cell: the window specification.
Here's what's happening under the hood:
- First, SQL Server creates partitions from the fields listed in PARTITION BY. On the figures this is the partition column.
- Each partition can have a frame, either default or custom. The default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which means each row gets a window from the start of the partition to the current row. If you don't specify a frame, the default frame comes into play.
- Each frame creates a window for each row. On the figures these windows are in columns row 1 to row 2 and are marked with color. The row number corresponds to the row_num_all field.
- A row operates only within the bounds of its window.
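The per-row windows described in the list above can be modelled in a few lines of plain Python. This is only a rough sketch, not how SQL Server works internally: it treats one partition as a sorted list and builds each row's frame as a slice, which matches the SQL frames only under the assumption that the ORDER BY values are distinct. All variable names are made up for illustration.

```python
# Times of the three status-2 rows of one partition, already in ORDER BY order.
times = ["08:33", "08:37", "08:56"]

# Default frame: from the start of the partition up to the current row.
# (Modelled row by row; with distinct times, RANGE and ROWS frames coincide.)
default_frames = [times[: i + 1] for i in range(len(times))]

# Custom frame: ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING.
following_frames = [times[i:] for i in range(len(times))]

for i, current in enumerate(times):
    print(current, "default:", default_frames[i], "custom:", following_frames[i])

# FIRST_VALUE / LAST_VALUE simply pick the first / last element of the frame.
first_values = [frame[0] for frame in default_frames]
last_default = [frame[-1] for frame in default_frames]
last_custom = [frame[-1] for frame in following_frames]
```

Here first_values is the same for every row, last_default just echoes each row's own time, and last_custom is the partition's final time for every row.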
1. FIRST_VALUE
To get the first date, we can use the handy FIRST_VALUE window function.
As you can see, we use the default frame here. This means that for each row the window runs from the start of the partition to the current row. For getting the first date this is just what we need: each row fetches the value from the first row. The first date is in the first_date field.
2. LAST_VALUE - incorrect frame
Now we need to calculate the last date. The last date is in the last row of the partition, so we can use the LAST_VALUE window function.
As I mentioned earlier, if we don't specify a frame, the default frame is used. As you can see on the figure, the frame always ends at the current row. This is incorrect, because we need the date from the last row of the partition. The last_date field shows incorrect results: it reflects the date from the current row.
3. LAST_VALUE - correct frame
To fix the last date, we need to change the frame that LAST_VALUE operates on to ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. As you can see, the window for each row now runs from the current row to the end of the partition. In this case LAST_VALUE correctly fetches the date from the last row of the window, and the result in the last_date field is correct.
The solution below identifies each time an island starts (when driverStatus changes) within each loginId / tractorId combination, and then assigns an "id" number to that island.
After that, it's a simple min / max to find when that island starts and ends.
Answer:
select b.loginId
, b.tractorId
, min(b.messageTime) as startTime
, max(b.messageTime) as endTime
, b.driverStatus
from (
select a.loginId
, a.tractorId
, a.messageTime
, a.driverStatus
, a.is_island_start_flg
, sum(a.is_island_start_flg) over (partition by a.loginID, a.tractorID order by a.messageTime asc) as island_nbr --assigning the "id" number to the island
from (
select st.loginId
, st.tractorId
, st.messageTime
, st.driverStatus
, iif(lag(st.driverStatus, 1, st.driverStatus) over (partition by st.loginID, st.tractorId order by st.messageTime asc) = st.driverStatus, 0, 1) as is_island_start_flg --identifying start of island
from @SomeTable as st
) as a
) as b
group by b.loginId
, b.tractorId
, b.driverStatus
, b.island_nbr --purposefully in the group by, to make sure each occurrence of a status is in final results
order by b.loginId asc
, b.tractorId asc
, min(b.messageTime) asc
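The flag-and-running-sum idea used in the query above can be traced step by step in plain Python. This is a rough sketch under stated assumptions, not a replacement for the SQL: it handles a single loginId / tractorId pair, uses the question's sample rows without the last three, and writes the timestamps as 24-hour strings so that they sort correctly as text.

```python
from itertools import accumulate, groupby

# (messageTime, driverStatus) for one driver/truck pair, sorted by time.
rows = [
    ("2018-08-10 08:33", "2"),
    ("2018-08-10 08:37", "2"),
    ("2018-08-10 08:56", "2"),
    ("2018-08-10 08:57", "1"),
    ("2018-08-10 08:57", "1"),
    ("2018-08-10 08:57", "1"),
    ("2018-08-10 09:04", "1"),
    ("2018-08-10 09:07", "1"),
    ("2018-08-12 08:07", "3"),
    ("2018-08-12 08:37", "3"),
    ("2018-08-12 09:07", "3"),
]

statuses = [status for _, status in rows]

# LAG(driverStatus, 1, driverStatus): the previous row's status,
# defaulting to the row's own status for the very first row.
previous = [statuses[0]] + statuses[:-1]

# 1 marks the start of a new island, 0 means "same status as before".
flags = [0 if prev == cur else 1 for prev, cur in zip(previous, statuses)]

# Running sum of the flags gives every row its island number.
island_nbr = list(accumulate(flags))

# MIN / MAX of messageTime per island.
islands = []
for _, group in groupby(zip(island_nbr, rows), key=lambda pair: pair[0]):
    members = [row for _, row in group]
    member_times = [t for t, _ in members]
    islands.append((min(member_times), max(member_times), members[0][1]))

for island in islands:
    print(island)
```

Because the island number only increases when the status changes, a status that shows up again later gets a new number and therefore its own row in the result.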
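When you leave off the last three records of the sample data (they are not in the expected output of the question, just as JohnyL said), this query produces the exact output from the question. Note that it groups by driverStatus alone, so it only works while each status occurs in a single run per driver and truck; a status that recurs later would be merged into one row.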
SELECT
t.loginId,
t.tractorId,
startTime = MIN(messageTime),
endTime = MAX(messageTime),
driverStatus
FROM @someTable t
GROUP BY loginId, tractorId, driverStatus
ORDER BY MIN(messageTime);
Results:
loginId tractorId startTime endTime driverStatus
-------------- ---------- ----------------------- ----------------------- ------------
driver35 23533 2018-10-08 08:33:00.000 2018-10-08 08:56:00.000 2
driver35 23533 2018-10-08 08:57:00.000 2018-10-08 09:07:00.000 1
driver35 23533 2018-12-08 08:07:00.000 2018-12-08 09:07:00.000 3
Source: https://stackoverflow.com/questions/51846536/islands-and-gaps-issue