The SQL OVER() clause - when and why is it useful?

有刺的猬 2020-11-28 17:49
    USE AdventureWorks2008R2;
    GO
    SELECT SalesOrderID, ProductID, OrderQty
        ,SUM(OrderQty) OVER(PARTITION BY SalesOrderID) AS 'Total'
        ,AVG(OrderQty) OVER(PAR


        
8 Answers
  • 2020-11-28 18:12

    You can use GROUP BY SalesOrderID. The difference is that, with GROUP BY, any column not listed in the GROUP BY clause can appear in the result only as an aggregated value.

    In contrast, using windowed aggregate functions instead of GROUP BY, you can retrieve both aggregated and non-aggregated values. That is, although you are not doing that in your example query, you could retrieve both individual OrderQty values and their sums, counts, averages etc. over groups of same SalesOrderIDs.

    Here's a practical example of why windowed aggregates are great. Suppose you need to calculate what percent of a total every value is. Without windowed aggregates you'd have to first derive a list of aggregated values and then join it back to the original rowset, i.e. like this:

    SELECT
      orig.[Partition],
      orig.Value,
      orig.Value * 100.0 / agg.TotalValue AS ValuePercent
    FROM OriginalRowset orig
      INNER JOIN (
        SELECT
          [Partition],
          SUM(Value) AS TotalValue
        FROM OriginalRowset
        GROUP BY [Partition]
      ) agg ON orig.[Partition] = agg.[Partition]
    

    Now look how you can do the same with a windowed aggregate:

    SELECT
      [Partition],
      Value,
      Value * 100.0 / SUM(Value) OVER (PARTITION BY [Partition]) AS ValuePercent
    FROM OriginalRowset orig
    

    Much easier and cleaner, isn't it?
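
    As a sanity check, both queries can be run against a few made-up rows. This is only a sketch under stated assumptions, not part of the original answer: it uses Python's built-in sqlite3 module (SQLite 3.25 or later is needed for window functions) rather than SQL Server, the sample rows are hypothetical, and the column is written as "Partition" in double quotes because PARTITION is a keyword.

```python
import sqlite3

# Hypothetical sample rows standing in for OriginalRowset.
rows = [("A", 10), ("A", 30), ("B", 5), ("B", 15), ("B", 20)]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE OriginalRowset ("Partition" TEXT, Value INTEGER)')
conn.executemany("INSERT INTO OriginalRowset VALUES (?, ?)", rows)

# Approach 1: aggregate in a derived table, then join back to the detail rows.
join_sql = """
SELECT orig."Partition", orig.Value,
       orig.Value * 100.0 / agg.TotalValue AS ValuePercent
FROM OriginalRowset orig
  INNER JOIN (
    SELECT "Partition", SUM(Value) AS TotalValue
    FROM OriginalRowset
    GROUP BY "Partition"
  ) agg ON orig."Partition" = agg."Partition"
ORDER BY orig."Partition", orig.Value
"""

# Approach 2: windowed aggregate, no join needed.
window_sql = """
SELECT "Partition", Value,
       Value * 100.0 / SUM(Value) OVER (PARTITION BY "Partition") AS ValuePercent
FROM OriginalRowset
ORDER BY "Partition", Value
"""

join_result = conn.execute(join_sql).fetchall()
window_result = conn.execute(window_sql).fetchall()

for row in window_result:
    print(row)
print("same result:", join_result == window_result)
```

    Both statements return one row per original row, each carrying its share of the partition total; the outer ORDER BY is there only to make the two result lists directly comparable.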

  • 2020-11-28 18:12

    In simple words: the OVER clause lets you select non-aggregated values alongside aggregated ones.

    PARTITION BY, ORDER BY, and ROWS or RANGE are all parts of the OVER() clause.

    PARTITION BY splits the data into partitions, and the window or aggregate function is then evaluated within each one. If PARTITION BY is omitted, the entire result set is treated as a single partition.

    The OVER clause can be used with ranking functions (RANK, ROW_NUMBER, DENSE_RANK, ...), aggregate functions (AVG, MAX, MIN, SUM, etc.), and analytic functions (FIRST_VALUE, LAST_VALUE, and a few others).

    Here is the basic syntax of the OVER clause:

    OVER (   
           [ <PARTITION BY clause> ]  
           [ <ORDER BY clause> ]   
           [ <ROW or RANGE clause> ]  
          )  
    

    PARTITION BY: It is used to partition data and perform operations on groups with the same data.

    ORDER BY: It defines the logical order of the rows within each partition. When no partition is specified, the entire result set is treated as a single partition.

    ROWS or RANGE: It limits which rows within the partition are considered when the function is evaluated for the current row.

    Let's take an example:

    Here is my dataset:

    Id          Name                                               Gender     Salary
    ----------- -------------------------------------------------- ---------- -----------
    1           Mark                                               Male       5000
    2           John                                               Male       4500
    3           Pavan                                              Male       5000
    4           Pam                                                Female     5500
    5           Sara                                               Female     4000
    6           Aradhya                                            Female     3500
    7           Tom                                                Male       5500
    8           Mary                                               Female     5000
    9           Ben                                                Male       6500
    10          Jodi                                               Female     7000
    11          Tom                                                Male       5500
    12          Ron                                                Male       5000
    

    Let me run a few different scenarios and see how the data is affected, moving from the most verbose syntax to the simplest.

    Select *,SUM(salary) Over(order by salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_sal from employees
    
    Id          Name                                               Gender     Salary      sum_sal
    ----------- -------------------------------------------------- ---------- ----------- -----------
    6           Aradhya                                            Female     3500        3500
    5           Sara                                               Female     4000        7500
    2           John                                               Male       4500        12000
    3           Pavan                                              Male       5000        32000
    1           Mark                                               Male       5000        32000
    8           Mary                                               Female     5000        32000
    12          Ron                                                Male       5000        32000
    11          Tom                                                Male       5500        48500
    7           Tom                                                Male       5500        48500
    4           Pam                                                Female     5500        48500
    9           Ben                                                Male       6500        55000
    10          Jodi                                               Female     7000        62000
    

    Look at the sum_sal column. The query orders by Salary and uses "RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW". There is no PARTITION BY, so the entire data set is treated as one partition, ordered by salary. The important part is UNBOUNDED PRECEDING AND CURRENT ROW: for each row, the sum runs from the first row of the partition up to the current row. Counting row by row, you might expect 17000 for salary=5000 and Name=Pavan, and 22000 for salary=5000 and Name=Mark. But RANGE treats rows with the same ORDER BY value as one logical group, computes the result over the whole group, and assigns it to every row in that group. That is why all the salary=5000 rows show the same value: the engine summed up to the last salary=5000 row (Name=Ron) and assigned that sum to every salary=5000 row.

    Select *,SUM(salary) Over(order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_sal from employees
    
    
    Id          Name                                               Gender     Salary      sum_sal
    ----------- -------------------------------------------------- ---------- ----------- -----------
    6           Aradhya                                            Female     3500        3500
    5           Sara                                               Female     4000        7500
    2           John                                               Male       4500        12000
    3           Pavan                                              Male       5000        17000
    1           Mark                                               Male       5000        22000
    8           Mary                                               Female     5000        27000
    12          Ron                                                Male       5000        32000
    11          Tom                                                Male       5500        37500
    7           Tom                                                Male       5500        43000
    4           Pam                                                Female     5500        48500
    9           Ben                                                Male       6500        55000
    10          Jodi                                               Female     7000        62000
    

    With ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, rows with the same value are not grouped together: the SUM runs from the first row to the current physical row, so tied rows get different running totals, unlike with RANGE. Note that the order among tied rows is not guaranteed unless ORDER BY includes a tie-breaker.
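
    The two frames can be put side by side in one query. The sketch below is an illustration under assumptions rather than the author's setup: it runs in SQLite (3.25 or later) through Python's sqlite3 module instead of SQL Server, and the ROWS window adds Id as a tie-breaker, which the original query does not have, so that the running total is deterministic.

```python
import sqlite3

# The employees data set from the answer.
employees = [
    (1, "Mark", "Male", 5000), (2, "John", "Male", 4500),
    (3, "Pavan", "Male", 5000), (4, "Pam", "Female", 5500),
    (5, "Sara", "Female", 4000), (6, "Aradhya", "Female", 3500),
    (7, "Tom", "Male", 5500), (8, "Mary", "Female", 5000),
    (9, "Ben", "Male", 6500), (10, "Jodi", "Female", 7000),
    (11, "Tom", "Male", 5500), (12, "Ron", "Male", 5000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (Id INTEGER, Name TEXT, Gender TEXT, Salary INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", employees)

sql = """
SELECT Id, Salary,
       -- ROWS: physical rows up to the current one (Id added as a tie-breaker).
       SUM(Salary) OVER (ORDER BY Salary, Id
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum,
       -- RANGE: all rows whose Salary is <= the current row's Salary.
       SUM(Salary) OVER (ORDER BY Salary
                         RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum
FROM employees
ORDER BY Salary, Id
"""

result = conn.execute(sql).fetchall()
for row in result:
    print(row)
```

    The two columns differ only on the tied salaries (5000 and 5500): rows_sum keeps climbing row by row, while range_sum jumps straight to the total for the whole group of ties.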

    Select *,SUM(salary) Over(order by salary) as sum_sal from employees
    
    Id          Name                                               Gender     Salary      sum_sal
    ----------- -------------------------------------------------- ---------- ----------- -----------
    6           Aradhya                                            Female     3500        3500
    5           Sara                                               Female     4000        7500
    2           John                                               Male       4500        12000
    3           Pavan                                              Male       5000        32000
    1           Mark                                               Male       5000        32000
    8           Mary                                               Female     5000        32000
    12          Ron                                                Male       5000        32000
    11          Tom                                                Male       5500        48500
    7           Tom                                                Male       5500        48500
    4           Pam                                                Female     5500        48500
    9           Ben                                                Male       6500        55000
    10          Jodi                                               Female     7000        62000
    

    These results are the same as

    Select *, SUM(salary) Over(order by salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_sal from employees
    

    That is because Over(order by salary) is just a shortcut for Over(order by salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). Wherever we specify ORDER BY without ROWS or RANGE, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is used as the default.

    Note: This applies only to functions that actually accept ROWS/RANGE. ROW_NUMBER and a few others don't accept ROWS/RANGE, so for them this default doesn't come into the picture.

    So far, the OVER clause with ORDER BY has used a ROWS/RANGE frame such as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which calculates from the first row up to the current row. But what if we want the value calculated over the entire partition (that is, from the first row to the last row) and shown on each row? Here is the query for that:

    Select *,sum(salary) Over(order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum_sal from employees
    
    Id          Name                                               Gender     Salary      sum_sal
    ----------- -------------------------------------------------- ---------- ----------- -----------
    1           Mark                                               Male       5000        62000
    2           John                                               Male       4500        62000
    3           Pavan                                              Male       5000        62000
    4           Pam                                                Female     5500        62000
    5           Sara                                               Female     4000        62000
    6           Aradhya                                            Female     3500        62000
    7           Tom                                                Male       5500        62000
    8           Mary                                               Female     5000        62000
    9           Ben                                                Male       6500        62000
    10          Jodi                                               Female     7000        62000
    11          Tom                                                Male       5500        62000
    12          Ron                                                Male       5000        62000
    

    Instead of CURRENT ROW, I specify UNBOUNDED FOLLOWING, which instructs the engine to calculate up to the last row of the partition for each row.

    Now, coming to your question: what is OVER() with empty parentheses?

    For an aggregate like this, it gives the same result as Over(order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).

    Here we are implicitly asking the engine to treat the whole result set as a single partition and to calculate from the first row to the last row of that partition.

    Select *,Sum(salary) Over() as sum_sal from employees
    
    Id          Name                                               Gender     Salary      sum_sal
    ----------- -------------------------------------------------- ---------- ----------- -----------
    1           Mark                                               Male       5000        62000
    2           John                                               Male       4500        62000
    3           Pavan                                              Male       5000        62000
    4           Pam                                                Female     5500        62000
    5           Sara                                               Female     4000        62000
    6           Aradhya                                            Female     3500        62000
    7           Tom                                                Male       5500        62000
    8           Mary                                               Female     5000        62000
    9           Ben                                                Male       6500        62000
    10          Jodi                                               Female     7000        62000
    11          Tom                                                Male       5500        62000
    12          Ron                                                Male       5000        62000
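
    To see the equivalence concretely, the empty OVER() and the explicit full frame can be selected in the same query. This is a sketch under assumptions: it uses SQLite (3.25 or later) through Python's sqlite3 module rather than SQL Server, and the third column, a per-gender total with PARTITION BY Gender, is an extra illustration that is not in the queries above.

```python
import sqlite3

# The employees data set from the answer.
employees = [
    (1, "Mark", "Male", 5000), (2, "John", "Male", 4500),
    (3, "Pavan", "Male", 5000), (4, "Pam", "Female", 5500),
    (5, "Sara", "Female", 4000), (6, "Aradhya", "Female", 3500),
    (7, "Tom", "Male", 5500), (8, "Mary", "Female", 5000),
    (9, "Ben", "Male", 6500), (10, "Jodi", "Female", 7000),
    (11, "Tom", "Male", 5500), (12, "Ron", "Male", 5000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (Id INTEGER, Name TEXT, Gender TEXT, Salary INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", employees)

sql = """
SELECT Id, Gender,
       -- Empty OVER(): the whole result set is one partition and one frame.
       SUM(Salary) OVER () AS total_all,
       -- Explicit frame covering the whole partition.
       SUM(Salary) OVER (ORDER BY Salary
                         ROWS BETWEEN UNBOUNDED PRECEDING
                                  AND UNBOUNDED FOLLOWING) AS total_full_frame,
       -- Extra illustration: one total per gender.
       SUM(Salary) OVER (PARTITION BY Gender) AS total_by_gender
FROM employees
ORDER BY Id
"""

result = conn.execute(sql).fetchall()
for row in result:
    print(row)
```

    Every row carries the same grand total in the first two computed columns, while the third repeats the total of that row's own gender.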
    

    I did create a video on this and if you are interested you can visit it. https://www.youtube.com/watch?v=CvVenuVUqto&t=1177s

    Thanks, Pavan Kumar Aryasomayajulu HTTP://xyzcoder.github.io

  • 2020-11-28 18:13

    The OVER clause, when combined with PARTITION BY, states that the preceding function call must be done analytically by evaluating the returned rows of the query. Think of it as an inline GROUP BY statement.

    OVER (PARTITION BY SalesOrderID) states that the SUM, AVG, etc. function should return its value OVER a subset of the records returned by the query, and PARTITION that subset BY the foreign key SalesOrderID.

    So we will SUM every OrderQty record for EACH UNIQUE SalesOrderID, and that column will be called 'Total'.

    It is a MUCH more efficient means than using multiple inline views to find out the same information. You can then put this query within an inline view and filter on Total.

    SELECT ...
    FROM (your query) inlineview
    WHERE Total < 200
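
    The inline view is needed because a windowed aggregate cannot be referenced directly in the WHERE clause of the query that computes it. Here is a minimal runnable sketch of the pattern, under assumptions: it uses SQLite (3.25 or later) through Python's sqlite3 module rather than SQL Server, and the order_detail table and its four rows are hypothetical.

```python
import sqlite3

# Hypothetical order lines: (SalesOrderID, ProductID, OrderQty).
lines = [(1, 10, 100), (1, 11, 150), (2, 10, 50), (2, 12, 30)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_detail (SalesOrderID INTEGER, ProductID INTEGER, OrderQty INTEGER)"
)
conn.executemany("INSERT INTO order_detail VALUES (?, ?, ?)", lines)

sql = """
SELECT SalesOrderID, ProductID, OrderQty, Total
FROM (
    -- Inner query: compute the per-order total next to every detail row.
    SELECT SalesOrderID, ProductID, OrderQty,
           SUM(OrderQty) OVER (PARTITION BY SalesOrderID) AS Total
    FROM order_detail
) inlineview
-- Outer query: Total is now an ordinary column and can be filtered on.
WHERE Total < 200
ORDER BY SalesOrderID, ProductID
"""

result = conn.execute(sql).fetchall()
for row in result:
    print(row)
```

    Order 1 totals 250 and is filtered out entirely; both lines of order 2 (total 80) are kept.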
    
  • 2020-11-28 18:15

    Let me explain with an example and you would be able to see how it works.

    Assuming you have the following table DIM_EQUIPMENT:

    VIN         MAKE    MODEL   YEAR    COLOR
    -----------------------------------------
    1234ASDF    Ford    Taurus  2008    White
    1234JKLM    Chevy   Truck   2005    Green
    5678ASDF    Ford    Mustang 2008    Yellow
    

    Run the SQL below:

    SELECT VIN,
      MAKE,
      MODEL,
      YEAR,
      COLOR ,
      COUNT(*) OVER (PARTITION BY YEAR) AS COUNT2
    FROM DIM_EQUIPMENT
    

    The result would be as below

    VIN         MAKE    MODEL   YEAR    COLOR     COUNT2
    ----------------------------------------------------
    1234JKLM    Chevy   Truck   2005    Green     1
    5678ASDF    Ford    Mustang 2008    Yellow    2
    1234ASDF    Ford    Taurus  2008    White     2
    

    See what happened.

    You are able to count by YEAR without a GROUP BY and still show the count on every row.

    Another interesting way to get the same result is the WITH clause, shown below. WITH works like an inline view and can simplify queries, especially complex ones. That is not the case here, since I am only trying to show the usage.

     WITH EQ AS
      ( SELECT YEAR AS YEAR2, COUNT(*) AS COUNT2 FROM DIM_EQUIPMENT GROUP BY YEAR
      )
    SELECT VIN,
      MAKE,
      MODEL,
      YEAR,
      COLOR,
      COUNT2
    FROM DIM_EQUIPMENT,
      EQ
    WHERE EQ.YEAR2=DIM_EQUIPMENT.YEAR;
    
  • 2020-11-28 18:16
    • Also called the Query Partition Clause.
    • Similar to the Group By Clause

      • break up data into chunks (or partitions)
      • separate by partition bounds
      • function performs within partitions
      • re-initialised when crossing partition boundary

    Syntax:
    function (...) OVER (PARTITION BY col1, col3, ...)

    • Functions

      • Familiar functions such as COUNT(), SUM(), MIN(), MAX(), etc
      • New functions as well (e.g. ROW_NUMBER(), RATIO_TO_REPORT(), etc.)


    More info with example : http://msdn.microsoft.com/en-us/library/ms189461.aspx

  • 2020-11-28 18:25

    If you only wanted to GROUP BY the SalesOrderID then you wouldn't be able to include the ProductID and OrderQty columns in the SELECT clause.

    The PARTITION BY clause lets you break up your aggregate functions. One obvious and useful example would be if you wanted to generate line numbers for order lines on an order:

    SELECT
        O.order_id,
        O.order_date,
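        -- SQL Server requires ORDER BY inside OVER for ROW_NUMBER(); (SELECT NULL) leaves
        -- the numbering order arbitrary, so substitute a real ordering column if one exists.
        ROW_NUMBER() OVER(PARTITION BY O.order_id ORDER BY (SELECT NULL)) AS line_item_no,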
        OL.product_id
    FROM
        Orders O
    INNER JOIN Order_Lines OL ON OL.order_id = O.order_id
    

    (My syntax might be off slightly)

    You would then get back something like:

    order_id    order_date    line_item_no    product_id
    --------    ----------    ------------    ----------
        1       2011-05-02         1              5
        1       2011-05-02         2              4
        1       2011-05-02         3              7
        2       2011-05-12         1              8
        2       2011-05-12         2              1
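
    For deterministic line numbers, the window needs an ORDER BY of its own. Here is a small runnable sketch of that, under assumptions: it uses SQLite (3.25 or later) through Python's sqlite3 module rather than SQL Server, it queries a hypothetical Order_Lines table on its own, and it orders lines by product_id, which is an arbitrary choice made only for the example.

```python
import sqlite3

# Hypothetical order lines: (order_id, product_id).
order_lines = [(1, 5), (1, 4), (1, 7), (2, 8), (2, 1)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Order_Lines (order_id INTEGER, product_id INTEGER)")
conn.executemany("INSERT INTO Order_Lines VALUES (?, ?)", order_lines)

sql = """
SELECT order_id,
       -- Numbering restarts at 1 for every order_id; within an order,
       -- lines are numbered by ascending product_id (assumed ordering).
       ROW_NUMBER() OVER (PARTITION BY order_id
                          ORDER BY product_id) AS line_item_no,
       product_id
FROM Order_Lines
ORDER BY order_id, line_item_no
"""

result = conn.execute(sql).fetchall()
for row in result:
    print(row)
```

    Because the numbering follows product_id here, the line numbers land on different products than in the sample output above, which does not say how its lines were ordered.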
    