Chapter 7: Aggregations and Windowing

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

CHAPTER 7

Aggregations and Windowing

by Wayne Sheffield

In this chapter, we will look at several of the built-in functions that are frequently used when querying data for reporting purposes. We’ll start off with the aggregate functions in their nonwindowed form. We’ll then explore the windowing functions: aggregate functions, ranking functions, analytic functions, and the NEXT VALUE FOR sequence generation function.

Aggregate Functions

Aggregate functions are used to perform a calculation on one or more values, resulting in a single value. If your query has any columns with any nonwindowed aggregate functions, then a GROUP BY clause is required for the query. Table 7-1 shows the various aggregate functions.

Table 7-1. Aggregate Functions

Function Name	Description
AVG	The AVG aggregate function calculates the average of non-NULL values in a group.
CHECKSUM_AGG	The CHECKSUM_AGG function returns a checksum value based on a group of rows, allowing you to potentially track changes to a table. For example, adding a new row or changing the value of a column that is being aggregated will usually result in a new checksum integer value. The reason I say “usually” is because there is a possibility that the checksum value does not change even if values are modified.
COUNT	The COUNT aggregate function returns an integer data type showing the count of rows in a group, including rows with NULL values.
COUNT_BIG	The COUNT_BIG aggregate function returns a bigint data type showing the count of rows in a group, including rows with NULL values.
GROUPING	The GROUPING function returns 1 (True) or 0 (False) depending on whether a NULL value is due to a CUBE, ROLLUP, or GROUPING SETS operation. If False, the column expression NULL value is from the natural data. See Recipe 5-8 for usage of this function.
MAX	The MAX aggregate function returns the highest value in a set of non-NULL values.
MIN	The MIN aggregate function returns the lowest value in a group of non-NULL values.
STDEV	The STDEV function returns the standard deviation of all values provided in the expression based on a sample of the data population.
STDEVP	The STDEVP function also returns the standard deviation for all values in the provided expression, based upon the entire data population.
SUM	The SUM aggregate function returns the summation of all non-NULL values in an expression.
VAR	The VAR function returns the statistical variance of values in an expression based upon a sample of the provided population.
VARP	The VARP function returns the statistical variance of values in an expression based upon the entire data population.

With the exception of the COUNT, COUNT_BIG, and GROUPING functions, all of the aggregate functions have the same syntax (the syntax and usage of the GROUPING function is discussed in Recipe 5-8; the syntax for the COUNT and COUNT_BIG functions is discussed in Recipe 7-2).

function_name ( { [ [ ALL | DISTINCT ] expression ] } )

where expression is typically the column or calculation that the function will be calculated over. If the optional keyword DISTINCT is used, then only distinct values will be considered. If the optional keyword ALL is used, then all values will be considered. If neither is specified, then ALL is used by default. Aggregate functions and subqueries are not allowed for the expression parameter.

The next few recipes demonstrate these aggregate functions.

7-1. Computing an Average

Problem

You want to see the average rating of your products.

Solution

Use the AVG function to determine an average.

SELECT ProductID,
       AVG(Rating) AS AvgRating
FROM   Production.ProductReview
GROUP  BY ProductID;

This query produces the following result set:

ProductID     AvgRating
---------     ---------
709           5
798           5
937           3

How It Works

The AVG aggregate function calculates the average of non-NULL values in a group. To demonstrate the use of DISTINCT, let’s compare the columns returned from the following query:

SELECT StudentId,
       AVG(Grade) AS AvgGrade,
       AVG(DISTINCT Grade) AS AvgDistinctGrade
FROM  (VALUES (1, 100),
              (1, 100),
              (1, 100),
              (1, 99),
              (1, 99),
              (1, 98),
              (1, 98),
              (1, 95),
              (1, 95),
              (1, 95)
       ) dt (StudentId, Grade)
GROUP BY StudentID;

This query produces the following result set:

StudentId     AvgGrade     AvgDistinctGrade
---------     --------     ----------------
1             97           98

In this example, we have a student with 10 grades. The average of all 10 grades is 97. Within these 10 grades are 4 distinct grades. The average of these distinct grades is 98.

When utilizing the AVG function, the expression parameter must be one of the numeric data types.

7-2. Counting the Rows in a Group

Problem

You want to see the number of products you have in inventory on each shelf for your first five shelves.

Solution

Utilize the COUNT or COUNT_BIG function to return the count of rows in a group.

SELECT TOP (5)
       Shelf,
       COUNT(ProductID) AS ProductCount,
       COUNT_BIG(ProductID) AS ProductCountBig
FROM   Production.ProductInventory
GROUP  BY Shelf
ORDER  BY Shelf;

This query returns the following result set:

Shelf     ProductCount     ProductCountBig
-----     ------------     ---------------
A         81               81
B         36               36
C         55               55
D         50               50
E         85               85

How It Works

The COUNT and COUNT_BIG functions are utilized to return a count of the number of items in a group. The only difference between them is the data type returned: COUNT returns an INTEGER, while COUNT_BIG returns a BIGINT. The syntax for these functions is as follows:

COUNT | COUNT_BIG ( { [ [ ALL | DISTINCT ] expression ] | * } )

The difference between this syntax and the other aggregate functions is the optional asterisk (*) that can be specified. When COUNT(*) is utilized, this specifies that all rows should be counted to return the total number of rows within a table without getting rid of duplicates. COUNT(*) does not use any parameters, so it does not use any information about any column.

When utilizing the COUNT or COUNT_BIG function, the expression parameter can be of any data type except for the text, image, or ntext data type.

7-3. Summing the Values in a Group

Problem

You want to see the total due by account number for orders placed.

Summary

Utilize the SUM function to add up a column.

SELECT TOP (5)
       AccountNumber,
       SUM(TotalDue) AS TotalDueByAccountNumber
FROM   Sales.SalesOrderHeader
GROUP BY AccountNumber
ORDER BY AccountNumber;

This code returns the following result set:

AccountNumber      TotalDueByAccountNumber
--------------     -----------------------
10-4020-000001     95924.0197
10-4020-000002     28309.9672
10-4020-000003     407563.0075
10-4020-000004     660645.9404
10-4020-000005     97031.2173

How It Works

The SUM function returns the total of all values in the column being totaled. If the DISTINCT keyword is specified, the total of the distinct values in the column will be returned.

When utilizing the SUM function, the expression parameter must be one of the exact or approximate numeric data types, except for the bit data type.

7-4. Finding the High and Low Values in a Group

Problem

You want to see the highest and lowest ratings given on your products.

Solution

Utilize the MAX and MIN functions to return the highest and lowest ratings.

SELECT MIN(Rating) MinRating,
       MAX(Rating) MaxRating
FROM   Production.ProductReview;

This query returns the following result set:

MinRating     MaxRating
---------     ---------
2             5

How It Works

The MAX and MIN functions return the highest and lowest values from the expression being evaluated. Since nonaggregated columns are not specified, a GROUP BY clause is not required.

When utilizing the MAX and MIN functions, the expression parameter can be of any of the numeric, character, uniqueidentifier, or datetime data types.

7-5. Detecting Changes in a Table

Problem

You need to determine whether any changes have been made to the data in a column.

Solution

Utilize the CHECKSUM_AGG function to detect changes in a table.

SELECT StudentId,
       CHECKSUM_AGG(Grade) AS GradeChecksumAgg
FROM (VALUES (1, 100),
             (1, 100),
             (1, 100),
             (1, 99),
             (1, 99),
             (1, 98),
             (1, 98),
             (1, 95),
             (1, 95),
             (1, 95)
      ) dt (StudentId, Grade)
GROUP BY StudentID;
   
SELECT StudentId,
       CHECKSUM_AGG(Grade) AS GradeChecksumAgg
FROM   (VALUES (1, 100),
               (1, 100),
               (1, 100),
               (1, 99),
               (1, 99),
               (1, 98),
               (1, 98),
               (1, 95),
               (1, 95),
               (1, 90)
         ) dt (StudentId, Grade)
GROUP BY StudentID;

These queries return the following result sets:

StudentId     GradeChecksumAgg
---------     ----------------
1             59
 
StudentId     GradeChecksumAgg
---------     ----------------
1             62

How It Works

The CHECKSUM_AGG function returns the checksum of the values in the group, in this case the Grade column. In the second query, the last grade is changed, and when the query is rerun, the aggregated checksum returns a different value.

When utilizing the CHECKSUM_AGG function, the expression parameter must be an integer data type.

Note Because of the hashing algorithm being used, it is possible for the CHECKSUM_AGG function to return the same value with different data. You should use this only if your application can tolerate occasionally missing a change.

7-6. Finding the Statistical Variance in the Values of a Column

Problem

You need to find the statistical variance of the data values in a column.

Solution

Utilize the VAR or VARP functions to return statistical variance.

SELECT VAR(TaxAmt) AS Variance_Sample,
       VARP(TaxAmt) AS Variance_EntirePopulation
FROM   Sales.SalesOrderHeader;

This query returns the following result set:

Variance_Sample      Variance_EntirePopulation
----------------     -------------------------
1177342.57277401     1177305.15524429

How It Works

The VAR and VARP functions return the statistical variance of all the values in the specified expression. VAR returns the value based upon a sample of the data population; VARP returns the value based upon the entire data population.

When utilizing the VAR or VARP functions, the expression parameter must be one of the exact or approximate numeric data types, except for the bit data type.

7-7. Finding the Standard Deviation in the Values of a Column

Problem

You need to see the standard deviation of the data values in a column.

Solution

Utilize the STDEV and STDEVP functions to obtain standard deviation values.

SELECT STDEV(UnitPrice) AS StandDevUnitPrice,
       STDEVP(UnitPrice) AS StandDevPopUnitPrice
FROM   Sales.SalesOrderDetail;

This query returns the following result set:

StandDevUnitPrice     StandDevPopUnitPrice
-----------------     --------------------
751.885080772954      751.881981921885

How It Works

The STDEV or STDEVP functions return the standard deviation of all the values in the specified expression. STDEV returns the value based upon a sample of the data population; STDEVP returns the value based upon the entire data population.

When utilizing the STDEV or STDEVP functions, the expression parameter must be one of the exact or approximate numeric data types, except for the bit data type.

Windowing Functions

SQL Server is designed to work best on sets of data. By definition, sets of data are unordered; it is not until the final ORDER BY clause that the final results of the query become ordered. Windowing functions allow your query to look at only a subset of the rows being returned by your query to apply the function to. In doing so, they allow you to specify an order to your unordered data set before the final result is ordered. This allows for processes that previously required self-joins, use of inefficient inequality operators, or non-set-based row-by-row processing to use set-based processing.

The key to windowing functions is in controlling the order that the rows are evaluated in, when the evaluation is restarted, and what set of rows within the result set to consider for the function (the window of the data set that the function will be applied to). These actions are performed with the OVER clause.

There are four groups of functions that the OVER clause can be applied to; in other words, there are four groups of functions that can be windowed. These groups are the aggregate functions, the ranking functions, the analytic functions, and the sequence function.

The syntax for the OVER clause is as follows:

OVER (
  [ <PARTITION BY clause> ]
  [ <ORDER BY clause> ]
  [ <ROW or RANGE clause> ]
  )
 
<PARTITION BY clause> ::=
PARTITION BY value_expression , ... [ n ]
 
<ORDER BY clause> ::=
ORDER BY order_by_expression
  [ COLLATE collation_name ]
  [ ASC | DESC ]
  [ ,... n ]
 
<ROW or RANGE clause> ::=
{ ROWS | RANGE } <window frame extent>
 
<window frame extent> ::=
{ <window frame preceding>
 | <window frame between>
}
 
<window frame between> ::=
 BETWEEN <window frame bound> AND <window frame bound>
 
<window frame bound> ::=
{ <window frame preceding>
 | <window frame following>
}
 
<window frame preceding> ::=
{
  UNBOUNDED PRECEDING
 | <unsigned_value_specification> PRECEDING
 | CURRENT ROW
}
 
<window frame following> ::=
{
  UNBOUNDED FOLLOWING
 | <unsigned_value_specification> FOLLOWING
 | CURRENT ROW
}
 
<unsigned value specification> ::=
{ <unsigned integer literal> }

Table 7-2 explains each of these parameters.

Table 7-2. OVER Clause Parameters

Function Name	Description
PARTITION BY	Divides the query result set into partitions. The window function is applied to each partition separately, and computation restarts for each partition.
value_expression	Specifies the column by which the row set is partitioned. value_expression can refer only to columns made available by the FROM clause. value_expression cannot refer to expressions or aliases in the select list. value_expression can be a column expression, scalar subquery, scalar function, or user-defined variable.
ORDER BY clause	Defines the logical order of the rows within each partition of the result set. That is, it specifies the logical order in which the window function calculation is performed. Using the ORDER BY clause in the OVER clause does not control the ordering of the final result set. To guarantee the order of the final result set, you must utilize the ORDER BY clause of the SELECT statement.
order_by_expression	Specifies a column or expression on which to sort. order_by_expression can only refer to columns made available by the FROM clause. An integer cannot be specified to represent a column name or alias.
ROWS \| RANGE	Further limits the rows within the partition by specifying start and end points within the partition. This is done by specifying a range of rows with respect to the current row by either logical association or physical association. Physical association is achieved by using the ROWS clause. 1. The ROWS clause limits the rows within a partition by specifying a fixed number of rows preceding or following the current row. Alternatively, the RANGE clause logically limits the rows within a partition by specifying a range of values with respect to the value in the current row. Preceding and following rows are defined based on the ordering in the ORDER BY clause. The window frame RANGE … CURRENT ROW … includes all rows that have the same values in the ORDER BY expression as the current row. For example, ROWS BETWEEN 2 PRECEDING AND CURRENT ROW means that the window of rows that the function operates on is three rows in size, starting with two rows preceding until and including the current row.
UNBOUNDED PRECEDING	Specifies that the window starts at the first row of the partition. UNBOUNDED PRECEDING can be specified only as a window starting point.
unsigned value specification PRECEDING	Indicates the number of rows or values to precede the current row. This specification is not allowed for RANGE.
CURRENT ROW	Specifies that the window starts or ends at the current row when used with ROWS or the current value when used with RANGE. CURRENT ROW can be specified as both a starting and ending point.
BETWEEN <window frame bound > AND <window frame bound >	Used with either ROWS or RANGE to specify the lower (starting) and upper (ending) boundary points of the window. <window frame bound> defines the boundary starting point, and <window frame bound> defines the boundary end point. The upper bound cannot be smaller than the lower bound.
UNBOUNDED FOLLOWING	Specifies that the window ends at the last row of the partition. UNBOUNDED FOLLOWING can be specified only as a window end point. For example, RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING defines a window that starts with the current row and ends with the last row of the partition.
<unsigned value specification> FOLLOWING	Specified with <unsigned value specification> to indicate the number of rows or values to follow the current row. When <unsigned value specification>FOLLOWING is specified as the window starting point, the ending point must be <unsigned value specification> FOLLOWING. For example, ROWS BETWEEN 2 FOLLOWING AND 10 FOLLOWING defines a window that starts with the second row that follows the current row and ends with the tenth row that follows the current row. This specification is not allowed for RANGE.
unsigned integer literal	Is a positive integer literal (including 0) that specifies the number of rows or values to precede or follow the current row or value. This specification is valid only for ROWS.

Each of the functions allows for and requires various clauses of the OVER clause.

Windowed Aggregate Functions

With the exception of the GROUPING function, all of the aggregate functions can be windowed through the OVER clause. Additionally, the new ROWS | RANGE clause allows you to perform running aggregations and sliding aggregations.

Most of the recipes in the “Windowed Aggregate Functions” section utilize the following table and data:

CREATE TABLE #Transactions
    (
    AccountId INTEGER,
    TranDate DATE,
    TranAmt NUMERIC(8, 2)
    );
INSERT INTO #Transactions
SELECT *
FROM ( VALUES ( 1, '2011-01-01', 500),
              ( 1, '2011-01-15', 50),
              ( 1, '2011-01-22', 250),
              ( 1, '2011-01-24', 75),
              ( 1, '2011-01-26', 125),
              ( 1, '2011-01-28', 175),
              ( 2, '2011-01-01', 500),
              ( 2, '2011-01-15', 50),
              ( 2, '2011-01-22', 25),
              ( 2, '2011-01-23', 125),
              ( 2, '2011-01-26', 200),
              ( 2, '2011-01-29', 250),
              ( 3, '2011-01-01', 500),
              ( 3, '2011-01-15', 50 ),
              ( 3, '2011-01-22', 5000),
              ( 3, '2011-01-25', 550),
              ( 3, '2011-01-27', 95 ),
              ( 3, '2011-01-30', 2500)
    ) dt (AccountId, TranDate, TranAmt);

7-8. Calculating Totals Based Upon the Prior Row

Problem

You need to calculate the total of a column where the total is the sum of the column through the current row. For instance, for each account, calculate the total transaction amount to date in date order.

Solution

Utilize the SUM function with the OVER clause to perform a running total.

SELECT AccountId,
       TranDate,
       TranAmt,
       -- running total of all transactions
       RunTotalAmt = SUM(TranAmt) OVER (PARTITION BY AccountId ORDER BY TranDate)
FROM   #Transactions AS t
ORDER  BY AccountId,
       TranDate;

This query returns the following result set:

How It Works

The OVER clause, when used in conjunction with the SUM function, allows us to perform a running total of the transaction. Within the OVER clause, the PARTITION BY clause is specified to restart the calculation every time the AccountId value changes. The ORDER BY clause is specified to determine in which order the rows should be calculated. Since the ROWS | RANGE clause is not specified, the default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is utilized. When the query is executed, the TranAmt column from all of the rows prior to and including the current row is summed up and returned.

Running aggregations can be performed over the other aggregate functions also. In this next example, the query is modified to perform running averages, counts, and minimum/maximum calculations.

SELECT AccountId,
       TranDate,
       TranAmt,
       -- running average of all transactions
       RunAvg = AVG(TranAmt) OVER (PARTITION BY AccountId ORDER BY TranDate),
       -- running total # of transactions
       RunTranQty = COUNT(*) OVER (PARTITION BY AccountId ORDER BY TranDate),
       -- smallest of the transactions so far
       RunSmallAmt = MIN(TranAmt) OVER (PARTITION BY AccountId ORDER BY TranDate),
       -- largest of the transactions so far
       RunLargeAmt = MAX(TranAmt) OVER (PARTITION BY AccountId ORDER BY TranDate),
       -- running total of all transactions
       RunTotalAmt = SUM(TranAmt) OVER (PARTITION BY AccountId ORDER BY TranDate)
FROM   #Transactions AS t
ORDER BY AccountId,
         TranDate;

This query returns the following result set:

7-9. Calculating Totals Based Upon a Subset of Rows

Problem

When performing these aggregations, you want only the current row and the two previous rows to be considered for the aggregation.

Solution

Utilize the ROWS clause of the OVER clause.

SELECT AccountId,
       TranDate,
       TranAmt,
       -- average of the current and previous 2 transactions
       SlideAvg = AVG(TranAmt)
                  OVER (PARTITION BY AccountId
                            ORDER BY TranDate
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
       -- total # of the current and previous 2 transactions
       SlideQty = COUNT(*)
                  OVER (PARTITION BY AccountId
                            ORDER BY TranDate
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
       -- smallest of the current and previous 2 transactions
       SlideMin = MIN(TranAmt)
                  OVER (PARTITION BY AccountId
                            ORDER BY TranDate
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
       -- largest of the current and previous 2 transactions
       SlideMax = MAX(TranAmt)
                  OVER (PARTITION BY AccountId
                            ORDER BY TranDate
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
       -- total of the current and previous 2 transactions
       SlideTotal = SUM(TranAmt)
                  OVER (PARTITION BY AccountId
                            ORDER BY TranDate
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM #Transactions AS t
ORDER BY AccountId,
      TranDate;

This query returns the following result set:

How It Works

The ROWS clause is added to the OVER clause of the aggregate functions to specify that the aggregate functions should look only at the current row and the previous two rows for their calculations. As you look at each column in the result set, you can see that the aggregation was performed over just these rows (the window of rows that the aggregation is applied to). As the query progresses through the result set, the window slides to encompass the specified rows relative to the current row.

7-10. Using a Logical Window

Problem

You want the rows being considered by the OVER clause to be affected by the value in the column instead of the physical ordering.

Solution

In the OVER clause, utilize the RANGE clause instead of the ROWS option.

DECLARE @Test TABLE
        (
        RowID INT IDENTITY,
        FName VARCHAR(20),
        Salary SMALLINT
        );
 
INSERT INTO @Test (FName, Salary)
VALUES ('George', 800),
       ('Sam', 950),
       ('Diane', 1100),
       ('Nicholas', 1250),
       ('Samuel', 1250),
       ('Patricia', 1300),
       ('Brian', 1500),
       ('Thomas', 1600),
       ('Fran', 2450),
       ('Debbie', 2850),
       ('Mark', 2975),
       ('James', 3000),
       ('Cynthia', 3000),
       ('Christopher', 5000);
 
SELECT RowID,
       FName,
       Salary,
       SumByRows = SUM(Salary)
                   OVER (ORDER BY Salary
                         ROWS UNBOUNDED PRECEDING),
       SumByRange = SUM(Salary)
                    OVER (ORDER BY Salary
                         RANGE UNBOUNDED PRECEDING)
FROM @Test
ORDER BY RowID;

This query returns the following result set:

How It Works

When utilizing the RANGE clause, the SUM function adjusts its window based upon the values in the specified column. The previous example shows the salary of your employees, and the SUM function is performing a running total of the salaries in order of the salary. For comparison purposes, the running total is being calculated with both the ROWS and RANGE clauses. There are two groups of employees that have the same salary: RowIDs 4 and 5 are both 1,250, and 12 and 13 are both 3,000. When the running total is calculated with the ROWS clause, you can see that the salary of the current row is being added to the prior total of the previous rows. However, when the RANGE clause is used, all of the rows that contain the value of the current row are totaled and added to the total of the previous value. The result is that for rows 4 and 5, both employees with a salary of 1,250 are added together for the running total (and this action is repeated for rows 12 and 13).

Ranking Functions

Ranking functions allow you to return a ranking value associated to each row in a partition of a result set. Depending on the function used, multiple rows may receive the same value within the partition, and there may be gaps between assigned numbers. Table 7-3 describes the four ranking functions.

Table 7-3. Ranking Functions

Function	Description
ROW_NUMBER	ROW_NUMBER returns an incrementing integer for each row within a partition of a set.
RANK	Similar to ROW_NUMBER, RANK increments its value for each row within a partition of the set. The key difference is if rows with tied values exist within the partition, they will receive the same rank value, and the next value will receive the rank value as if there had been no ties, producing a gap between assigned numbers.
DENSE_RANK	The difference between DENSE_RANK and RANK is that DENSE_RANK doesn’t have gaps in the rank values when there are tied values; the next value has the next rank assignment.
NTILE	NTILE divides the result set into a specified number of groups, based on the ordering and optional partition clause.

The syntax of the ranking functions is as follows:

ROW_NUMBER ( ) | RANK ( ) | DENSE_RANK ( ) | NTILE (integer_expression)
  OVER ( [ PARTITION BY value_expression , ... [ n ] ] order_by_clause )

where the optional PARTITION BY clause is a list of columns that control when to restart the numbering. If the PARTITION BY clause isn’t specified, all of the rows in the result set are treated as one partition. The ORDER BY clause determines the order in which the rows within a partition are assigned their unique row number value. For the NTILE function, the integer_expression is a positive integer constant expression.

7-11. Generating an Incrementing Row Number

Problem

You need to have a query return total sales information. You need to assign a row number to each row in order of the date of the order, and the numbering needs to start over for each account number.

Solution

Utilize the ROW_NUMBER function to assign row numbers to each row.

SELECT TOP 10
       AccountNumber,
       OrderDate,
       TotalDue,
       ROW_NUMBER() OVER (PARTITION BY AccountNumber ORDER BY OrderDate) AS RN
FROM Sales.SalesOrderHeader
ORDER BY AccountNumber;

This query returns the following result set:

How It Works

The ROW_NUMBER function is utilized to generate a row number for each row in the partition. The PARTITION_BY clause is utilized to restart the number generation for each change in the AccountNumber column. The ORDER_BY clause is utilized to order the numbering of the rows in the order of the OrderDate column.

You can also utilize the ROW_NUMBER function to create a virtual numbers, or tally, table.

Note A numbers, or tally, table is simply a table of sequential numbers, and it can be utilized to eliminate loops. Use your favorite Internet search tool to find information about what the numbers or tally table is and how it can replace loops. One excellent article is at www.sqlservercentral.com/articles/T-SQL/62867/.

For instance, the sys.all_columns system view has more than 8,000 rows. You can utilize this to easily build a numbers table with this code:

SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RN
FROM sys.all_columns;

This query will produce a row number for each row in the sys.all_columns view. In this instance, the ordering doesn’t matter, but it is required, so the ORDER BY clause is specified as "(SELECT NULL)". If you need more records than what are available in this table, you can simply cross join this table to itself, which will produce more than 64 million rows.

In this example, a table scan is required. Another method is to produce the numbers or tally table by utilizing constants. The following example creates a one million row virtual tally table without incurring any disk I/O operations:

WITH
TENS      (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL
                  SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL
                  SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),
THOUSANDS (N) AS (SELECT 1 FROM TENS t1 CROSS JOIN TENS t2 CROSS JOIN TENS t3),
MILLIONS  (N) AS (SELECT 1 FROM THOUSANDS t1 CROSS JOIN THOUSANDS t2),
TALLY     (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM MILLIONS)
SELECT N
FROM TALLY;

7-12. Returning Rows by Rank

Problem

You want to rank your salespeople based upon their sales quota.

Solution

Utilize the RANK function to rank your salespeople.

SELECT BusinessEntityID,
       QuotaDate,
       SalesQuota,
       RANK() OVER (ORDER BY SalesQuota DESC) AS RANK
FROM   Sales.SalesPersonQuotaHistory
WHERE  SalesQuota BETWEEN 266000.00 AND 319000.00;

This query returns the following result set:

How It Works

RANK assigns a ranking value to each row within a partition. If multiple rows within the partition tie with the same value, they are assigned the same ranking value (see rank 7 in the example). When there is a tie, the following value has a ranking value assigned as if none of the previous rows had ties. If there are no ties in the partition, the ranking value assigned is the same as if the ROW_NUMBER function had been used with the same OVER clause definition.

7-13. Returning Rows by Rank Without Gaps

Problem

You want to rank your salespeople based upon their sales quota without any gaps in the ranking value assigned.

Solution

Utilize the DENSE_RANK function to rank your salespeople without gaps.

SELECT BusinessEntityID,
       QuotaDate,
       SalesQuota,
       DENSE_RANK() OVER (ORDER BY SalesQuota DESC) AS DENSERANK
FROM   Sales.SalesPersonQuotaHistory
WHERE  SalesQuota BETWEEN 266000.00 AND 319000.00;

This query returns the following result set:

How It Works

DENSE_RANK assigns a ranking value to each row within a partition. If multiple rows within the partition tie with the same value, they are assigned the same ranking value (see rank 7 in the example). When there is a tie, the following value has the next ranking value assigned. If there are no ties in the partition, the ranking value assigned is the same as if the ROW_NUMBER function had been used with the same OVER clause definition.

7-14. Sorting Rows into Buckets

Problem

You want to split your salespeople up into four groups based upon their sales quota.

Solution

Utilize the NTILE function, and specify the number of groups to divide the result set into.

SELECT BusinessEntityID,
       QuotaDate,
       SalesQuota,
       NTILE(4) OVER (ORDER BY SalesQuota DESC) AS [NTILE]
FROM   Sales.SalesPersonQuotaHistory
WHERE  SalesQuota BETWEEN 266000.00 AND 319000.00;

This query produces the following result set:

How It Works

The NTILE function divides the result set into the specified number of groups based upon the partitioning and ordering specified in the OVER clause. Notice that the first two groups have three rows in that group, and the final two groups have two. If the number of rows in the result set is not evenly divisible by the specified number of groups, then the leading groups will have one extra row assigned to that group until the remainder has been accommodated.

7-15. Grouping Logically Consecutive Rows Together

Problem

You need to group logically consecutive rows together so that subsequent calculations can treat those rows identically. For instance, your manufacturing plant utilizes RFID tags to track the movement of your products. During the manufacturing process, a product may be rejected and sent back to an earlier part of the process to be corrected. You want to track the number of trips that a tag makes to an area. However, you have multiple sensors to detect the tags in your larger areas, so multiple consecutive hits from different sensors can be entered in the system. These consecutive entries need to be considered together based upon the time in the same area and should be treated as the same trip in your results.

This recipe will utilize the following data:

DECLARE @RFID_Location TABLE (
    TagId INTEGER,
    Location VARCHAR(25),
    SensorDate DATETIME);
INSERT INTO @RFID_Location
        (TagId, Location, SensorDate)
VALUES (1, 'Room1', '2012-01-10T08:00:01'),
       (1, 'Room1', '2012-01-10T08:18:32'),
       (1, 'Room2', '2012-01-10T08:25:42'),
       (1, 'Room3', '2012-01-10T09:52:48'),
       (1, 'Room2', '2012-01-10T10:05:22'),
       (1, 'Room3', '2012-01-10T11:22:15'),
       (1, 'Room4', '2012-01-10T14:18:58'),
       (2, 'Room1', '2012-01-10T08:32:18'),
       (2, 'Room1', '2012-01-10T08:51:53'),
       (2, 'Room2', '2012-01-10T09:22:09'),
       (2, 'Room1', '2012-01-10T09:42:17'),
       (2, 'Room1', '2012-01-10T09:59:16'),
       (2, 'Room2', '2012-01-10T10:35:18'),
       (2, 'Room3', '2012-01-10T11:18:42'),
       (2, 'Room4', '2012-01-10T15:22:18'),

Solution

Utilize two ROW_NUMBER functions, differing only in that the last column in the PARTITION BY clause has an extra column. The difference between these results will group logically consecutive rows together.

WITH cte AS
(
SELECT TagId, Location, SensorDate,
       ROW_NUMBER()
       OVER (PARTITION BY TagId
                 ORDER BY SensorDate) -
       ROW_NUMBER()
       OVER (PARTITION BY TagId, Location
                 ORDER BY SensorDate) AS Grp
FROM @RFID_Location
)
SELECT TagId, Location, SensorDate, Grp,
       DENSE_RANK()
       OVER (PARTITION BY TagId, Location
                 ORDER BY Grp) AS TripNbr
FROM  cte
ORDER BY TagId, SensorDate;

This query returns the following result set:

How It Works

The first ROW_NUMBER function partitions the result set by the TagId and assigns the row number as ordered by the SensorDate. The second ROW_NUMBER function partitions the result set by the TagId and Location and assigns the row number as ordered by the SensorDate. The difference between these will assign consecutive rows in the same location as the same Grp number. The previous results show that consecutive entries in the same location are assigned the same Grp number. The following query breaks the ROW_NUMBER functions down into individual columns so that you can see how this is performed:

WITH cte AS
(
SELECT TagId, Location, SensorDate,
       ROW_NUMBER()
       OVER (PARTITION BY TagId
                 ORDER BY SensorDate) AS RN1,
       ROW_NUMBER()
       OVER (PARTITION BY TagId, Location
                 ORDER BY SensorDate) AS RN2
FROM   @RFID_Location
)
SELECT TagId, Location, SensorDate,
       RN1, RN2, RN1-RN2 AS Grp
FROM   cte
ORDER BY TagId, SensorDate;

This query returns the following result set:

With this query, you can see that for each TagId, the RN1 column is sequentially numbered from 1 to the number of rows for that TagId. For the RN2 column, the Location is added to the PARTITION BY clause, resulting in the assigned row numbers being restarted every time the location changes. By looking at this result set, you can see that subtracting RN2 from RN1 returns a number where each trip to a location has a higher number than the previous trip, and consecutive reads in the same location are treated the same. It doesn’t matter that the calculated Grp column is not consecutive; it is the fact that it increases from the prior trip to this location that is critical. To handle calculating the trips, the DENSE_RANK function is utilized so that there will not be any gaps. The following query takes the first example and adds in the RANK function to illustrate the difference.

WITH cte AS
(
SELECT TagId, Location, SensorDate,
       ROW_NUMBER()
       OVER (PARTITION BY TagId
                 ORDER BY SensorDate) -
       ROW_NUMBER()
       OVER (PARTITION BY TagId, Location
                 ORDER BY SensorDate) AS Grp
FROM @RFID_Location
)
SELECT TagId, Location, SensorDate, Grp,
       DENSE_RANK()
       OVER (PARTITION BY TagId, Location
                 ORDER BY Grp) AS TripNbr,
       RANK()
       OVER (PARTITION BY TagId, Location
                 ORDER BY Grp) AS TripNbrRank
FROM  cte
ORDER BY TagId, SensorDate;

This query returns the following result set:

In this query, you can see how the RANK function returns the wrong trip number for TagId 2 for the second trip to Room1 (the fourth and fifth rows for this tag).

Analytic Functions

New to SQL Server 2012 are the analytic functions. Analytic functions compute an aggregate value on a group of rows. In contrast to the aggregate functions, they can return multiple rows for each group. Table 7-4 describes the analytic functions.

Table 7-4. Analytic Functions

Function	Description
CUME_DIST	CUME_DIST calculates the cumulative distribution of a value in a group of values. The cumulative distribution is the relative position of a specified value in a group of values.
FIRST_VALUE	Returns the first value from an ordered set of values.
LAG	Retrieves data from a previous row in the same result set as specified by a row offset from the current row.
LAST_VALUE	Returns the last value from an ordered set of values.
LEAD	Retrieves data from a subsequent row in the same result set as specified by a row offset from the current row.
PERCENTILE_CONT	Calculates a percentile based on a continuous distribution of the column value. The value returned may or may not be equal to any of the specific values in the column.
PERCENTILE_DISC	Computes a specific percentile for sorted values in the result set. The value returned will be the value with the smallest CUME_DIST value (for the same sort specification) that is greater than or equal to the specified percentile. The value returned will be equal to one of the values in the specific column.
PERCENT_RANK	Computes the relative rank of a row within a set.

The analytic functions come in complementary pairs and will be discussed in this manner.

7-16. Accessing Values from Other Rows

Problem

You need to write a sales summary report that shows the total due from orders by year and quarter. You want to include a difference between the current quarter and prior quarter, as well as a difference between the current quarter of this year and the same quarter of the previous year.

Solution

Aggregate the total due by year and quarter, and utilize the LAG function to look at the previous records.

WITH cte AS
(
SELECT DATEPART(QUARTER, OrderDate) AS Qtr,
       DATEPART(YEAR, OrderDate) AS Yr,
       TotalDue
FROM   Sales.SalesOrderHeader
), cteAgg AS
(
SELECT Yr,
       Qtr,
       SUM(TotalDue) AS TotalDue
FROM   cte
GROUP BY Yr, Qtr
)
SELECT Yr,
       Qtr,
       TotalDue,
       TotalDue - LAG(TotalDue, 1, NULL)
                  OVER (ORDER BY Yr, Qtr) AS DeltaPriorQtr,
       TotalDue - LAG(TotalDue, 4, NULL)
                  OVER (ORDER BY Yr, Qtr) AS DeltaPriorYrQtr
FROM cteAgg
ORDER BY Yr, Qtr;

This query returns the following result set:

How It Works

The first CTE is utilized to retrieve the year and quarter from the OrderDate column and to pass the TotalDue column to the rest of the query. The second CTE is used to aggregate the TotalDue column, grouping on the extracted Yr and Qtr columns. The final SELECT statement returns these aggregated values and then makes two calls to the LAG function. The first call retrieves the TotalDue column from the previous row in order to compute the difference between the current quarter and the previous quarter. The second call retrieves the TotalDue column from four rows prior to the current row in order to compute the difference between the current quarter and the same quarter one year ago.

The LEAD function works in a similar manner. In this example, a table is created that has gaps in the column. The table is then queried, comparing the value in the current row to the value in the next row. If the difference is greater than 1, then a gap exists and is returned in the result set. This solution is based on a method that I learned from Itzik Ben-Gan.

DECLARE @Gaps TABLE (col1 int PRIMARY KEY CLUSTERED);
 
INSERT INTO @Gaps (col1)
VALUES (1), (2), (3),
       (50), (51), (52), (53), (54), (55),
       (100), (101), (102),
       (500),
       (950), (951), (952),
       (954);
 
-- Compare the value of the current row to the next row.
-- If > 1, then there is a gap.
WITH cte AS
(
SELECT col1 AS CurrentRow,
       LEAD(col1, 1, NULL)
       OVER (ORDER BY col1) AS NextRow
FROM   @Gaps
)
SELECT cte.CurrentRow + 1 AS [Start of Gap],
       cte.NextRow - 1 AS [End of Gap]
FROM cte
WHERE cte.NextRow - cte.CurrentRow > 1;

This query returns the following result set:

Start of Gap     End of Gap
------------     ----------
4                49
56               99
103              499
501              949
953              953

The syntax for the LAG and LEAD functions is as follows:

LAG | LEAD (scalar_expression [,offset] [,default])
  OVER ( [ partition_by_clause ] order_by_clause )

Where scalar_expression is an expression of any type that returns a scalar value (typically a column), offset is the number of rows to offset the current row by, and default is the value to return if the value returned is NULL. The default value for offset is 1, and the default value for default is NULL.

7-17. Accessing the First or Last Value from a Partition

Problem

You need to write a report that shows, for each customer, the date that they placed their least and most expensive orders.

Solution

Utilize the FIRST_VALUE and LAST_VALUE functions.

SELECT DISTINCT TOP (5)
       CustomerID,
       FIRST_VALUE(OrderDate)
       OVER (PARTITION BY CustomerID
                 ORDER BY TotalDue
                  ROWS BETWEEN UNBOUNDED PRECEDING
                          AND UNBOUNDED FOLLOWING) AS OrderDateLow,
       LAST_VALUE(OrderDate)
       OVER (PARTITION BY CustomerID
                 ORDER BY TotalDue
                  ROWS BETWEEN UNBOUNDED PRECEDING
                          AND UNBOUNDED FOLLOWING) AS OrderDateHigh
FROM   Sales.SalesOrderHeader
ORDER BY CustomerID;

This query returns the following result set for the first five customers:

CustomerID	OrderDateLow	OrderDateHigh
-----------	-----------------------	-----------------------
11000	2007-07-22 00:00:00.000	2005-07-22 00:00:00.000
11001	2008-06-12 00:00:00.000	2005-07-18 00:00:00.000
11002	2007-07-04 00:00:00.000	2005-07-10 00:00:00.000
11003	2007-07-09 00:00:00.000	2005-07-01 00:00:00.000
11004	2007-07-26 00:00:00.000	2005-07-26 00:00:00.000

How It Works

The FIRST_VALUE and LAST_VALUE functions are used to return the OrderDate for the first and last rows in the partition. The window is set to a partition of the CustomerID, ordered by the TotalDue, and the ROWS clause is used to specify all of the rows for the partition (CustomerId). The syntax for the FIRST_VALUE and LAST_VALUE functions is as follows:

FIRST_VALUE ( scalar_expression )
  OVER ( [ partition_by_clause ] order_by_clause [ rows_range_clause ] )

where scalar_expression is an expression of any type that returns a scalar value (typically a column).

7-18. Calculating the Relative Position or Rank of a Value in a Set of Values

Problem

You want to know the relative position and rank of a customer’s order by the total of the order in respect to the total of all the customers’ orders.

Solution

Utilize the CUME_DIST and PERCENT_RANK functions to obtain the relative position of a value and the relative rank of a value.

SELECT CustomerID,
       CUME_DIST()
       OVER (PARTITION BY CustomerID
                 ORDER BY TotalDue) AS CumeDistOrderTotalDue,
       PERCENT_RANK()
       OVER (PARTITION BY CustomerID
                 ORDER BY TotalDue) AS PercentRankOrderTotalDue
FROM   Sales.SalesOrderHeader
ORDER BY CustomerID;

This code returns the following abridged result set:

CustomerID	CumeDistOrderTotalDue	PercentRankOrderTotalDue
----------	---------------------	------------------------
30116	0.25	0
30116	0.5	0.333333333333333
30116	0.75	0.666666666666667
30116	1	1
30117	0.0833333333333333	0
30117	0.166666666666667	0.0909090909090909
30117	0.25	0.181818181818182
30117	0.333333333333333	0.272727272727273
30117	0.416666666666667	0.363636363636364
30117	0.5	0.454545454545455
30117	0.583333333333333	0.545454545454545
30117	0.666666666666667	0.636363636363636
30117	0.75	0.727272727272727
30117	0.833333333333333	0.818181818181818
30117	0.916666666666667	0.909090909090909
30117	1	1
30118	0.125	0
30118	0.25	0.142857142857143
30118	0.375	0.285714285714286
30118	0.5	0.428571428571429
30118	0.625	0.571428571428571
30118	0.75	0.714285714285714
30118	0.875	0.857142857142857
30118	1	1

How It Works

The CUME_DIST function returns the relative position of a value in a set of values, while the PERCENT_RANK function returns the relative rank of a value in a set of values. The syntax of these functions is as follows:

CUME_DIST() | PERCENT_RANK( )
  OVER ( [ partition_by_clause ] order_by_clause )

The result returned by these functions will be a float(53) data type, with the value being greater than 0 and less than or equal to 1 (0 < x <= 1). When utilizing these functions, NULL values will be included, and the value returned will be the lowest possible value.

7-19. Calculating Continuous or Discrete Percentiles

Problem

You want to see the median salary and the 75 percentile salary for all employees per department.

Solution

Utilize the PERCENTILE_CONT and PERCENTILE_DISC functions to return percentile calculations based upon a value at a specified percentage.

DECLARE @Employees TABLE
        (
        EmplId INT PRIMARY KEY CLUSTERED,
        DeptId INT,
        Salary NUMERIC(8, 2)
        );
 
INSERT INTO @Employees
VALUES (1, 1, 10000),
       (2, 1, 11000),
       (3, 1, 12000),
       (4, 2, 25000),
       (5, 2, 35000),
       (6, 2, 75000),
       (7, 2, 100000);
 
SELECT EmplId,
       DeptId,
       Salary,
       PERCENTILE_CONT(0.5)
        WITHIN GROUP (ORDER BY Salary ASC)
        OVER (PARTITION BY DeptId) AS MedianCont,
       PERCENTILE_DISC(0.5)
        WITHIN GROUP (ORDER BY Salary ASC)
        OVER (PARTITION BY DeptId) AS MedianDisc,
       PERCENTILE_CONT(0.75)
        WITHIN GROUP (ORDER BY Salary ASC)
        OVER (PARTITION BY DeptId) AS Percent75Cont,
       PERCENTILE_DISC(0.75)
        WITHIN GROUP (ORDER BY Salary ASC)
        OVER (PARTITION BY DeptId) AS Percent75Disc,
       CUME_DIST()
        OVER (PARTITION BY DeptId
                 ORDER BY Salary) AS CumeDist
FROM @Employees
ORDER BY DeptId, EmplId;

This query returns the following result set:

How It Works

PERCENTILE_CONT calculates a percentile based upon a continuous distribution of values of the specified column. This is performed by using the specified percentile value (SP) and the number of rows in the partition (N) and by computing the row number of interest (RN) after the ordering has been applied. The row number of interest is computed from the formula RN = (1 + (SP * (N- 1))). The result returned is the linear interpolation (essentially an average) between the values from the rows at CEILING(RN) and FLOOR(RN). The value returned may or may not exist in the partition being analyzed.

PERCENTILE_DISC calculates a percentile based upon a discrete distribution of the column values. For the specified percentile (P), the values of the partition are sorted, and the value returned will be from the row with the smallest CUME_DIST value (with the same ordering) that is greater than or equal to P. The value returned will exist in one of the rows in the partition being analyzed. Since the result for this function is based on the CUME_DIST value, that function was included in the previous query in order to show its value.

The syntax for these functions is as follows:

PERCENTILE_CONT ( numeric_literal) | PERCENTILE_DISC ( numeric_literal )
  WITHIN GROUP ( ORDER BY order_by_expression [ ASC | DESC ] )
  OVER ( [ <partition_by_clause> ] )

In the example, PERCENTILE_CONT(0.5) is utilized to obtain the median value. For DeptId = 1, there are three rows, so the median value is the value from the middle row (after sorting). For DeptId = 2, there are four rows, so the median value is the average of the two middle rows. This value does not exist in this partition.

PERCENTILE_DISC returns a value that exists in the partition, based upon the CUME_DIST value. By specifying PERCENTILE_DISC(0.5), the row in each partition that has a CUME_DIST value of .5, or the next row that is greater than .5, is utilized, and the salary for that row is returned.

Sequences

Sequences are used to create an incrementing number. While similar to an identity column, they are not bound to any table, can be reset, and can be used across multiple tables. Sequences are discussed in detail in Recipe 13-22. Sequences are assigned by calling the NEXT VALUE FOR function, and multiple values can be assigned simultaneously. The order of these assignments can be controlled by use of the optional OVER clause of the NEXT VALUE FOR function.

7-20. Assigning Sequences in a Specified Order

Problem

You are inserting multiple student grades into a table. Each record needs to have a sequence assigned, and you want the sequences to be assigned in order of the grades.

Solution

Utilize the OVER clause of the NEXT VALUE FOR function, specifying the desired order.

IF EXISTS (SELECT 1
        FROM sys.sequences AS seq
        JOIN sys.schemas AS sch
        ON seq.schema_id = sch.schema_id
        WHERE sch.name = 'dbo'
        AND seq.name = 'CH7Sequence')
  DROP SEQUENCE dbo.CH7Sequence;
 
CREATE SEQUENCE dbo.CH7Sequence AS INTEGER START WITH 1;
 
DECLARE @ClassRank TABLE
        (
        StudentID TINYINT,
        Grade TINYINT,
        SeqNbr INTEGER
        );
INSERT INTO @ClassRank (StudentId, Grade, SeqNbr)
SELECT StudentId,
       Grade,
       NEXT VALUE FOR dbo.CH7Sequence OVER (ORDER BY Grade ASC)
FROM (VALUES (1, 100),
             (2, 95),
             (3, 85),
             (4, 100),
             (5, 99),
             (6, 98),
             (7, 95),
             (8, 90),
             (9, 89),
             (10, 89),
             (11, 85),
             (12, 82)) dt(StudentId, Grade);
 
SELECT StudentId, Grade, SeqNbr
FROM   @ClassRank;

This query returns the following result set:

StudentID	Grade	SeqNbr
---------	-----	------
12	82	1
3	85	2
11	85	3
10	89	4
9	89	5
8	90	6
7	95	7
2	95	8
6	98	9
5	99	10
1	100	11
4	100	12

How It Works

The optional OVER clause of the NEXT VALUE FOR function is utilized to specify the order that the sequence should be applied. The syntax is as follows:

NEXT VALUE FOR [ database_name . ] [ schema_name . ] sequence_name
  [ OVER (<over_order_by_clause>) ]

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 7: Aggregations and Windowing

Create new playlist

Sign In

Sign Up

Table of Contents for
Chapter 7: Aggregations and Windowing