Patterns and Anti-Patterns
In short, no pattern is an isolated entity. Each pattern can exist in the world only to the extent that is supported by other patterns: the larger patterns in which it is embedded, the patterns of the same size that surround it, and the smaller patterns which are embedded in it.
—Christopher Alexander, architect and design theorist
There is an old saying that you shouldn’t try to reinvent the wheel, and honestly, in essence it is a very good saying. But with all such sayings, a modicum of common sense is required for its application. If everyone down through history took the saying literally, your car would have wheels made out of the trunk of a tree (which the MythBusters proved you could do in their “Good Wood” episode), since that clearly could have been one of the first wheel-like machines that was used. If everyone down through history had said “that’s good enough,” driving to Walley World in the family truckster would be a far less comfortable experience.
Over time, however, the basic concept of a wheel has been intact, from rock wheel, to wagon wheel, to steel-belted radials, and even a wheel of cheddar. Each of these is round and able to move by rolling from place A to place B. Each solution follows that common pattern but diverges to solve a particular problem. The goal of a software programmer should be to first try understanding existing techniques and then either use or improve them. Solving the same problem over and over without any knowledge of the past is nuts.
One of the neat things about software design, certainly database design, is that there are base patterns, such as normalization, that we will build upon, but there are additional patterns that are built up starting with normalized tables. That is what this chapter is about, taking the basic structures we have built so far, and taking more and more complex groupings of structures, we will produce more complex, interesting solutions to problems.
Of course, in as much as there are positive patterns that work, there are also negative patterns that have failed over and over down through history. Take personal flight. For many, many years, truly intelligent people tried over and over to strap wings on their arms and fly. They were close in concept, but just doing the same thing over and over was truly folly. Once it was understood how to apply Bernoulli’s principle to building wings and what it would truly take to fly, the Wright brothers applied these principals to produce the first manned flying machine. If you ever happen by Kitty Hawk, NC, you can see the plane and location of that flight. Not an amazing amount has changed between that airplane and today’s airplanes in basic principle. They weren’t required to have their entire body scanned and patted down for that first flight, but the wings worked the same way.
Throughout this book so far, we have covered the basic implementation tools that you can use to assemble solutions that meet your real-world needs. In this chapter, I am going to extend this notion and present a few deeper examples where we assemble a part of a database that deals with common problems that show up in almost any database solution. The chapter is broken up into two major sections, starting with the larger topic, patterns that are desirable to use. The second major section discusses anti-patterns, or patterns that you may frequently see that are not desirable to use (along with the preferred method of solution, naturally).
Desirable Patterns
In this section, I am going to cover a variety of implementation patterns that can be used to solve very common problems that you will frequently encounter. By no means should this be confused with a comprehensive list of the types of problems you may face; think of it instead as a sampling of methods of solving some very common problems.
The patterns and solutions that I will present in the following subsections are as follows:
Note I am always looking for other patterns that can solve common issues and enhance your designs (as well as mine). On my web site (www.drsql.org), I may make additional entries available over time, and please leave me comments if you have ideas for more.
If you have been reading this book straight through, you likely have gotten the point that uniqueness is a major concern for your design. The fact is, uniqueness is one of the largest problems you will tackle when designing a database, because telling two rows apart from one another can be a very difficult task in some cases.
In this section, we will explore how you can implement different types of uniqueness issues that hit at the heart of the common problems you will come across:
Uniqueness is one of the biggest struggles in day-to-day operations, particularly in running a company, as it is essential to not offend customers, nor ship them 100 orders of Legos when they actually only placed a single order. We need to make sure that we don’t end up with ten employees with the same SSN (and a visit from the tax man), far fewer cans of corn than we expected, ten appointments at the same time, and so on.
We previously discussed PRIMARY KEY and UNIQUE constraints, but neither of these will fit the scenario where you need to make sure some subset of the data, rather than every row, is unique. For example, say you have an employee table, and each employee can possibly have an insurance policy. The policy numbers must be unique, but the user might not have a policy.
There are three solutions to this problem that are common:
As a demonstration, I will create a schema and table for the human resources employee table with a column for employee number and a column for insurance policy number as well. I will use a database named Chapter8 with default settings for the examples unless otherwise noted.
CREATE SCHEMA HumanResources;
GO
CREATE TABLE HumanResources.Employee
(
EmployeeId int IDENTITY(1,1) CONSTRAINT PKEmployee primary key,
EmployeeNumber char(5) NOT NULL
CONSTRAINT AKEmployee_EmployeeNummer UNIQUE,
--skipping other columns you would likely have
InsurancePolicyNumber char(10) NULL
);
Filtered indexes are useful for performance-tuning situations where only a few values are selective, but they also are useful for eliminating values for data protection. Everything about the index is the same as a normal index (indexes will be covered in greater detail in Chapter 10) save for the WHERE clause. So, you add an index like this:
--Filtered Alternate Key (AKF)
CREATE UNIQUE INDEX AKFEmployee_InsurancePolicyNumber ON
HumanResources.Employee(InsurancePolicyNumber)
WHERE InsurancePolicyNumber IS NOT NULL;
Then, create an initial sample row:
INSERT INTO HumanResources.Employee (EmployeeNumber, InsurancePolicyNumber)
VALUES (’A0001’,’1111111111’);
If you attempt to give another employee the same InsurancePolicyNumber
INSERT INTO HumanResources.Employee (EmployeeNumber, InsurancePolicyNumber)
VALUES (’A0002’,’1111111111’);
this fails:
Msg 2601, Level 14, State 1, Line 29
Cannot insert duplicate key row in object ’HumanResources.employee’ with unique index ’AKFEmployee_InsurancePolicyNumber’. The duplicate key value is (1111111111).
Adding the row with the corrected value will succeed:
INSERT INTO HumanResources.Employee (EmployeeNumber, InsurancePolicyNumber)
VALUES (’A0002’,’2222222222’);
However, adding two rows with NULL will work fine:
INSERT INTO HumanResources.Employee (EmployeeNumber, InsurancePolicyNumber)
VALUES (’A0003’,’3333333333’),
(’A0004’,NULL),
(’A0005’,NULL);
You can see that this
SELECT *
FROM HumanResources.Employee;
returns the following:
EmployeeId EmployeeNumber InsurancePolicyNumber
---------------- ------------------------ --------------------------------
1 A0001 1111111111
3 A0002 2222222222
4 A0003 3333333333
5 A0004 NULL
5 A0005 NULL
The NULL example is the classic example, because it is common to desire this functionality. However, this technique can be used for more than just NULL exclusion. As another example, consider the case where you want to ensure that only a single row is set as primary for a group of rows, such as a primary contact for an account:
CREATE SCHEMA Account;
GO
CREATE TABLE Account.Contact
(
ContactId varchar(10) NOT NULL,
AccountNumber char(5) NOT NULL, --would be FK in full example
PrimaryContactFlag bit NOT NULL,
CONSTRAINT PKContact PRIMARY KEY(ContactId, AccountNumber)
);
Again, create an index, but this time, choose only those rows with PrimaryContactFlag = 1. The other values in the table could have as many other values as you want (of course, in this case, since it is a bit, the values could be only 0 or 1).
CREATE UNIQUE INDEX AKFContact_PrimaryContact
ON Account.Contact(AccountNumber) WHERE PrimaryContactFlag = 1;
If you try to insert two rows that are primary, as in the following statements that will set both contacts ’fred’ and ’bob’ as the primary contact for the account with account number ’11111’:
INSERT INTO Account.Contact
VALUES (’bob’,’11111’,1);
GO
INSERT INTO Account.Contact
VALUES (’fred’,’11111’,1);
the following error is returned after the second insert:
Msg 2601, Level 14, State 1, Line 73
Cannot insert duplicate key row in object ’Account.Contact’ with unique index ’AKFContact_PrimaryContact’. The duplicate key value is (11111).
To insert the row with ’fred’ as the name and set it as primary (assuming the ’bob’ row was inserted previously), you will need to update the other row to be not primary and then insert the new primary row:
BEGIN TRANSACTION;
UPDATE Account.Contact
SET PrimaryContactFlag = 0
WHERE accountNumber = ’11111’;
INSERT Account.Contact
VALUES (’fred’,’11111’, 1);
COMMIT TRANSACTION;
Note that in cases like this you would definitely want to use a transaction and error handling in your code so you don’t end up without a primary contact if the INSERT operation fails for some other reason.
Prior to SQL Server 2008, where there were no filtered indexes, the preferred method of implementing this was to create an indexed view with a unique clustered indexes. There are a couple of other ways to do this (such as in a trigger or stored procedure using an EXISTS query, or even using a user-defined function in a CHECK constraint), but the indexed view is the easiest if you cannot use a filtered index (though as of this writing, it is available in all supported versions of SQL Server.
A side effect of the filtered index is that it (like the uniqueness constraints we have used previously) has a very good chance of being useful for searches against the table. The only downside is that the error comes from an index rather than a constraint, so it does not fit into our existing paradigms for error handling.
Bulk Uniqueness
Sometimes, we need to inventory items where some of the items are equivalent in the physical world, for example, cans of corn in the grocery store. Generally, you can’t even tell the cans apart by looking at them (unless they have different expiration dates, perhaps), but knowing how many are in stock is a very common need. Implementing a solution that has a row for every canned good in a corner market would require a very large database even for a very small store. This would be really quite complicated and would require a heck of a lot of rows and data manipulation. It would, in fact, make some queries easier, but it would make data storage a lot more difficult.
Instead of having one row for each individual item, you can implement a row per type of item. This type would be used to store inventory and utilization, which would then be balanced against one another. Figure 8-1 shows a very simplified model of such activity.
Figure 8-1. Simplified inventory model
In the InventoryAdjustment table, you would record shipments coming in, items stolen, changes to inventory after taking inventory (could be more or less, depending on the quality of the data you had), and so forth, and in the ProductSale table (probably a child to sale header or perhaps invoicing table in a complete model), you would record when product is removed or added to inventory from a customer interaction.
The sum of the InventoryAdjustment Quantity value less the sum of ProductSale Quantity value should tell you the amount of product on hand (or perhaps the amount of product you have oversold and need to order posthaste!) In the more realistic case, you would have a lot of complexity for backorders, future orders, returns, and so on, but the base concept is basically the same. Instead of each row representing a single item, each represents a handful of items.
The following miniature design is an example I charge students with when I give my day-long seminar on database design. It is referencing a collection of toys, many of which are exactly alike:
Example pieces and sets are shown in Figure 8-2.
Figure 8-2. Sample Lego parts for a database
To solve this problem, I will create a table for each set of Legos I own (which I will call Build, since “set” is a bad word for a SQL name, and “build” actually is better anyhow to encompass a personal creation):
CREATE SCHEMA Lego;
GO
CREATE TABLE Lego.Build
(
BuildId int CONSTRAINT PKBuild PRIMARY KEY,
Name varchar(30) NOT NULL CONSTRAINT AKBuild_Name UNIQUE,
LegoCode varchar(5) NULL, --five character set number
InstructionsURL varchar(255) NULL --where you can get the PDF of the instructions
);
Then, I’ll add a table for each individual instance of that build, which I will call BuildInstance:
CREATE TABLE Lego.BuildInstance
(
BuildInstanceId Int CONSTRAINT PKBuildInstance PRIMARY KEY ,
BuildId Int CONSTRAINT FKBuildInstance$isAVersionOf$LegoBuild
REFERENCES Lego.Build (BuildId),
BuildInstanceName varchar(30) NOT NULL, --brief description of item
Notes varchar(1000) NULL, --longform notes. These could describe modifications
--for the instance of the model
CONSTRAINT AKBuildInstance UNIQUE(BuildId, BuildInstanceName)
);
The next task is to create a table for each individual piece type. I used the term “piece” as a generic version of the different sorts of pieces you can get for Legos, including the different accessories:
CREATE TABLE Lego.Piece
(
PieceId int CONSTRAINT PKPiece PRIMARY KEY,
Type varchar(15) NOT NULL,
Name varchar(30) NOT NULL,
Color varchar(20) NULL,
Width int NULL,
Length int NULL,
Height int NULL,
LegoInventoryNumber int NULL,
OwnedCount int NOT NULL,
CONSTRAINT AKPiece_Definition UNIQUE (Type,Name,Color,Width,Length,Height),
CONSTRAINT AKPiece_LegoInventoryNumber UNIQUE (LegoInventoryNumber)
);
Note that I implement the owned count as an attribute of the piece and not as a multivalued attribute to denote inventory change events. In a fully fleshed-out sales model, this might not be sufficient, but for a personal inventory, it would be a reasonable solution. The likely use here will be to update the value as new pieces are added to inventory and possibly to count up loose pieces later and add that value to the ones in sets (which we will have a query for later).
Next, I will implement the table to allocate pieces to different builds:
CREATE TABLE Lego.BuildInstancePiece
(
BuildInstanceId int NOT NULL,
PieceId int NOT NULL,
AssignedCount int NOT NULL,
CONSTRAINT PKBuildInstancePiece PRIMARY KEY (BuildInstanceId, PieceId)
);
From here, I can load some data. I will load a true Lego item that is available for sale and that I have often given away during presentations. It is a small, black, one-seat car with a little guy in a sweatshirt.
INSERT Lego.Build (BuildId, Name, LegoCode, InstructionsURL)
VALUES (1,’Small Car’,’3177’,
’http://cache.lego.com/bigdownloads/buildinginstructions/4584500.pdf’);
I will create one instance for this, as I personally have only one in my collection (plus some boxed ones to give away):
INSERT Lego.BuildInstance (BuildInstanceId, BuildId, BuildInstanceName, Notes)
VALUES (1,1,’Small Car for Book’,NULL);
Then, I load the table with the different pieces in my collection, in this case, the types of pieces included in the set, plus some extras thrown in. (Note that in a fully fleshed-out design, some of these values would have domains enforced, as well as validations to enforce the types of items that have height, width, and/or lengths. This detail is omitted partially for simplicity, and partially because it might just be too much to implement for a system such as this, based on user needs—though mostly for simplicity of demonstrating the underlying principal of bulk uniqueness in the most compact possible manner.)
INSERT Lego.Piece (PieceId, Type, Name, Color, Width, Length, Height,
LegoInventoryNumber, OwnedCount)
VALUES (1, ’Brick’,’Basic Brick’,’White’,1,3,1,’362201’,20),
(2, ’Slope’,’Slope’,’White’,1,1,1,’4504369’,2),
(3, ’Tile’,’Groved Tile’,’White’,1,2,NULL,’306901’,10),
(4, ’Plate’,’Plate’,’White’,2,2,NULL,’302201’,20),
(5, ’Plate’,’Plate’,’White’,1,4,NULL,’371001’,10),
(6, ’Plate’,’Plate’,’White’,2,4,NULL,’302001’,1),
(7, ’Bracket’,’1x2 Bracket with 2x2’,’White’,2,1,2,’4277926’,2),
(8, ’Mudguard’,’Vehicle Mudguard’,’White’,2,4,NULL,’4289272’,1),
(9, ’Door’,’Right Door’,’White’,1,3,1,’4537987’,1),
(10,’Door’,’Left Door’,’White’,1,3,1,’45376377’,1),
(11,’Panel’,’Panel’,’White’,1,2,1,’486501’,1),
(12,’Minifig Part’,’Minifig Torso , Sweatshirt’,’White’,NULL,NULL,
NULL,’4570026’,1),
(13,’Steering Wheel’,’Steering Wheel’,’Blue’,1,2,NULL,’9566’,1),
(14,’Minifig Part’,’Minifig Head, Male Brown Eyes’,’Yellow’,NULL, NULL,
NULL,’4570043’,1),
(15,’Slope’,’Slope’,’Black’,2,1,2,’4515373’,2),
(16,’Mudguard’,’Vehicle Mudgard’,’Black’,2,4,NULL,’4195378’,1),
(17,’Tire’,’Vehicle Tire,Smooth’,’Black’,NULL,NULL,NULL,’4508215’,4),
(18,’Vehicle Base’,’Vehicle Base’,’Black’,4,7,2,’244126’,1),
(19,’Wedge’,’Wedge (Vehicle Roof)’,’Black’,1,4,4,’4191191’,1),
(20,’Plate’,’Plate’,’Lime Green’,1,2,NULL,’302328’,4),
(21,’Minifig Part’,’Minifig Legs’,’Lime Green’,NULL,NULL,NULL,’74040’,1),
(22,’Round Plate’,’Round Plate’,’Clear’,1,1,NULL,’3005740’,2),
(23,’Plate’,’Plate’,’Transparent Red’,1,2,NULL,’4201019’,1),
(24,’Briefcase’,’Briefcase’,’Reddish Brown’,NULL,NULL,NULL,’4211235’, 1),
(25,’Wheel’,’Wheel’,’Light Bluish Gray’,NULL,NULL,NULL,’4211765’,4),
(26,’Tile’,’Grilled Tile’,’Dark Bluish Gray’,1,2,NULL,’4210631’, 1),
(27,’Minifig Part’,’Brown Minifig Hair’,’Dark Brown’,NULL,NULL,NULL,
’4535553’, 1),
(28,’Windshield’,’Windshield’,’Transparent Black’,3,4,1,’4496442’,1),
--and a few extra pieces to make the queries more interesting
(29,’Baseplate’,’Baseplate’,’Green’,16,24,NULL,’3334’,4),
(30,’Brick’,’Basic Brick’,’White’,4,6,NULL,’2356’,10);
Next, I will assign the 43 pieces that make up the first set (with the most important part of this statement being to show you how cool the row constructor syntax is that was introduced in SQL Server 2008—this would have taken over 20 lines previously):
INSERT INTO Lego.BuildInstancePiece (BuildInstanceId, PieceId, AssignedCount)
VALUES (1,1,2),(1,2,2),(1,3,1),(1,4,2),(1,5,1),(1,6,1),(1,7,2),(1,8,1),(1,9,1),
(1,10,1),(1,11,1),(1,12,1),(1,13,1),(1,14,1),(1,15,2),(1,16,1),(1,17,4),
(1,18,1),(1,19,1),(1,20,4),(1,21,1),(1,22,2),(1,23,1),(1,24,1),(1,25,4),
(1,26,1),(1,27,1),(1,28,1);
Next, I will set up two other minimal builds to make the queries more interesting:
INSERT Lego.Build (BuildId, Name, LegoCode, InstructionsURL)
VALUES (2,’Brick Triangle’,NULL,NULL);
GO
INSERT Lego.BuildInstance (BuildInstanceId, BuildId, BuildInstanceName, Notes)
VALUES (2,2,’Brick Triangle For Book’,’Simple build with 3 white bricks’);
GO
INSERT INTO Lego.BuildInstancePiece (BuildInstanceId, PieceId, AssignedCount)
VALUES (2,1,3);
GO
INSERT Lego.BuildInstance (BuildInstanceId, BuildId, BuildInstanceName, Notes)
VALUES (3,2,’Brick Triangle For Book2’,’Simple build with 3 white bricks’);
GO
INSERT INTO Lego.BuildInstancePiece (BuildInstanceId, PieceId, AssignedCount)
VALUES (3,1,3);
After the mundane (and quite tedious when done all at once) business of setting up the data is passed, we can count the types of pieces we have in our inventory, and the total number of pieces we have using a query such as this:
SELECT COUNT(*) AS PieceCount, SUM(OwnedCount) AS InventoryCount
FROM Lego.Piece;
This query returns the following, with the first column giving us the different types:
PieceCount InventoryCount
----------- --------------
30 111
Here, you start to get a feel for how this is going to be a different sort of solution than the basic relational inventory solution. Instinctively, one expects that a single row represents one thing, but here, you see that, on average, each row represents four different pieces. Following this train of thought, we can group based on the generic type of piece using a query such as
SELECT Type, COUNT(*) AS TypeCount, SUM(OwnedCount) AS InventoryCount
FROM Lego.Piece
GROUP BY Type;
In these results, you can see that we have 2 types of brick but 30 bricks in inventory, 1 type of baseplate but 4 of them in inventory, and so on:
Type TypeCount InventoryCount
--------------- ----------- --------------
Baseplate 1 4
Bracket 1 2
Brick 2 30
Briefcase 1 1
Door 2 2
Minifig Part 4 4
Mudguard 2 2
Panel 1 1
Plate 5 36
Round Plate 1 2
Slope 2 4
Steering Wheel 1 1
Tile 2 11
Tire 1 4
Vehicle Base 1 1
Wedge 1 1
Wheel 1 4
Windshield 1 1
The biggest concern with this method is that users have to know the difference between a row and an instance of the thing the row is modeling. And it gets more interesting where the cardinality of the type is very close to the number of physical items on hand. With 30 types of item and only 111 actual pieces, users querying may not immediately see that they are getting a wrong count. In a system with 20 different products and 1 million pieces of inventory, it will be a lot more obvious.
In the next two queries, I will expand into actual interesting queries that you will likely want to use. First, I will look for pieces that are assigned to a given set, in this case, the small car model that we started with. To do this, we will just join the tables, starting with Build and moving on to BuildInstance, BuildInstancePiece, and Piece. All of these joins are inner joins, since we want items that are included in the set. I use grouping sets (another wonderful feature that comes in handy to give us a very specific set of aggregates—in this case, using the () notation to give us a total count of all pieces).
SELECT CASE WHEN GROUPING(Piece.Type) = 1 THEN ’--Total--’ ELSE Piece.Type END AS PieceType,
Piece.Color,Piece.Height, Piece.Width, Piece.Length,
SUM(BuildInstancePiece.AssignedCount) AS ASsignedCount
FROM Lego.Build
JOIN Lego.BuildInstance
ON Build.BuildId = BuildInstance.BuildId
JOIN Lego.BuildInstancePiece
ON BuildInstance.BuildInstanceId =
BuildInstancePiece.BuildInstanceId
JOIN Lego.Piece
ON BuildInstancePiece.PieceId = Piece.PieceId
WHERE Build.Name = ’Small Car’
AND BuildInstanceName = ’Small Car for Book’
GROUP BY GROUPING SETS((Piece.Type,Piece.Color, Piece.Height, Piece.Width, Piece.Length),
());
This returns the following, where you can see that 43 pieces go into this set:
PieceType Color Height Width Length AssignedCount
--------------- -------------------- ----------- ----------- ----------- -------------
Bracket White 2 2 1 2
Brick White 1 1 3 2
Briefcase Reddish Brown NULL NULL NULL 1
Door White 1 1 3 2
Minifig Part Dark Brown NULL NULL NULL 1
Minifig Part Lime Green NULL NULL NULL 1
Minifig Part White NULL NULL NULL 1
Minifig Part Yellow NULL NULL NULL 1
Mudguard Black NULL 2 4 1
Mudguard White NULL 2 4 1
Panel White 1 1 2 1
Plate Lime Green NULL 1 2 4
Plate Transparent Red NULL 1 2 1
Plate White NULL 1 4 1
Plate White NULL 2 2 2
Plate White NULL 2 4 1
Round Plate Clear NULL 1 1 2
Slope Black 2 2 1 2
Slope White 1 1 1 2
Steering Wheel Blue NULL 1 2 1
Tile Dark Bluish Gray NULL 1 2 1
Tile White NULL 1 2 1
Tire Black NULL NULL NULL 4
Vehicle Base Black 2 4 7 1
Wedge Black 4 1 4 1
Wheel Light Bluish Gray NULL NULL NULL 4
Windshield Transparent Black 1 3 4 1
--Total-- NULL NULL NULL NULL 43
The final query in this section is the more interesting one. A very common question would be, “How many pieces of a given type do I own that are not assigned to a set?” For this, I will use a common table expression (CTE) that gives me a sum of the pieces that have been assigned to a BuildInstance and then use that set to join to the Piece table:
;WITH AssignedPieceCount
AS (
SELECT PieceId, SUM(AssignedCount) AS TotalAssignedCount
FROM Lego.BuildInstancePiece
GROUP BY PieceId )
SELECT Type, Name, Width, Length,Height,
Piece.OwnedCount - Coalesce(TotalAssignedCount,0) AS AvailableCount
FROM Lego.Piece
LEFT OUTER JOIN AssignedPieceCount
on Piece.PieceId = AssignedPieceCount.PieceId
WHERE Piece.OwnedCount - Coalesce(TotalAssignedCount,0) > 0;
Because the cardinality of the AssignedPieceCount to the Piece table is zero or one to one, we can simply do an outer join and subtract the number of pieces we have assigned to sets from the amount owned. This returns
Type Name Width Length Height AvailableCount
---------- ------------- -------- -------- ------- --------------
Brick Basic Brick 1 3 1 12
Tile Groved Tile 1 2 NULL 9
Plate Plate 2 2 NULL 18
Plate Plate 1 4 NULL 9
Baseplate Baseplate 16 24 NULL 4
Brick Basic Brick 4 6 NULL 10
You can expand this basic pattern to almost any bulk uniqueness situation you may have. The calculation of how much inventory you have may be more complex and might include inventory values that are stored daily to avoid massive recalculations (think about how your bank account balance is set at the end of the day, and then daily transactions are added/subtracted as they occur until they too are posted and fixed in a daily balance).
In some cases, uniqueness isn’t uniqueness on the values of a single column set, but rather over the values between values. Very common examples of this include appointment times, college classes, or even teachers/employees who can only be assigned to one location at a time.
For example, consider an appointment time. It has a start and an end, and the start and end ranges of two appointments should not overlap. Suppose we have an appointment with start and end times defined with precision to the second, starting at ’20160712 1:00:00PM’ and ending at ’20160712 1:59:59PM’. To validate that this data does not overlap other appointments, we need to look for rows where any of the following conditions are met, indicating we are double booking appointment times:
We can protect against situations such as overlapping appointment times by employing a trigger and a query that checks for range overlapping. If the aforementioned conditions are not met, the new row is acceptable. We will implement a simplistic example of assigning a doctor to an office. Clearly, other parameters need to be considered, like office space, assistants, and so on, but I don’t want this section to be larger than the allotment of pages for the entire book. First, we create a table for the doctor and another to set appointments for the doctor:
CREATE SCHEMA Office;
GO
CREATE TABLE Office.Doctor
(
DoctorId int NOT NULL CONSTRAINT PKDoctor PRIMARY KEY,
DoctorNumber char(5) NOT NULL CONSTRAINT AKDoctor_DoctorNumber UNIQUE
);
CREATE TABLE Office.Appointment
(
AppointmentId int NOT NULL CONSTRAINT PKAppointment PRIMARY KEY,
--real situation would include room, patient, etc,
DoctorId int NOT NULL,
StartTime datetime2(0), --precision to the second
EndTime datetime2(0),
CONSTRAINT AKAppointment_DoctorStartTime UNIQUE (DoctorId,StartTime),
CONSTRAINT AKAppointment_DoctorEndTime UNIQUE (DoctorId,EndTime),
CONSTRAINT CHKAppointment_StartBeforeEnd CHECK (StartTime <= EndTime),
CONSTRAINT FKDoctor$IsAssignedTo$OfficeAppointment FOREIGN KEY (DoctorId)
REFERENCES Office.Doctor (DoctorId)
);
Next, we will add some data to our new table. The AppointmentId value 5 will include a bad date range that overlaps another row for demonstration purposes:
INSERT INTO Office.Doctor (DoctorId, DoctorNumber)
VALUES (1,’00001’),(2,’00002’);
INSERT INTO Office.Appointment
VALUES (1,1,’20160712 14:00’,’20160712 14:59:59’),
(2,1,’20160712 15:00’,’20160712 16:59:59’),
(3,2,’20160712 8:00’,’20160712 11:59:59’),
(4,2,’20160712 13:00’,’20160712 17:59:59’),
(5,2,’20160712 14:00’,’20160712 14:59:59’); --offensive item for demo, conflicts
--with 4
As far as the declarative constraints can tell, everything is okay, but the following query will check for data conditions between each row in the table to every other row in the table:
SELECT Appointment.AppointmentId,
Acheck.AppointmentId AS ConflictingAppointmentId
FROM Office.Appointment
JOIN Office.Appointment AS ACheck
ON Appointment.DoctorId = ACheck.DoctorId
/*1*/ and Appointment.AppointmentId <> ACheck.AppointmentId
/*2*/ and (Appointment.StartTime BETWEEN ACheck.StartTime and ACheck.EndTime
/*3*/ or Appointment.EndTime BETWEEN ACheck.StartTime and ACheck.EndTime
/*4*/ or (Appointment.StartTime < ACheck.StartTime
and Appointment.EndTime > ACheck.EndTime));
In this query, I have highlighted four points:
Running the query, we see that
AppointmentId ConflictingAppointmentId
------------- ------------------------
5 4
4 5
The interesting part of these results is that where there is one offending row, there will always be another. If one row is offending in one way, like starting before and ending after another appointment, the conflicting row will have a start and end time between the first appointment’s time. This won’t be a problem, but the shared blame makes the results more interesting to deal with.
Next, we remove the bad row for now:
DELETE FROM Office.Appointment WHERE AppointmentId = 5;
We will now implement a trigger (using the template as defined in Appendix B) that will check for this condition based on the values in new rows being inserted or updated. There’s no need to check deleted rows (even for the update case), because all a delete operation can do is help the situation (even in the case of the update, where you may move an appointment from one doctor to another).
Note that the basis of this trigger is the query we used previously to check for bad values (I usually implement this as two triggers, one for insert and another for update, both having the same code, but it is shown here as one for simplicity of demonstration):
CREATE TRIGGER Office.Appointment$insertAndUpdate
ON Office.Appointment
AFTER UPDATE, INSERT AS
BEGIN
SET NOCOUNT ON;
SET ROWCOUNT 0; --in case the client has modified the rowcount
--use inserted for insert or update trigger, deleted for update or delete trigger
--count instead of @@rowcount due to merge behavior that sets @@rowcount to a number
--that is equal to number of merged rows, not rows being checked in trigger
DECLARE @msg varchar(2000), --used to hold the error message
--use inserted for insert or update trigger, deleted for update or delete trigger
--count instead of @@rowcount due to merge behavior that sets @@rowcount to a number
--that is equal to number of merged rows, not rows being checked in trigger
@rowsAffected int = (SELECT COUNT(*) FROM inserted);
-- @rowsAffected int = (SELECT COUNT(*) FROM deleted);
--no need to continue on if no rows affected
IF @rowsAffected = 0 RETURN;
BEGIN TRY
--[validation section]
--if this is an update, but they don’t change times or doctor, don’t check the data
IF UPDATE(startTime) OR UPDATE(endTime) OR UPDATE(doctorId)
BEGIN
IF EXISTS ( SELECT *
FROM Office.Appointment
JOIN Office.Appointment AS ACheck
ON Appointment.doctorId = ACheck.doctorId
AND Appointment.AppointmentId <> ACheck.AppointmentId
AND (Appointment.StartTime BETWEEN Acheck.StartTime
AND Acheck.EndTime
OR Appointment.EndTime BETWEEN Acheck.StartTime
AND Acheck.EndTime
OR (Appointment.StartTime < Acheck.StartTime
and Appointment.EndTime > Acheck.EndTime))
WHERE EXISTS (SELECT *
FROM inserted
WHERE inserted.DoctorId = Acheck.DoctorId))
BEGIN
IF @rowsAffected = 1
SELECT @msg = ’Appointment for doctor ’ + doctorNumber +
’ overlapped existing appointment’
FROM inserted
JOIN Office.Doctor
ON inserted.DoctorId = Doctor.DoctorId;
ELSE
SELECT @msg = ’One of the rows caused an overlapping ’ +
’appointment time for a doctor’;
THROW 50000,@msg,16;
END;
END;
--[modification section]
END TRY
BEGIN CATCH
IF @@TRANCOUNT > 0
ROLLBACK TRANSACTION;
THROW; --will halt the batch or be caught by the caller’s catch block
END CATCH;
END;
Next, as a refresher, check out the data that is in the table:
SELECT *
FROM Office.Appointment;
This returns (or at least it should, assuming you haven’t deleted or added extra data)
appointmentId doctorId startTime endTime
------------- ----------- ---------------------- ----------------------
1 1 2016-07-12 14:00:00 2016-07-12 14:59:59
2 1 2016-07-12 15:00:00 2016-07-12 16:59:59
3 2 2016-07-12 08:00:00 2016-07-12 11:59:59
4 2 2016-07-12 13:00:00 2016-07-12 17:59:59
This time, when we try to add an appointment for doctorId number 1:
INSERT INTO Office.Appointment
VALUES (5,1,’20160712 14:00’,’20160712 14:59:59’);
this first attempt is blocked because the row is an exact duplicate of the start time value. The most common error that will likely occur in a system such as this is trying to duplicate something, usually by accident.
Msg 2627, Level 14, State 1, Line 2
Violation of UNIQUE KEY constraint ’AKOfficeAppointment_DoctorStartTime’. Cannot insert duplicate key in object ’Office.Appointment’. The duplicate key value is (1, 2016-07-12 14:00:00).
Next, we check the case where the appointment fits wholly inside of another appointment:
INSERT INTO Office.Appointment
VALUES (5,1,’20160712 14:30’,’20160712 14:40:59’);
This fails and tells us the doctor for whom the failure occurred:
Msg 50000, Level 16, State 16, Procedure appointment$insertAndUpdate, Line 48
Appointment for doctor 00001 overlapped existing appointment
Then, we test for the case where the entire appointment engulfs another appointment:
INSERT INTO Office.Appointment
VALUES (5,1,’20160712 11:30’,’20160712 17:59:59’);
This quite obediently fails, just like the other case:
Msg 50000, Level 16, State 16, Procedure appointment$insertAndUpdate, Line 48
Appointment for doctor 00001 overlapped existing appointment
And, just to drive home the point of always testing your code extensively, you should always test the greater-than-one-row case, and in this case, I included rows for both doctors:
INSERT into Office.Appointment
VALUES (5,1,’20160712 11:30’,’20160712 15:59:59’),
(6,2,’20160713 10:00’,’20160713 10:59:59’);
This time, it fails with our multirow error message:
Msg 50000, Level 16, State 16, Procedure appointment$insertAndUpdate, Line 48
One of the rows caused an overlapping appointment time for a doctor
Finally, add two rows that are safe to add:
INSERT INTO Office.Appointment
VALUES (5,1,’20160712 10:00’,’20160712 11:59:59’),
(6,2,’20160713 10:00’,’20160713 10:59:59’);
This will (finally) work. Now, test failing an update operation:
UPDATE Office.Appointment
SET StartTime = ’20160712 15:30’,
EndTime = ’20160712 15:59:59’
WHERE AppointmentId = 1;
This fails like it should:
Msg 50000, Level 16, State 16, Procedure appointment$insertAndUpdate, Line 38
Appointment for doctor 00001 overlapped existing appointment
If this seems like a lot of work, it is. And in reality, whether or not you actually implement this solution in a trigger is going to be determined by exactly what it is you are doing. However, the techniques of checking range uniqueness can clearly be useful if only to check existing data is correct, because in some cases, what you may want to do is to let data exist in intermediate states that aren’t pristine and then write checks to “certify” that the data is correct before closing out a day.
Instead of blocking the operation, the trigger could update rows to tell the user that it overlaps. Or you might even get more interesting and let the algorithms involve prioritizing certain conditions above other appointments. Maybe a checkup gets bumped for a surgery, marked to reschedule. Realistically, it may be that the user can overlap appointments all they want, and then at the close of business, a query such as the one that formed the basis of the trigger is executed by the administrative assistant, who then clears up any scheduling issues manually. In this section, I’ve shown you a pattern to apply to prevent range overlaps if that is the desired result. It is up to the requirements to lead you to exactly how to implement.
The most difficult case of uniqueness is actually quite common, and it is usually the most critical to get right. It is also a topic far too big to cover with a coded example, because in reality, it is more of a political question than a technical one. For example, if two people call in to your company from the same phone number and say their name is Abraham Lincoln, are they the same person or are they two aliases someone is using? (Or one alias and one real name?) Whether you can call them the same person is a very important decision and one that is based largely on the industry you are in, made especially tricky due to privacy laws (if you give one person who claims to be Abraham Lincoln the data of the real Abraham Lincoln, well, that just isn’t going to be good no matter what your privacy policy is or which laws that govern privacy apply). I don’t talk much about privacy laws in this book, mostly because that subject is very messy, but also because dealing with privacy concerns is
The principles of privacy are part of what makes the process of identification so difficult. At one time, companies would just ask for a customer’s Social Security number and use that as identification in a very trusting manner. Of course, no sooner does some value become used widely by lots of organizations than it begins to be abused. (Chapter 9 will expand a little bit on this topic as we talk about encryption technologies, but encryption is another wide topic for which the best advice is to make sure you are doing as much or more as is required.)
So the goal of your design is to work at getting your customer to use an identifier to help you distinguish them from another customer. This customer identifier will be used, for example, as a login to the corporate web site, for the convenience card that is being used by so many businesses, and also likely on any correspondence. The problem is how to gather this information. When a person calls a bank or doctor, the staff member answering the call always asks some random questions to better identify the caller. For many companies, it is impossible to force the person to give information, so it is not always possible to force customers to uniquely identify themselves. You can entice them to identify themselves, such as by issuing a customer savings card, or you can just guess from bits of information that can be gathered from a web browser, telephone number, and so on. Even worse, what if someone signs up twice for a customer number? Can you be sure that it is the same person, then?
So the goal becomes to match people to the often-limited information they are willing to provide. Generally speaking, you can try to gather as much information as possible from people, such as
And so on. Then, depending on the industry, you determine levels of matching that work for you. Lots of methods and tools are available to you, from standardization of data to make direct matching possible, fuzzy matching, and even third-party tools that will help you with the matches. The key, of course, is that if you are going to send a message alerting of a sale to repeat customers, only a slight bit of a match might be necessary, but if you are sending personal information, like how much money they have spent, a very deterministic match ought to be done. Identification of multiple customers in your database that are actually the same customer is the holy grail of marketing, but it is achievable given you respect your customer’s privacy and use their data in a safe manner.
Data-Driven Design
One of the worst practices I see some programmers get in the habit of doing is programming using specific values, to force a specific action. For example, they will get requirements that specify that for customers 1 and 2, we need to do action A, and for customer 3, we need to do action B. So they go in and code the following:
IF @customerId in (’1’, ’2’)
Do ActionA(@customerId);
ELSE IF @customerId in (’3’)
Do ActionB(@customerId);
It works, so they breathe a sigh of relief and move on. But the next day, they get a request that customer 4 should be treated in the same manner as customer 3. They don’t have time to do this request immediately because it requires a code change, which requires testing. So a month later, they add ’4’ to the code, test it, deploy it, and claim it required 40 hours of IT time.
This is clearly not optimal, so the next best thing is to determine why we are doing ActionA or ActionB. We might determine that for CustomerType: ’Great’, we do ActionA, but for ’Good’, we do ActionB. So you could code
IF @customerType = ’Great’
Do ActionA(@customerId);
ELSE IF @customerType = ’Good’
Do ActionB(@customerId);
Now adding another customer to these groups is a fairly simple case. You set the customerType column to Great or Good, and one of these actions occurs in you code automatically. But (as you might hear on any infomercial) you can do better! The shortcoming in this design is now how do you change the treatment of good customers if you want to have them do ActionA temporarily? In some cases, the answer is to add to the definition of the customerType table and add a column to indicate what action to take. So you might code:
--In real table, expand ActionType to be a more descriptive value or a domain of its own
CREATE SCHEMA Customers;
GO
CREATE TABLE Customers.CustomerType
(
CustomerType varchar(20) NOT NULL CONSTRAINT PKCustomerType PRIMARY KEY,
Description varchar(1000) NOT NULL,
ActionType char(1) NOT NULL CONSTRAINT CHKCustomerType_ActionType_Domain
CHECK (ActionType in (’A’,’B’))
);
Now, the treatment of this CustomerType can be set at any time to whatever the user decides. The only time you may need to change code (requiring testing, downtime, etc.) is if you need to change what an action means or add a new one. Adding different types of customers, or even changing existing ones, would be a nonbreaking change, so no testing is required.
The basic goal should be that the structure of data should represent the requirements, so rules are enforced by varying data, not by having to hard-code special cases. Flexibility at the code level is ultra important, particularly to your support staff. In the end, the goal of a design should be that changing configuration should not require code changes, so create attributes that will allow you to configure your data and usage.
Note In the code project part of the downloads for this chapter, you will find a coded example of data-driven design that demonstrates these principals in a complete, SQL coded solution.
One pattern that is quite often needed is to be able to see how a row looked at a previous point in time. For example, when did Employee 100001’s salary change? When did Employee 2010032’s insurance start? In some cases, you need to capture changes using a column in a table. For example, take the Employee table we used in the “Selective Uniqueness” section (if you are following along, this table already exists in the database, so you don’t need to re-create it):
CREATE TABLE HumanResources.Employee
(
EmployeeId int IDENTITY(1,1) CONSTRAINT PKEmployee primary key,
EmployeeNumber char(5) NOT NULL
CONSTRAINT AKEmployee_EmployeeNummer UNIQUE,
InsurancePolicyNumber char(10) NULL
);
CREATE UNIQUE INDEX AKFEmployee_InsurancePolicyNumber ON
HumanResources.Employee(InsurancePolicyNumber)
WHERE InsurancePolicyNumber IS NOT NULL;
One method of answering the question of when the insurance number changed would be to add
ALTER TABLE HumanResources.Employee
ADD InsurancePolicyNumberChangeTime datetime2(0);
Possibly with the addition of who changed the row. It could be managed with a trigger, or from the interface code. If you are dealing with a single column that needs to be verified, this is a great way to handle this need. But if you really want to see how all of the columns of a table have changed over time (or a larger subset than one for sure), there are two techniques using T-SQL that are quite popular:
In this section I will demonstrate both methods using the table we have created. Before starting each example, I will use the following code to reset the tables of data:
TRUNCATE TABLE HumanResources.Employee;
INSERT INTO HumanResources.Employee (EmployeeNumber, InsurancePolicyNumber)
VALUES (’A0001’,’1111111111’),
(’A0002’,’2222222222’),
(’A0003’,’3333333333’),
(’A0004’,NULL),
(’A0005’,NULL),
(’A0006’,NULL);
Note that neither version of this pattern is generally a version of auditing. The trigger model can be used more for auditing by adding columns in for who made the change, but there are better auditing tools built in to find people doing incorrect things with your data. This pattern is specifically set up to show changes in data, and it is not beyond reasonability that you might change history to cover up mistakes that were data related. Audits should never be changed or they will quickly be considered unreliable.
Using a Trigger to Capture History
While temporal support is a big new feature of SQL Server 2016, it does not take away the value of using a trigger to capture history. In this section, our goal is simply to see a log of modified rows, or rows that have been deleted from the table. For this exercise, we will start by creating a schema for the history that corresponds to the name of the schema that owns the data. This will allow you to manage security at the schema level for history different than for the base table. So giving the user SELECT rights of a schema to read the data does not give them rights to see the history data.
CREATE SCHEMA HumanResourcesHistory;
Next we will create a parallel history table in the new schema that has all of the columns of the original table, along with a few management columns, which have explanations in the following code:
CREATE TABLE HumanResourcesHistory.Employee
(
--Original columns
EmployeeId int NOT NULL,
EmployeeNumber char(5) NOT NULL,
InsurancePolicyNumber char(10) NULL,
--WHEN the row was modified
RowModificationTime datetime2(7) NOT NULL,
--WHAT type of modification
RowModificationType varchar(10) NOT NULL CONSTRAINT
CHKEmployeeSalary_RowModificationType
CHECK (RowModificationType IN (’UPDATE’,’DELETE’)),
--tiebreaker for seeing order of changes, if rows were modified rapidly
RowSequencerValue bigint IDENTITY(1,1) --use to break ties in RowModificationTime
);
Next, we create the following trigger. The basic flow is to determine the type of operation, then write the contents of the deleted table to the history table we have previously created.
CREATE TRIGGER HumanResources.Employee$HistoryManagementTrigger
ON HumanResources.Employee
AFTER UPDATE, DELETE AS
BEGIN
SET NOCOUNT ON;
SET ROWCOUNT 0; --in case the client has modified the rowcount
--use inserted for insert or update trigger, deleted for update or delete trigger
--count instead of @@rowcount due to merge behavior that sets @@rowcount to a number
--that is equal to number of merged rows, not rows being checked in trigger
DECLARE @msg varchar(2000), --used to hold the error message
@rowsAffected int = (SELECT COUNT(*) FROM deleted);
IF @rowsAffected = 0 RETURN;
DECLARE @RowModificationType char(6);
SET @RowModificationType = CASE WHEN EXISTS (SELECT * FROM inserted) THEN ’UPDATE’
ELSE ’DELETE’ END;
BEGIN TRY
--[validation section]
--[modification section]
--write deleted rows to the history table
INSERT HumanResourcesHistory.Employee(EmployeeId,EmployeeNumber,InsurancePolicyNumber,
RowModificationTime,RowModificationType)
SELECT EmployeeId,EmployeeNumber,InsurancePolicyNumber,
SYSDATETIME(), @RowModificationType
FROM deleted;
END TRY
BEGIN CATCH
IF @@TRANCOUNT > 0
ROLLBACK TRANSACTION;
THROW; --will halt the batch or be caught by the caller’s catch block
END CATCH;
END;
Now let’s make a few changes to the data in the table. As a reminder, here is the data we start with:
SELECT *
FROM HumanResources.Employee;
This shows us our base data:
----------- -------------- ---------------------
1 A0001 1111111111
2 A0002 2222222222
3 A0003 3333333333
4 A0004 NULL
5 A0005 NULL
6 A0006 NULL
Next, update EmployeeId = 4 and set that it has insurance:
UPDATE HumanResources.Employee
SET InsurancePolicyNumber = ’4444444444’
WHERE EmployeeId = 4;
You can see the change and the history here:
SELECT *
FROM HumanResources.Employee
WHERE EmployeeId = 4;
SELECT *
FROM HumanResourcesHistory.Employee
WHERE EmployeeId = 4;
EmployeeId EmployeeNumber InsurancePolicyNumber
----------- -------------- ---------------------
4 A0004 4444444444
EmployeeId EmployeeNumber InsurancePolicyNumber RowModificationTime
----------- -------------- --------------------- ---------------------------
4 A0004 NULL 2016-05-07 20:26:38.4578351
RowModificationType RowSequencerValue
------------------- --------------------
UPDATE 1
Now let’s update all of the rows where there is an insurance policy to a new format, and delete EmployeeId = 6. I updated all rows so we can see in the history what happens when a row is updated and does not actually change:
UPDATE HumanResources.Employee
SET InsurancePolicyNumber = ’IN’ + RIGHT(InsurancePolicyNumber,8);
DELETE HumanResources.Employee
WHERE EmployeeId = 6;
Then check out the data:
SELECT *
FROM HumanResources.Employee
ORDER BY EmployeeId;
--limiting output for formatting purposes
SELECT EmployeeId, InsurancePolicyNumber, RowModificationTime, RowModificationType
FROM HumanResourcesHistory.Employee
ORDER BY EmployeeId,RowModificationTime,RowSequencerValue;
This returns
EmployeeId EmployeeNumber InsurancePolicyNumber
----------- -------------- ---------------------
1 A0001 IN11111111
2 A0002 IN22222222
3 A0003 IN33333333
4 A0004 IN44444444
5 A0005 NULL
EmployeeId InsurancePolicyNumber RowModificationTime RowModificationType
----------- --------------------- --------------------------- -------------------
1 1111111111 2016-05-07 20:27:59.8852810 UPDATE
2 2222222222 2016-05-07 20:27:59.8852810 UPDATE
3 3333333333 2016-05-07 20:27:59.8852810 UPDATE
4 NULL 2016-05-07 20:26:38.4578351 UPDATE
4 4444444444 2016-05-07 20:27:59.8852810 UPDATE
5 NULL 2016-05-07 20:27:59.8852810 UPDATE
6 NULL 2016-05-07 20:27:59.8852810 UPDATE
6 NULL 2016-05-07 20:27:59.9347658 DELETE
Using this data, you can see the progression of what has happened to the data, starting from the time the trigger was added, including seeing rows that have been deleted. It is a well-worn method to capture history, but it is very hard to work with in your queries to look back on history.
Using Temporal Extensions to Manage History
Temporal extensions will provide the same basic information that we provided using the trigger in the previous section, but there is one major difference: query support. If you want to see the current data, there is no change to your query. But if you want to see how the data looked at a particular point in time, the only change to your query is to specify the time (or time range) for which you want to see history.
The biggest limitation on temporal tables is that the history copy must version all columns in the table. So if you are using an nvarchar(max) or even text (which you really shouldn’t be!), it will work, but you could incur massive performance issues if your values are very large. Even in-memory tables support temporal, but version table will be an on-disk table.
There are other limitations, such as requiring a primary key; TRUNCATE TABLE not allowed; foreign key CASCADE operations are not allowed on the table; INSTEAD OF triggers are not allowed on the table; and replication use is limited. Several other configuration limitations are included in this more complete list of considerations and limitations from Microsoft: msdn.microsoft.com/en-us/library/mt604468.aspx. However, the limitations are not terribly constraining if you need the DML extensions described later in the section.
I will continue to use the HumanResources table we used in the previous section, but I will reset the data and drop the history table from the previous section:
TRUNCATE TABLE HumanResources.Employee;
INSERT INTO HumanResources.Employee (EmployeeNumber, InsurancePolicyNumber)
VALUES (’A0001’,’1111111111’),
(’A0002’,’2222222222’),
(’A0003’,’3333333333’),
(’A0004’,NULL),
(’A0005’,NULL),
(’A0006’,NULL);
GO
DROP TABLE HumanResourcesHistory.Employee;
DROP TRIGGER HumanResources.Employee$HistoryManagementTrigger;
In the following subsections, I will cover configuring temporal extensions, along with how you can coordinate changes to multiple rows and how you can change history if needed.
Configuring Temporal Extensions
Now let’s add temporal extensions to the HumanResources.Employee table. To do this, you are required to have two columns in the table, one for a start time, which you will see often in the documentation as SysStartTime, and one for an end time, usually named SysEndTime. I will use a naming standard that matches my normal standard so they match the table. These time range columns must be datetime2, but they can be any precision from 0 to 7. These values are used to denote the start and end time that the row is valid, so when we query the rows using temporal extensions, the one active row can be picked. The precision of the column will determine how many versions you can have per second. If datetime2(0) is used, then you can have one version per second; if datetime2(7), then ~9999999 versions per second. While it is not likely that many readers will have such needs to track changes to this deep level, I tend to use (7) just because it feels like it is safer, if more difficult to type. (For this book I will use datetime2(1) to allow for the limited amount of text real estate I have available.)
The following code snippet adds the columns and the settings that we will use once we turn on system versioning (then drops the temporary constraints):
ALTER TABLE HumanResources.Employee
ADD
RowStartTime datetime2(1) GENERATED ALWAYS AS ROW START NOT NULL
--HIDDEN can be specified
--so temporal columns don’t show up in SELECT * queries
--This default will start the history of all existing rows at the
--current time (system uses UTC time for these values)
CONSTRAINT DFLTDelete1 DEFAULT (SYSUTCDATETIME()),
RowEndTime datetime2(1) GENERATED ALWAYS AS ROW END NOT NULL --HIDDEN
--data needs to be the max for the datatype
CONSTRAINT DFLTDelete2 DEFAULT (CAST(’9999-12-31 23:59:59.9’ AS datetime2(1)))
, PERIOD FOR SYSTEM_TIME (RowStartTime, RowEndTime);
GO
--DROP the constraints that are just there due to data being in the table
ALTER TABLE HumanResources.Employee
DROP CONSTRAINT DFLTDelete1;
ALTER TABLE HumanResources.Employee
DROP CONSTRAINT DFLTDelete2;
The GENERATED ALWAYS AS ROW START and END pair tells the system to set the value when the table is completely configured for temporal support. If you are creating a new table and want to turn on temporal extensions, you will use the same columns and settings in the CREATE TABLE statement, but you won’t need the DEFAULT constraints.
The next step is to create a version table. There are two ways to do this. The easiest is to just let SQL Server build it for you. You can either specify a name or let SQL Server pick one for you. For example, if we want SQL Server to create the history table, we will just use
ALTER TABLE HumanResources.Employee
SET (SYSTEM_VERSIONING = ON);
Now you can look in the system metadata and see what has been added:
SELECT tables.object_id AS baseTableObject,
CONCAT(historySchema.name,’.’,historyTable.name) AS historyTable
FROM sys.tables
JOIN sys.schemas
ON schemas.schema_id = tables.schema_id
LEFT OUTER JOIN sys.tables AS historyTable
JOIN sys.schemas AS historySchema
ON historySchema.schema_id = historyTable.schema_id
ON TABLES.history_table_id = historyTable.object_id
WHERE schemas.name = ’HumanResources’
AND tables.name = ’Employee’;
This returns something like the following, with almost certainly a different base table object_id:
baseTableObject historyTable
--------------- ----------------------------------------------------------
1330103779 HumanResources.MSSQL_TemporalHistoryFor_1330103779
This leads to a predictable but ugly name. The table will have the same columns as the base table, but will have a few differences that we will look at later when we cover creating your own table and modifying the data in the table to use previous historical data you have saved off.
While there may not be a reason all that often to look at the temporal tables, it will be useful to be able to correlate the names of the tables without knowing the object_id. So let’s go ahead and name the table ourselves in the DDL. First we need to disconnect the history table that was created and drop it. Before you can do much to the base table, in fact, you will have to turn off the system versioning:
ALTER TABLE HumanResources.Employee
SET (SYSTEM_VERSIONING = OFF);
DROP TABLE HumanResources.MSSQL_TemporalHistoryFor_1330103779;
Now let’s specify the table name. If it is an existing table, there will be more to do, in that you may want to backfill up history (like if you were previously trigger to capture history). Just like in the trigger method, I will use a different schema for the history tables, but you can put it in the same schema if so desired (you always have to specify the schema in the HISTORY_TABLE clause):
ALTER TABLE HumanResources.Employee
--must be in the same database
SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = HumanResourcesHistory.Employee));
Taking a look at the metadata query we ran earlier, you can see that the history table has been set to a much better table name:
baseTableObject historyTable
--------------- --------------------------------------
1330103779 HumanResourcesHistory.Employee
I will cover security in more detail in Chapter 9, but understand that the user will need rights to the history table in order to SELECT the temporal aspects of the table. A user can modify the contents of the table with just INSERT, UPDATE, and/or DELETE rights.
Now you have configured the HumanResources.Employee table to capture history, starting with the time period of your ALTER statement to add the columns to the table. Check the table’s content using a SELECT statement:
SELECT *
FROM HumanResources.Employee;
This returns
EmployeeId EmployeeNumber InsurancePolicyNumber RowStartTime RowEndTime
----------- -------------- --------------------- --------------------- ---------------------
1 A0001 1111111111 2016-05-10 02:35:49.1 9999-12-31 23:59:59.9
2 A0002 2222222222 2016-05-10 02:35:49.1 9999-12-31 23:59:59.9
3 A0003 3333333333 2016-05-10 02:35:49.1 9999-12-31 23:59:59.9
4 A0004 NULL 2016-05-10 02:35:49.1 9999-12-31 23:59:59.9
5 A0005 NULL 2016-05-10 02:35:49.1 9999-12-31 23:59:59.9
6 A0006 NULL 2016-05-10 02:35:49.1 9999-12-31 23:59:59.9
You can see the new columns added for RowStart… and RowEnd… Time. Using these timeframes, you will be able to see the data at given points of time. So if you wanted to see how the table would have looked on the fourth of May, use the FOR SYSTEM_TIME clause on the table in the FROM clause using AS OF a current point in time, in our case where RowStartTime >= PassedValue > RowEndTime. There are four others: FROM, BETWEEN, CONTAINED IN, and ALL. I will mostly make use of AS OF and ALL in the book, as usually I want to see data at a point in time, or I want to see all history to show you what has changed.
SELECT *
FROM HumanResources.Employee FOR SYSTEM_TIME AS OF ’2016-05-04’;
This returns nothing, as the RowStart and End do not include that time period for any row in the table. The following query will (based on the data as I have it in my sample table) return the same as the previous query to get all rows in the base table, since 2016-05-11 is after all of the RowStartTime values in the base table:
SELECT *
FROM HumanResources.Employee FOR SYSTEM_TIME AS OF ’2016-05-11’;
Dealing with Temporal Data One Row at a Time
When your application modifies a single row in a table that has temporal extensions enabled, there really isn’t much you need to consider in your application. Every INSERT, UPDATE, and DELETE operation will just capture changes, and let you query each table that is involved in the operation at a point in time. You can use the FROM FOR SYSTEM_TIME clause on any statement where a FROM clause makes sense. And you can use it on all tables that are used in a query, or just some. For example, the following is perfectly acceptable:
FROM Table1
JOIN Table2 FOR SYSTEM_TIME AS OF ’Time Literal’
ON ...
JOIN Table3 FOR SYSTEM_TIME AS OF ’Time Literal’
ON ...
And you can even do this:
FROM Table1 FOR SYSTEM_TIME AS OF ’Time Literal 1’
JOIN Table1 as DifferentLookAtTable1 FOR SYSTEM_TIME AS OF ’Time Literal 2’
ON ...
In the next section we will look more at coordinating modifications on multiple rows (in the same or multiple tables), but in this section, let’s look at the basic mechanics.
First let’s modify some data, to show what this looks like:
UPDATE HumanResources.Employee
SET InsurancePolicyNumber = ’4444444444’
WHERE EmployeeId = 4;
So let’s look at the data:
SELECT *
FROM HumanResources.Employee
WHERE EmployeeId = 4;
As expected:
EmployeeId EmployeeNumber InsurancePolicyNumber RowStartTime RowEndTime
----------- -------------- --------------------- --------------------- ---------------------
4 A0004 4444444444 2016-05-10 02:46:58.3 9999-12-31 23:59:59.9
But check just before the RowStartTime (.3 seconds to be precise):
SELECT *
FROM HumanResources.Employee FOR SYSTEM_TIME AS OF ’2016-05-10 02:46:58’
WHERE EmployeeId = 4;
and the data looks just the same as it did pre-UPDATE execution:
EmployeeId EmployeeNumber InsurancePolicyNumber RowStartTime RowEndTime
----------- -------------- --------------------- --------------------- ---------------------
4 A0004 NULL 2016-05-10 02:35:49.1 2016-05-10 02:46:58.3
This is where ALL comes in handy, so you can see all of the changes:
SELECT *
FROM HumanResources.Employee FOR SYSTEM_TIME ALL
ORDER BY EmployeeId, RowStartTime;
This returns all of the valid history rows (ones where RowStartTime <> RowEndTime, a situation that I will cover in a moment), including previous versions of data:
EmployeeId EmployeeNumber InsurancePolicyNumber RowStartTime RowEndTime
----------- -------------- --------------------- --------------------- ---------------------
1 A0001 1111111111 2016-05-10 02:35:49.1 9999-12-31 23:59:59.9
2 A0002 2222222222 2016-05-10 02:35:49.1 9999-12-31 23:59:59.9
3 A0003 3333333333 2016-05-10 02:35:49.1 9999-12-31 23:59:59.9
4 A0004 NULL 2016-05-10 02:35:49.1 2016-05-10 02:46:58.3
4 A0004 4444444444 2016-05-10 02:46:58.3 9999-12-31 23:59:59.9
5 A0005 NULL 2016-05-10 02:35:49.1 9999-12-31 23:59:59.9
6 A0006 NULL 2016-05-10 02:35:49.1 9999-12-31 23:59:59.9
Now, let’s delete EmployeeId = 6 (we don’t like him…he ate our gluten-free, 10-calorie cupcake):
DELETE HumanResources.Employee
WHERE EmployeeId = 6;
Then check out the data:
SELECT *
FROM HumanResources.Employee FOR SYSTEM_TIME ALL
WHERE EmployeeId = 6
ORDER BY EmployeeId, RowStartTime;
Now you can see that the RowEndTime value is not ’9999-12-31 23:59:59.9’ but is set to the time of the DELETE:
EmployeeId EmployeeNumber InsurancePolicyNumber RowStartTime RowEndTime
----------- -------------- --------------------- --------------------- ---------------------
6 A0006 NULL 2016-05-10 02:35:49.1 2016-05-10 05:36:28.3
The reasons is that at that point in time, it did exist, but now it doesn’t. If he (EmployeeId = 6) apologizes and gets added back with the same surrogate key value, there would be a gap in time sequence that would correspond to the time when the row was removed.
A word of caution about versions. When the table has SYSTEM_VERSIONING ON, every update will cause a new version, even if no data changes. So,
UPDATE HumanResources.Employee
SET EmployeeNumber = EmployeeNumber
WHERE EmployeeId = 4;
is going to lead to a new version:
SELECT *
FROM HumanResources.Employee FOR SYSTEM_TIME ALL
WHERE EmployeeId = 4
ORDER BY EmployeeId, RowStartTime;
This returns
EmployeeId EmployeeNumber InsurancePolicyNumber RowStartTime RowEndTime
----------- -------------- --------------------- --------------------- ---------------------
4 A0004 NULL 2016-05-10 02:35:49.1 2016-05-10 02:46:58.3
4 A0004 4444444444 2016-05-10 02:46:58.3 2016-05-10 02:54:11.8
4 A0004 4444444444 2016-05-10 02:54:11.8 2016-05-10 02:59:36.6
But you can see that there is no difference between the second row in the output and the third. Now, let’s do five updates, immediately following one another using SSMS’s GO # extension:
UPDATE HumanResources.Employee
SET EmployeeNumber = EmployeeNumber
WHERE EmployeeId = 4;
GO 5
Looking at the data
SELECT *
FROM HumanResources.Employee FOR SYSTEM_TIME ALL
WHERE EmployeeId = 4
ORDER BY EmployeeId, RowStartTime;
we see only five rows, but there should be eight, right? In this case, some rows were updated within what the system registered as the same time, so they are hidden. You can only see them in the history table:
EmployeeId EmployeeNumber InsurancePolicyNumber RowStartTime RowEndTime
----------- -------------- --------------------- --------------------- ---------------------
4 A0004 NULL 2016-05-10 02:35:49.1 2016-05-10 02:46:58.3
4 A0004 4444444444 2016-05-10 02:46:58.3 2016-05-10 02:54:11.8
4 A0004 4444444444 2016-05-10 02:54:11.8 2016-05-10 02:59:36.6
4 A0004 4444444444 2016-05-10 02:59:36.6 2016-05-10 03:01:06.5
4 A0004 4444444444 2016-05-10 03:01:06.5 9999-12-31 23:59:59.9
Three of the rows have the same RowStart and RowEnd times (your mileage may vary, depending on the precision of your row times; sometimes I did not see this phenomenon, but most every time I did, even using datetime2(7)):
SELECT *
FROM HumanResourcesHistory.Employee
WHERE EmployeeId = 4
AND RowStartTime = RowEndTime;
Here you see the remaining three rows:
EmployeeId EmployeeNumber InsurancePolicyNumber RowStartTime RowEndTime
----------- -------------- --------------------- --------------------- ---------------------
4 A0004 4444444444 2016-05-10 03:01:06.5 2016-05-10 03:01:06.5
4 A0004 4444444444 2016-05-10 03:01:06.5 2016-05-10 03:01:06.5
4 A0004 4444444444 2016-05-10 03:01:06.5 2016-05-10 03:01:06.5
If your application is very chatty and updates the same row over and over, you could end up with a lot of useless version rows.
Tip Another way you end up with start and end times that are the same is when you modify the same row multiple times in the same transaction. All of these modifications are hidden to the FOR SYSTEM_TIME, but can be seen in the history table.
Now that you have started accumulating history, you are free to query your data at any point in time, down to whatever your precision is set to. How did the data look yesterday at this point in time? Or the previous day at 11:00 AM? Compared to now? The value of this could be enormous. However, as we will dig deeper into in the next section, it brings up a problem. Now you can’t limit your thinking to just one point in time. Every time slice should be synchronized.
What if you accidentally set the InsurancePolicyNumber to NULL? Or you set it to an incorrect value? In a regular situation, you update the row and all is great. But if you are using your temporal versions to look back at your database at a point in time, reports may not look correct. You cannot simply update the history table, but rather have to turn off versioning, fix the history, and turn versioning back on. I will cover the process in the forthcoming section “Setting/Rewriting History,” but it is something to be done afterhours when no one can access the table, which is not optimal.
Dealing with Multiple Rows in One or More Tables
At the end of the previous section, I started the discussion about thinking temporally with one row at a time. Versions throughout history need coordinated data that does not tell any falsehoods (even if you have cleared them up later in your base table). In this section we will extend the concept to multiple rows. If you change two rows in a table, the only easy way to make sure their historical time values are the same if you are building your own using triggers is to make sure you do your update in a single statement. The temporal extensions give you a much better method of synchronizing changes. Basically, the start and end time columns are set at COMMIT time in the transaction. If you want to update all of the InsurancePolicyNumber values to include the letters ’IN’ as a prefix, and for some reason, you were unable to do this in a single statement (not every example can be realistic!), you wrap the change into a BEGIN and COMMIT transaction:
BEGIN TRANSACTION;
UPDATE HumanResources.Employee
SET InsurancePolicyNumber = CONCAT(’IN’,RIGHT(InsurancePolicyNumber,8))
WHERE EmployeeId = 1;
WAITFOR DELAY ’00:00:01’;
UPDATE HumanResources.Employee
SET InsurancePolicyNumber = CONCAT(’IN’,RIGHT(InsurancePolicyNumber,8))
WHERE EmployeeId = 2;
WAITFOR DELAY ’00:00:01’;
UPDATE HumanResources.Employee
SET InsurancePolicyNumber = CONCAT(’IN’,RIGHT(InsurancePolicyNumber,8))
WHERE EmployeeId = 3;
WAITFOR DELAY ’00:00:01’;
UPDATE HumanResources.Employee
SET InsurancePolicyNumber = CONCAT(’IN’,RIGHT(InsurancePolicyNumber,8))
WHERE EmployeeId = 4;
COMMIT TRANSACTION;
Looking at the data
SELECT *
FROM HumanResources.Employee
WHERE InsurancePolicyNumber IS NOT NULL
ORDER BY EmployeeId;
you can see that the RowStartTime values for every row that was updated is exactly the same, even if the UPDATE statements weren’t actually executed at the same point in time:
EmployeeId EmployeeNumber InsurancePolicyNumber RowStartTime RowEndTime
----------- -------------- --------------------- --------------------- ---------------------
1 A0001 IN11111111 2016-05-10 03:08:50.9 9999-12-31 23:59:59.9
2 A0002 IN22222222 2016-05-10 03:08:50.9 9999-12-31 23:59:59.9
3 A0003 IN33333333 2016-05-10 03:08:50.9 9999-12-31 23:59:59.9
4 A0004 IN44444444 2016-05-10 03:08:50.9 9999-12-31 23:59:59.9
For simplicity’s sake, I won’t try to show multiple tables as an example, but the same thing holds true across multiple tables. Every row in a temporal table that is affected in a transaction will have the same start time (and end time) in the corresponding history table. This will allow us to see that before the change was made, policy numbers started with ’AA’, but now they start with ’IN’, and the user doesn’t have to think that rows 1-10000 have one start time and rows 10001+ have a different start time.
Setting/Rewriting History
History rows cannot be modified at all if the table is connected to a table to represent historical rows, but there are two major places where you may need to change history rows:
As a very simple example, let’s change all of our history to go back to the start of the year 2016. First, let’s find the time we started keeping temporal data on the HumanResources.Employee table:
SELECT MIN(RowStartTime)
FROM HumanResources.Employee FOR SYSTEM_TIME ALL;
This returns (for me, on my 100th+ time of running this script to get it just right).
---------------------------
2016-05-10 02:35:49.1
This is the time value we will need to create new version rows later. Next, we will turn off versioning, which will turn HumanResourcesHistory.Employee into a regular table that can be modified:
ALTER TABLE HumanResources.Employee
SET (SYSTEM_VERSIONING = OFF);
The next step is to update all of the rows that have ’2016-05-10 02:35:49.1’ as their start time to ’2016-01-01’. You will not want to do this for any row’s minimum, because that means they were started after versioning was turned on. (And in a real case, you will want to do a lot of research to determine what times make sense for all of the rows, as you will want to figure out what their actual first time of existence was.)
--Rows that have been modified
UPDATE HumanResourcesHistory.Employee
SET RowStartTime = ’2016-01-01 00:00:00.0’
WHERE RowStartTime = ’2016-05-10 02:35:49.1’; --value from previous select if you are
--following along in the home game
Additionally, you will need to generate history rows for rows that had not been modified yet, as you cannot change the RowStartTime:
INSERT INTO HumanResourcesHistory.Employee (EmployeeId, EmployeeNumber,
InsurancePolicyNumber, RowStartTime, RowEndTime)
SELECT EmployeeId, EmployeeNumber, InsurancePolicyNumber,
’2016-01-01 00:00:00.0’,
RowStartTime --use the rowStartTime in the row for the endTime of the history
FROM HumanResources.Employee
WHERE NOT EXISTS (SELECT *
FROM HumanResourcesHistory.Employee AS HistEmployee
WHERE HistEmployee.EmployeeId = Employee.EmployeeId);
If you have done it correctly, you will have one row per Employee row that you want to go back to the 1st of January returned in the following query:
SELECT *
FROM HumanResourcesHistory.Employee
WHERE RowStartTime = ’2016-01-01 00:00:00.0’
ORDER BY EmployeeId;
Which we do:
EmployeeId EmployeeNumber InsurancePolicyNumber RowStartTime RowEndTime
----------- -------------- --------------------- --------------------- ---------------------
1 A0001 1111111111 2016-01-01 00:00:00.0 2016-05-10 03:08:50.9
2 A0002 2222222222 2016-01-01 00:00:00.0 2016-05-10 03:08:50.9
3 A0003 3333333333 2016-01-01 00:00:00.0 2016-05-10 03:08:50.9
4 A0004 NULL 2016-01-01 00:00:00.0 2016-05-10 02:46:58.3
5 A0005 NULL 2016-01-01 00:00:00.0 2016-05-10 02:35:49.1
6 A0006 NULL 2016-01-01 00:00:00.0 2016-05-10 02:35:49.1
Then turn back on system versioning:
ALTER TABLE HumanResources.Employee
SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = HumanResourcesHistory.Employee));
When you run the following:
SELECT *
FROM HumanResources.Employee FOR SYSTEM_TIME AS OF ’2016-01-01 00:00:00.0’
ORDER BY EmployeeId;
you can see that now your data seems to have existed since the start of 2016, instead of when I was writing this chapter:
EmployeeId EmployeeNumber InsurancePolicyNumber RowStartTime RowEndTime
----------- -------------- --------------------- --------------------- ---------------------
1 A0001 1111111111 2016-01-01 00:00:00.0 2016-05-10 03:08:50.9
2 A0002 2222222222 2016-01-01 00:00:00.0 2016-05-10 03:08:50.9
3 A0003 3333333333 2016-01-01 00:00:00.0 2016-05-10 03:08:50.9
4 A0004 NULL 2016-01-01 00:00:00.0 2016-05-10 02:46:58.3
5 A0005 NULL 2016-01-01 00:00:00.0 2016-05-10 02:35:49.1
6 A0006 NULL 2016-01-01 00:00:00.0 2016-05-10 02:35:49.1
You can make other changes to the history while the tables are not paired, but this is definitely one of the easiest. In the downloadable code I have an additional example that will change EmployeeNumber ’A0005’ to have had insurance since the start of March. This will entail splitting a history in two, so you have one history row for the before image, and another for the after image. It is messy and tedious, so you will want to set up repeatable code processes if you have to modify history in anything other than a very simple manner on a repeating process.
Hierarchies are a peculiar topic in relational databases. Hierarchies happen everywhere in the “real” world, starting with a family tree, corporation organizational charts, species charts, and parts breakdowns. Even orders for products form a hierarchy with customers buying products, which then link to other customers who bought the same products. In the Lego example from earlier in this chapter, if modeled to completion, it would include a hierarchy for sets as sometimes sets are parts of other sets to create a complete bill of materials for any set.
Structure-wise, there are two sorts of hierarchies you will face in the real world, a tree structure, where every item can have only one parent, and graphs, where you can have more than one parent in the structure. The challenge is to implement hierarchies in such a manner that they are optimal for your needs, particularly as they relate to the operations of your OLTP database. In this section, we will briefly go over the two major methods for implementing hierarchies that are the most common for use in SQL Server:
Finally, we’ll take a brief architectural overview of a few other methods made popular by a couple of famous data architects; these methods can be a lot faster to use but require a lot more overhead to maintain, but sometimes, they’re just better when your hierarchy is static and you need to do a lot of processing or querying.
The examples in the text will be generally limited to creating some data and adding data. However, in the downloads you will find a more complete coverage of hierarchies as a companion to the text, with fully fleshed-out code samples to use as a jumping-off point to implement any of the subpatterns presented here. Each method covered here will also include code to
The last item will be used as a comparison activity to demonstrate the relative speed of each algorithm. Each method that is represented will include a method to load around a half million nodes into the hierarchy to test performance.
Self-Referencing/Recursive Relationship/Adjacency List
The self-referencing relationship is definitely the easiest method to implement a hierarchy. I covered it a bit back in Chapter 3 when I discussed recursive relationships. They are considered recursive because of the way they are worked with, in both procedural and relational code. In this section, I will cover trees (which are single-parent hierarchies) and then graphs, which allow every node to have multiple parents.
Trees (Single-Parent Hierarchies)
To get started, I will create a table that implements a corporate structure with just a few basic attributes, including a self-referencing column. The goal will be to implement a corporate structure like the one shown in Figure 8-3.
Figure 8-3. Demonstration company hierarchy
The most important thing to understand when dealing with trees in SQL Server is that the most efficient way to work with trees in a procedural language is not the most efficient way to work with data in a set-based relational language. Both use recursion, but the implementation is very different. If you were searching a tree in a functional language, a very common algorithm would be to traverse the tree one node at a time, from the topmost item, down to the lowest in the tree, and then work your way around to all of the nodes. This is generally done using a recursive algorithm, based on the ordering of the items in the tree. In Figure 8-4, I show this for the left side of the tree.
Figure 8-4. Sample tree structure searched depth first
This is referred to as a depth-first search and is fast when the language is optimized for single-instance-at-a-time access, particularly when you can load the entire tree structure into RAM. If you attempted to implement this using T-SQL, you would find that it is obnoxiously slow, as most any iterative processing can be. In SQL, we use what is called a breadth-first search that can be scaled to many more nodes, because the number of queries is limited to the number of levels in the hierarchy. The limitations here pertain to the size of the temporary storage needed and how many rows you end up with on each level. Joining to an unindexed temporary set is bad in your code, and it is not good in SQL Server’s algorithms either.
A tree can be broken down into levels, from the parent row that you are interested in. From there, the levels increase as you are one level away from the parent, as shown in Figure 8-5.
Figure 8-5. Sample tree structure with levels
Now, working with this structure will deal with each level as a separate set, joined to the matching results from the previous level. You iterate one level at a time, matching rows from one level to the next. This reduces the number of queries to use the data down to three, rather than a minimum of eight, plus the overhead of going back and forth from parent to child. In SQL, we will use a recursive CTE (common table expression), where the recursion isn’t based on ordering, but rather the anchor SQL referencing the object it is a part of.
To demonstrate working with adjacency list tables, let’s create a table to represent a hierarchy of companies that are parent to one another. The goal of our table will be to implement the structure, as shown in Figure 8-6.
Figure 8-6. Diagram of basic adjacency list
So we will create the following table:
CREATE SCHEMA Corporate;
GO
CREATE TABLE Corporate.Company
(
CompanyId int NOT NULL CONSTRAINT PKCompany PRIMARY KEY,
Name varchar(20) NOT NULL CONSTRAINT AKCompany_Name UNIQUE,
ParentCompanyId int NULL
CONSTRAINT Company$isParentOf$Company REFERENCES Corporate.Company(companyId)
);
Then, load data to set up a table like the graphic in Figure 8-3:
INSERT INTO Corporate.Company (CompanyId, Name, ParentCompanyId)
VALUES (1, ’Company HQ’, NULL),
(2, ’Maine HQ’,1), (3, ’Tennessee HQ’,1),
(4, ’Nashville Branch’,3), (5, ’Knoxville Branch’,3),
(6, ’Memphis Branch’,3), (7, ’Portland Branch’,2),
(8, ’Camden Branch’,2);
Now, taking a look at the data
SELECT *
FROM Corporate.Company;
returns the following: