Chapter 7. Temporal Tables

Databases that serve business applications often support temporal data. For example, suppose a contract with a supplier is valid for a limited time only. It can be valid from a specific point in time onward, or it can be valid for a specific time interval—from a starting time point to an ending time point. In addition, on many occasions, you need to audit all changes in one or more tables. You might also need to be able to show the state in a specific point in time, or all changes made to a table in a specific period of time. From a data integrity perspective, you might need to implement many additional temporal-specific constraints.

This chapter introduces temporal problems, deals with manual solutions, shows out-of-the-box solutions in SQL Server 2016, and deals with the following topics:

  • Defining temporal data
  • Using temporal data in SQL Server before version 2016
  • System-versioned tables in SQL Server 2016
  • What kind of temporal support is still missing in SQL Server 2016?

What is temporal data?

In a table with temporal support, the header represents a predicate with at least one time parameter that represents when the rest of the predicate is valid—the complete predicate is therefore a timestamped predicate. Rows represent timestamped propositions, and the valid time period of a row is expressed with one of two attributes: since (for semi temporal data), or during (for fully temporal data); the latter attribute is usually represented with two values, from and to.

The following shows the original and two additional timestamped versions of an exemplary Suppliers table.

What is temporal data?

Figure 7.1: Original Suppliers table and two tables with temporal support

From the original table header, you can read a predicate saying that a supplier with identification supplierid, named companyname, having contact contactname, and so on, is currently our supplier, or is currently under contract. You can pretend that this supplier is the supplier forever, from the beginning of time until the end of time. The Suppliers_Since table header has this predicate modified with a time parameter: a supplier with the identification supplierid, named companyname, having contact contactname, and so on, is under contract since a specific point in time. In the Suppliers_FromTo table, the header has this predicate modified with an even more specific time attribute: a supplier with ID supplierid, named companyname, having contact contactname, and so on, is (or was, or will be, depending on the current time) under contract from a specific point in time to another point in time.

There is no need to implement semi-temporal tables. You can simply use the maximal possible date and time for the to time point. Therefore, the rest of the chapter focuses on fully temporal data only.

In this section, you will learn about:

  • Types of temporal tables
  • Temporal data algebra
  • Temporal constraints
  • Temporal data implementation in SQL Server before version 2016
  • Optimization of temporal queries

Types of temporal table

You might have noticed during the introduction part at the beginning of this chapter that there are two kinds of temporal issues. The first one is the validity time of the proposition—in which period the proposition that a timestamped row in a table represents was actually true. For example, a contract with a supplier was valid only from time point 1 to time point 2. This kind of validity time is meaningful to people, meaningful for the business. The validity time is also called application time or human time. We can have multiple valid periods for the same entity. For example, the aforementioned contract that was valid from time point 1 to time point 2 might also be valid from time point 7 to time point 9.

The second temporal issue is the transaction time. A row for the contract mentioned above was inserted in time point 1 and was the only version of the truth known to the database until somebody changed it, or even till the end of time. When the row is updated at time point 2, the original row is known as being true to the database from time point 1 to time point 2. A new row for the same proposition is inserted with time valid for the database from time point 2 to the end of the time. The transaction time is also known as system time or database time.

Database management systems can, and should, maintain transaction times automatically. The system has to insert a new row for every update and change the transaction validity period in the original row. The system also needs to allow the querying of the current and historical data, and show the state at any specific point in time. There are not many additional issues with the transaction time. The system has to take care that the start time of the database time period is lower than the end time, and that two periods in two rows for the same entity don't overlap. The database system has to know a single truth at a single point in time. Finally, the database does not care about the future. The end of the database time of the current row is actually the end of time. Database time is about present and past states only.

Implementing the application time might be much more complex. Of course, you might have validity time periods that end, or even begin, in the future. Database management systems can't take care of future times automatically, or check whether they are correct. Therefore, you need to take care of all the constraints you need. The database management system can only help you by implementing time-aware objects, such as declarative constraints. For example, a foreign key from the products to the suppliers table, which ensures that each product has a supplier, could be extended to check not only whether the supplier for the product exists, but also whether the supplier is a valid supplier at the time point when the foreign key is checked.

So far, I've talked about time as though it consists of discrete time points; I used the term time point as if it represented a single, indivisible, infinitely small point in time. Of course, time is continuous. Nevertheless, in common language, we talk about time as though it consists of discrete points. We talk in days, hours, and other time units; the granularity we use depends on what we are talking about. The time points we are talking about are actually intervals of time; a day is an interval of 24 hours, an hour is an interval of 60 minutes, and so on.

So, what is the granularity level of the time points for the system and application intervals? For the system times, the decision is simple: use the lowest granularity level that a system supports. In SQL Server, with the datetime2 data type, you can support 100 nanoseconds granularity. For the application time, the granularity depends on the business problem. For example, for a contract with a supplier, the day level could work well. For measuring the intervals when somebody is using a service, such as a mobile phone service, granularity of a second may be more appropriate. This looks very complex. However, you can make a generalized solution for the application times. You can translate time points to integers, and then use a lookup table that gives you the context and gives meaning to the integer time points.

Of course, you can also implement both application and system-versioned tables. Such tables are called bitemporal tables.

Allen's interval algebra

The theory for the temporal data in a relational model started to evolve more than thirty years ago. I will define quite a few useful Boolean operators and a couple of operators that work on intervals and return an interval. These operators are known as Allen's operators, named after J. F. Allen, who defined a number of them in a 1983 research paper on temporal intervals. All of them are still accepted as valid and needed. A database management system could help you when dealing with application times by implementing these operators out-of-the-box.

Let me first introduce the notation I will use. I will work on two intervals, denoted i1 and i2. The beginning time point of the first interval is b1, and the end is e1; the beginning time point of the second interval is b2 and the end is e2. Allen's Boolean operators are defined in the following table:

Name

Notation

Definition

Equals

(i1 = i2)

(b1 = b2) AND (e1 = e2)

Before

(ibefore i2)

(e< b2)

After

(iafter i2)

(ibefore i1)

Includes

(i i2)

(b≤ b2) AND (e≥ e2)

Properly includes

(i1  i2)

(i1  i2) AND (i≠ i2)

Meets

(imeets i2)

(b2 = e1 + 1) OR (b1 = e2 + 1)

Overlaps

(ioverlaps i2)

(b≤ e2) AND (b≤ e1)

Merges

(imerges i2)

(ioverlaps i2) OR (imeets i2)

Begins

(ibegins i2)

(b= b2) AND (e≤ e2)

Ends

(iends i2)

(e= e2) AND (b≥ b2)

In addition to Boolean operators, there are three of Allen's operators that accept intervals as input parameters and return an interval. These operators constitute simple interval algebra. Note that those operators have the same name as relational operators you are probably already familiar with: Union, Intersect, and Minus. However, they don't behave exactly like their relational counterparts. In general, using any of the three interval operators, if the operation would result in an empty set of time points or in a set that cannot be described by one interval, then the operator should return NULL. A union of two intervals makes sense only if the intervals meet or overlap. An intersection makes sense only if the intervals overlap. The Minus interval operator makes sense only in some cases. For example, (3:10) Minus (5:7) returns NULL because the result cannot be described by one interval. The following table summarizes the definition of the operators of interval algebra:

Name

Notation

Definition

Union

(i1 union i2)

(Min(b1, b2) : Max(e1, e2)), when (i1merges i2);

NULL otherwise

Intersect

(iintersect i2)

(Max(b1, b2) : Min(e1, e2)), when (i1overlaps i2);

NULL otherwise

Minus

(iminus i2)

(b1: Min(b2 - 1, e1)), when (b1 < b2) AND (e1≤ e2);

(Max(e2 + 1, b1) : e1), when (b1≥ b2) AND (e1> e2);

NULL otherwise

The following figure shows the interval algebra operators graphically:

Allen's interval algebra

Figure 7.2: Interval algebra operators

Temporal constraints

Depending on the business problem you are solving, you might need to implement many temporal constraints. Remember that for the application time, SQL Server does not help you much. You need to implement the constraints in your code using SQL Server declarative constraints where possible. However, most of the constraints you need to implement through custom code, either in triggers or in stored procedures, or even in the application code.

Imagine the Suppliers table example. One supplier can appear multiple times in the table because the same supplier could be under contract for separate periods of time. For example, you could have two tuples like this in the relation with the shortened header Suppliers (supplierid, companyname, from, to):

{2, Supplier VHQZD, d05, d07} 
{2, Supplier VHQZD, d12, d27} 

Here are some possible constraints you might need to implement:

  • To should never be less than from
  • Two contracts for the same supplier should not have overlapping time intervals
  • Two contracts for the same supplier should not have abutting time intervals
  • No supplier can be under two distinct contracts at the same point in time
  • There should be no supplies from a supplier at a point in time when the supplier was not under a contract

You might find even more constraints. Anyway, SQL Server 2016 brings support for the system-versioned tables only. To maintain the application validity times, you need to develop code by yourself.

Temporal data in SQL Server before 2016

As mentioned, in SQL Server versions before 2016, you need to take care of temporal data by yourself. Even in SQL Server 2016, you still need to take care of the human, or application times. The following code shows an example of how to create a table with validity intervals expressed with the b and e columns, where the beginning and the end of an interval are represented as integers. The table is populated with demo data from the WideWorldImporters.Sales.OrderLines table. Note that there are multiple versions of the WideWorldImporters database, so you might get slightly different results.

I used the WideWorldImporters-Standard.bak backup file from https://github.com/Microsoft/sql-server-samples/releases/tag/wide-world-importers-v1.0 to restore this demo database on my SQL Server instance.

USE tempdb; 
GO 
SELECT OrderLineID AS id, 
 StockItemID * (OrderLineID % 5 + 1) AS b, 
 LastEditedBy + StockItemID * (OrderLineID % 5 + 1) AS e 
INTO dbo.Intervals 
FROM WideWorldImporters.Sales.OrderLines; 
-- 231412 rows 
GO 
ALTER TABLE dbo.Intervals ADD CONSTRAINT PK_Intervals PRIMARY KEY(id); 
CREATE INDEX idx_b ON dbo.Intervals(b) INCLUDE(e); 
CREATE INDEX idx_e ON dbo.Intervals(e) INCLUDE(b); 
GO 

Please also note the indexes created. The two indexes are optimal for searches at the beginning of an interval or at the end of an interval. You can check the minimal begin and maximal end of all intervals with the following code:

SELECT MIN(b), MAX(e) 
FROM dbo.Intervals; 

You can see in the results that the minimal beginning time point is 1 and maximal ending time point is 1155. Now you need to give the intervals some time context. In this case, a single time point represents a day. The following code creates a date lookup table and populates it. Note that the starting date is July 1, 2014.

CREATE TABLE dbo.DateNums 
 (n INT NOT NULL PRIMARY KEY, 
  d DATE NOT NULL); 
GO 
DECLARE @i AS INT = 1,  
 @d AS DATE = '20140701'; 
WHILE @i <= 1200 
BEGIN 
INSERT INTO dbo.DateNums 
 (n, d) 
SELECT @i, @d; 
SET @i += 1; 
SET @d = DATEADD(day,1,@d); 
END; 
GO 

Now you can join the dbo.Intervals table to the dbo.DateNums table twice, to give the context to the integers that represent the beginning and the end of the intervals:

SELECT i.id, 
 i.b, d1.d AS dateB, 
 i.e, d2.d AS dateE 
FROM dbo.Intervals AS i 
 INNER JOIN dbo.DateNums AS d1 
  ON i.b = d1.n 
 INNER JOIN dbo.DateNums AS d2 
  ON i.e = d2.n 
ORDER BY i.id; 

The abbreviated result from the previous query is:

id     b      dateB         e      dateE
--     ---    ----------    ---    ----------
1      328    2015-05-24    332    2015-05-28
2      201    2015-01-17    204    2015-01-20
3      200    2015-01-16    203    2015-01-19

Now you can see which day is represented by which integer.

Optimizing temporal queries

Note

The problem with temporal queries is that when reading from a table, SQL Server can use only one index, and successfully eliminate rows that are not candidates for the result from one side only, and then scan the rest of the data. For example, you need to find all intervals in the table which overlap with a given interval. Remember, two intervals overlap when the beginning of the first one is lower or equal to the end of the second one, and the beginning of the second one is lower or equal to the end of the first one, or mathematically, when (b≤  e2)AND (b≤  e1).

The following query searched for all of the intervals that overlap with the interval (10, 30). Note that the second condition (b≤ e1) is turned around to (e ≥ b2) for simpler reading (the beginning and the end of intervals from the table are always on the left side of the condition). The given, or the searched interval, is at the beginning of the timeline for all intervals in the table.

SET STATISTICS IO ON; 
DECLARE @b AS INT = 10, 
 @e AS INT = 30; 
SELECT id, b, e 
FROM dbo.Intervals 
WHERE b <= @e 
  AND e >= @b 
OPTION (RECOMPILE); 
GO 

The query used 36 logical reads. If you check the execution plan, you can see that the query used the index seek in the idx_b index with the seek predicate [tempdb].[dbo].[Intervals].b <= Scalar Operator((30)) and then scanned the rows and selected the resulting rows using the residual predicate [tempdb].[dbo].[Intervals].[e]>=(10). Because the searched interval is at the beginning of the timeline, the seek predicate successfully eliminated the majority of the rows; only a few intervals in the table have a beginning point lower than or equal to 30.

You would get a similarly efficient query if the searched interval was at the end of the timeline, just that SQL Server would use the idx_e index to seek. However, what happens if the searched interval is in the middle of the timeline, as the following query shows?

DECLARE @b AS INT = 570, 
 @e AS INT = 590; 
SELECT id, b, e 
FROM dbo.Intervals 
WHERE b <= @e 
  AND e >= @b 
OPTION (RECOMPILE); 
GO 

This time, the query used 111 logical reads. With a bigger table, the difference from the first query would be even bigger. If you check the execution plan, you will see that SQL Server used the idx_e index with the [tempdb].[dbo].[Intervals].e >= Scalar Operator((570)) seek predicate and [tempdb].[dbo].[Intervals].[b]<=(590) residual predicate. The seek predicate excludes approximately half of the rows from one side, while half of the rows from the other side are scanned and the resulting rows are extracted with the residual predicate.

There is a solution which would use that index for elimination of the rows from both sides of the searched interval by using a single index. The following figure shows this logic:

Optimizing temporal queries

Figure 7.3: Optimizing temporal query

The intervals in the figure are sorted by the lower boundary, representing SQL Server's usage of the idx_b index. Eliminating intervals from the right side of the given (searched) interval is simple: just eliminate all intervals where the beginning is at least one unit bigger (more to the right) of the end of the given interval. You can see this boundary in the figure denoted with the rightmost dotted line. However, eliminating from the left is more complex. In order to use the same index, the idx_b index for eliminating from the left, I need to use the beginning of the intervals in the table in the WHERE clause of the query. I have to go to the left side, away from the beginning of the given (searched) interval, at least for the length of the longest interval in the table, which is marked with a callout in the figure. The intervals that begin before the left yellow line cannot overlap with the given (blue) interval.

Since I already know that the length of the longest interval is 20, I can write an enhanced query in quite a simple way:

DECLARE @b AS INT = 570, 
 @e AS INT = 590; 
DECLARE @max AS INT = 20; 
SELECT id, b, e 
FROM dbo.Intervals 
WHERE b <= @e AND b >= @b - @max 
  AND e >= @b AND e <= @e + @max 
OPTION (RECOMPILE); 

This query retrieves the same rows as the previous one with 20 logical reads only. If you check the execution plan, you can see that the idx_b index was used, with the seek predicate Seek Keys[1]: Start: [tempdb].[dbo].[Intervals].b >= Scalar Operator((550)), End: [tempdb].[dbo].[Intervals].b <= Scalar Operator((590)), which successfully eliminated rows from both sides of the timeline, and then the residual predicate [tempdb].[dbo].[Intervals].[e]>=(570) AND [tempdb].[dbo].[Intervals].[e]<=(610) was used to select rows from a very limited partial scan.

Of course, the figure could be turned around to cover cases when the idx_e index would be more useful. With this index, the elimination from the left is simple—eliminate all of the intervals which end at least one unit before the beginning of the given interval. This time, the elimination from the right is more complex—the end of the intervals in the table cannot be more to the right than the end of the given interval plus the maximal length of all intervals in the table.

Please note that this performance is the consequence of the specific data in the table. The maximal length of an interval is 20. This way, SQL Server can very efficiently eliminate intervals from both sides. However, if there is only one long interval in the table, the code will become much less efficient, because SQL Server would not be able to eliminate a lot of rows from one side, either left or right, depending on which index it used. Anyway, in real life, interval length does not vary a lot, so this optimization technique might be very useful, especially because it is simple.

After you have finished with temporal queries in this section, you can clean up your tempdb database with the following code:

DROP TABLE dbo.DateNums; 
DROP TABLE dbo.Intervals; 

Temporal features in SQL:2011

Temporal data support was introduced in the most recent revision of the SQL standard—SQL:2011. There were also attempts to define the support in previous standard versions, but without success (TSQL2 extensions in 1995). They were not widely accepted and vendors did not implement them.

Finally, the ANSI SQL:2011 standard proposed how temporal data should be supported in relational database management systems. A very important thing is that SQL:2011 did not introduce a new data type to support temporal data, rather it introduced the period.

A period is a table attribute and it's defined by two table columns of date type representing start time and end time respectively. It is defined as follows:

  • A period must have a name
  • The end time must be greater than the start time
  • It is a closed-open period model. The start time is included in the period and the end time is excluded

The SQL:2011 standard recognizes two dimensions of temporal data support:

  • Valid or application time tables
  • Transaction or system time tables

Application-time period tables are intended for meeting the requirements of applications that capture time periods during which the data is believed to be valid in the real world. A typical example of such an application is an insurance application, where it is necessary to keep track of the specific policy details of a given customer that are in effect at any given point in time.

System-versioned tables are intended for meeting the requirements of applications that must maintain an accurate history of data changes either for business reasons, legal reasons, or both. A typical example of such an application is a banking application, where it is necessary to keep previous states of customer account information so that customers can be provided with a detailed history of their accounts. There are also plenty of examples where certain institutions are required by law to preserve historical data for a specified length of time to meet regulatory and compliance requirements.

Bitemporal tables are tables that implement both application-time and system-versioned time support.

After the standard was published, many vendors came up with the temporal table implementation:

  • IDM DB2 10 added full support for temporal tables (for both application time and system-versioned).
  • Oracle implemented a feature called the Flashback Data Archive. It automatically tracks all changes made to data in a database and maintains an archive of historical data. Oracle 12c introduced valid time temporal support.
  • PostgreSQL doesn't support temporal tables natively, but temporal tables approximate them
  • Teradata implements both valid time and transaction time table types based on the TSQL2 specification.
  • All these implementations most probably affected Microsoft's decision to implement temporal tables in SQL Server 2016.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset