2. A Taxonomy of Bi-Temporal Data Management Methods
Contents
Taxonomies
Partitioned Semantic Trees
Jointly Exhaustive
Mutually Exclusive
A Taxonomy of Methods for Managing Temporal Data
The Root Node of the Taxonomy
Queryable Temporal Data: Events and States
State Temporal Data: Uni-Temporal and Bi-Temporal Data
Glossary References
In Chapter 1, we presented an historical account of various ways that temporal data has been managed with computers. In this chapter, we will develop a taxonomy, and situate those methods described in Chapter 1, as well as several variations on them, in this taxonomy.
A taxonomy is a special kind of hierarchy. It is a hierarchy which is a partitioning of the instances of its highest-level node into different kinds, types or classes of things. While an historical approach tells us how things came to be, and how they evolved over time, a taxonomic approach tells us what kinds of things we have come up with, and what their similarities and differences are. In both cases, i.e. in the previous chapter and in this one, the purpose is to provide the background for our later discussions of temporal data management and, in particular, of how Asserted Versioning supports non-temporal, uni-temporal and bi-temporal data by means of physical bi-temporal tables. 1
1Because Asserted Versioning directly manages bi-temporal tables, and supports uni-temporal tables as views on bi-temporal tables, we sometimes refer to it as a method of bi-temporal data management and at other times refer to it as a method of temporal data management. The difference in terminology, then, reflects simply a difference in emphasis which may vary depending on context.
Taxonomies
Originally, the word “taxonomy” referred to a method of classification used in biology, and introduced into that science in the 18th century by Carl Linnaeus. Taxonomy in biology began as a system of classification based on morphological similarities and differences among groups of living things. But with the modern synthesis of Darwinian evolutionary theory, Mendelian genetics, and the Watson–Crick discovery of the molecular basis of life and its foundations in the chemistry of DNA, biological taxonomy has, for the most part, become a system of classification based on common genetic ancestry.
Partitioned Semantic Trees
As borrowed by computer scientists, the term “taxonomy” refers to a partitioned semantic tree. A tree structure is a hierarchy, which is a set of non-looping (acyclic) one-to-many relationships. In each relationship, the item on the “one” side is called the parent item in the relationship, and the one or more items on the “many” side are called the child items. The items that are related are often called nodes of the hierarchy. Continuing the arboreal metaphor, a tree consists of one root node (usually shown at the top of the structure, and not, as the metaphor would lead one to expect, at the bottom), zero or more branch nodes, and zero or more leaf nodes on each branch. This terminology is illustrated in Figure 2.1.
Figure 2.1 An Illustrative Taxonomy.
Tree structure. Each taxonomy is a hierarchy. Therefore, except for the root node, every node has exactly one parent node. Unless the hierarchy consists of the root node alone, every node except the leaf nodes has at least one child node. Each node except the root node has as its ancestors all the nodes from its direct parent node, in linear ascent from child to parent, up to and including the root node. No node can be a parent to any of its ancestors.
Partitioned. The set of child nodes under a given parent node are jointly exhaustive and mutually exclusive. Being jointly exhaustive means that every instance of a parent node is also an instance of one of its child nodes. Being mutually exclusive means that no instance of a parent node is an instance of more than one of its child nodes. A corollary is that every instance of the root node is also an instance of one and only one leaf node.
Semantic. The relationships between nodes are often called links. The links between nodes, and between instances and nodes, are based on the meaning of those nodes. Conventionally, node-to-node relationships are called KIND-OF links, because each child node can be said to be a kind of its parent node. In our illustrative taxonomy, shown in Figure 2.1, for example, Supplier is a kind of Organization.
A leaf node, and only a leaf node, can be the direct parent of an instance. Instances are individual things of the type indicated by that node. The relationship between individuals and the (leaf and non-leaf) nodes they are instances of is called an IS-A relationship, because each instance is an instance of its node. Our company may have a supplier, let us suppose, called the Acme Company. In our illustrative taxonomy shown in Figure 2.1, therefore, Acme is a direct instance of a Supplier, and indirectly an instance of an Organization and of a Party. In ordinary conversation, we usually drop the “instance of” phrase, and would say simply that Acme is a supplier, an organization and a party.
Among IT professionals, taxonomies have been used in data models for many years. They are the exclusive subtype hierarchies defined in logical data models, and in the (single-inheritance) class hierarchies defined in object-oriented models. An example familiar to most data modelers is the entity Party. Under it are the two entities Person and Organization. The business rule for this two-level hierarchy is: every party is either a person or an organization (but not both). This hierarchy could be extended for as many levels as are useful for a specific modeling requirement. For example, Organization might be partitioned into Supplier, Self and Customer. This particular taxonomy is shown in Figure 2.1.
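To make this concrete, here is a rough sketch of how such an exclusive subtype hierarchy is often implemented in SQL, with each subtype table sharing the primary key of its supertype. The table and column names here are ours, invented for illustration; they are not taken from any particular model.

    CREATE TABLE Party (
        party_id   INTEGER NOT NULL PRIMARY KEY,
        party_type CHAR(1) NOT NULL  -- 'P' = Person, 'O' = Organization
    );
    CREATE TABLE Person (
        party_id   INTEGER NOT NULL PRIMARY KEY REFERENCES Party (party_id),
        last_name  VARCHAR(64)
    );
    CREATE TABLE Organization (
        party_id   INTEGER NOT NULL PRIMARY KEY REFERENCES Party (party_id),
        org_name   VARCHAR(64)
    );
    CREATE TABLE Supplier (
        party_id   INTEGER NOT NULL PRIMARY KEY REFERENCES Organization (party_id)
    );
    CREATE TABLE Self (
        party_id   INTEGER NOT NULL PRIMARY KEY REFERENCES Organization (party_id)
    );
    CREATE TABLE Customer (
        party_id   INTEGER NOT NULL PRIMARY KEY REFERENCES Organization (party_id)
    );

Nothing in this DDL, of course, enforces the jointly exhaustive and mutually exclusive rules; that is exactly the point the next several paragraphs develop.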
Most data modelers, on the assumption that this taxonomy would be implemented as a subtype hierarchy in a logical data model, will recognize right away that it is not a very good taxonomy. For one thing, it says that persons are not customers. But many companies do sell their goods or services to people; so for them, this is a bad taxonomy. Either the label “customer” is being used in a specialized (and misleading) way, or else the taxonomy is simply wrong.
A related mistake is that, for most companies, Supplier, Self and Customer are not mutually exclusive. For example, many companies sell their goods or services to other companies who are also suppliers to them. If this is the case, then this hierarchy is not a taxonomy, because an instance—a company that is both a supplier and a customer—belongs to more than one leaf node. As a data modeling subtype hierarchy, it is a non-exclusive hierarchy, not an exclusive one.
This specific taxonomy has nothing to do with temporal data management; but it does give us an opportunity to make an important point that most data modelers will understand. That point is that even very bad data models can be, and often are, put into production. And when that happens, the price that is paid is confusion: confusion about what the entities of the model really represent, about where data about something of interest can be found within the database, about what sums and averages over a given entity really mean, and so on.
In this case, for example, some organizations may be represented by a row in only one of these three tables, but other organizations may be represented by rows in two or even in all three of them. Queries which extract statistics from this hierarchy must now be written very carefully, to avoid the possibility of double- or triple-counting organizational metrics.
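For example, a simple count of the organizations the company deals with has to collapse those duplicates explicitly. A sketch, written against the illustrative tables above:

    SELECT COUNT(*) AS organization_count
    FROM ( SELECT party_id FROM Supplier
           UNION                          -- UNION, not UNION ALL: an organization that is
           SELECT party_id FROM Customer  -- both a supplier and a customer is counted once
           UNION
           SELECT party_id FROM Self
         ) AS org;

A count over a UNION ALL of the three tables, by contrast, silently double- or triple-counts any organization that appears in more than one of them.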
As well as all this, the company may quite reasonably want to keep a row in the Customer table for every customer, whether it be an organization or a person. This requires an even more confusing use of the taxonomy, because while an organization might be represented multiple times in this taxonomy, at least it is still possible to find additional information about organizational customers in the parent node. But this is not possible when those customers are persons.
So the data modeler will want to modify the hierarchy so that persons can be included as customers. There are various ways to do this, but if the hierarchy is already populated and in use, none of them are likely to be implemented. The cost is just too high. Queries and code, and the result sets and reports based on them, have already been written, and are already in production. If the hierarchy is modified, all those queries and all that code will have to be modified. The path of least resistance is an unfortunate one. It is to leave the whole mess alone, and to rely on the end users to understand that mess as they interpret their reports and query results, and as they write their own queries.
Experienced data modelers may recognize that what is wrong with this taxonomy is that it mixes types and roles. This distinction is often called the distinction between exclusive and non-exclusive subtypes, but data modelers are also familiar with roles, and non-exclusive subtypes are how roles are implemented in a data model. In a hierarchy of roles, things can play multiple roles concurrently; but in a hierarchy of types, each thing is one type of thing, not several types.
Jointly Exhaustive
It is important that the child nodes directly under a parent node are jointly exhaustive. If they aren't, then there can be instances of a node in the hierarchy that are not also instances of any of its immediate child nodes, for example an organization, in Figure 2.1, that is neither a supplier, nor the company itself, nor a customer. This makes that particular node a confusing object. Some of its instances can be found as instances of a node directly underneath it, but others of its instances cannot.
So suppose we have an instance of the latter type. Is it really such an instance? Or is it a mistake? Is an organization that appears under none of the three subtypes really a supplier, for example, one for which we simply forgot to add a row in the Supplier table? Or is it some kind of organization that simply doesn't fit any of the three subtypes of Organization? If we don't have and enforce the jointly exhaustive rule, we don't know. And it will take time and effort to find out. But if we had that rule, then we would know right away that any such situations are errors, and we could move immediately to correct them (and the code that let them through).
For example, consider again our taxonomy containing the three leaf nodes of Supplier, Self and Customer. This set of organization subtypes is based on the mental image of work as transforming material of less value, obtained from suppliers, into products of greater value, which are then sold to customers. The price paid by the customer, less the cost of materials, overhead and labor, is the profit made by the company.
Is this set of three leaf nodes exhaustive? It depends on how we choose to interpret that set of nodes. For example, what about a regulatory agency that monitors volatile organic compounds which manufacturing plants emit into the atmosphere? Is this monitoring agency a supplier or a customer? The most likely way to “shoe-horn” regulatory agencies into this three-part breakdown of Organization is to treat them as suppliers of regulatory services. But it is somewhat unintuitive, and therefore potentially misleading, to call a regulatory agency a supplier. Business users who rely on a weekly report which counts suppliers for them and computes various per-supplier averages may eventually be surprised to learn that regulatory agencies have been counted as suppliers in those reports for as long as those reports have been run.
Perhaps we should represent regulatory agencies as direct instances of Organization, and not of any of Organization's child nodes. But in that case we have transformed a branch node into a confusing hybrid—a node which is both a branch and a leaf. In either case, the result is unsatisfactory. Business users of the data organized under this hierarchy will very likely misinterpret at least some of their report and query results, especially those less experienced users who haven't yet fully realized how messy this data really is.
Good taxonomies aren't like this. Good taxonomies don't push the problems created by messy data onto the users of that data. Good taxonomies are partitioned semantic trees.
Mutually Exclusive
It is also important for the child nodes directly under a parent node to be mutually exclusive. If they aren't mutually exclusive, then there can be instances of a node in the hierarchy that are also instances of two or more of its immediate child nodes. For example, consider a manufacturing company made up of several dozen plants, these plants being organizations, of course. There might be a plant which receives semi-finished product from another plant and, after working on it, sends it on to a third plant to be finished, packaged and shipped. Is this plant a supplier, a self organization, or a customer? It seems that it is all three. Correctly accounting for costs and revenues, in a situation like this, may prove to be quite complex.
Needless to say, this makes that organizational hierarchy difficult to manage, and its data difficult to understand. Some of its instances can be found as instances of just one node directly under a parent node, but others of its instances can be found as instances of more than one such node.
So suppose we have an instance of the latter type, such as the plant just described. Is it really such an instance? Or is it a mistake, a failure on our part to realize that we inadvertently created multiple child rows to correspond to the one parent row? It will take time and effort to find out, and until we do, we simply aren't sure. Confidence in our data is lessened, and business decisions made on the basis of that data are made knowing that such anomalies exist in the data.
But if we knew that the taxonomy was intended to be a full partitioning, then we would know right away that any such situations are errors. We could monitor the hierarchy of tables and prevent those errors from happening in the first place. We could restore the reliability of the data, and the confidence our users have in it. We could help our company make better business decisions.
Consider another example of violating the mutually exclusive rule. Perhaps when our hierarchy was originally set up, there were no examples of organizations that were instances of more than one of these three categories. But over time, such instances might very well occur, the most common being organizations which begin as suppliers, and then later become customers as well. So when our taxonomy was first created, these three nodes were, at that time, mutually exclusive. The reason we ended up with a taxonomy which violated this rule is that over time, business policy changed. One of our major suppliers wanted to start purchasing products from us; and they were likely to become a major customer. So executive management told IT to accommodate that company as a customer.
By far the easiest way to do this is to relax the mutually exclusive rule for this node of the taxonomy. But to relax the mutually exclusive rule is to change a hierarchy of types into a hierarchy of roles. And since other parts of the hierarchy, supposedly, still reflect the decision to represent types, the result is to mix types and roles in the same hierarchy. It is to make what these nodes of the hierarchy stand for inherently unclear. It is to introduce semantic ambiguity into the basic structures of the database. In this way, over time, as business policy changes, the semantic clarity of data structures such as true taxonomies is lost, and the burden of sorting out the resulting confusion is left to the user.
But after all, what is the alternative? Is it to split off the roles into a separate non-taxonomic hierarchy, and rewrite the taxonomy to preserve the mutually exclusive rule? And then to unload, transform and reload some of the data that originally belonged with the old taxonomy? And then to redirect some queries to the new taxonomy, leave some queries pointed at the original structure which has now become a non-exclusive hierarchy, and duplicate some queries so that one of each pair points at the new non-exclusive hierarchy and the other of the pair points at the new taxonomy, in each case depending on the selection criteria they use? And to train the user community to properly use both the new taxonomy and also the new non-taxonomic hierarchy of non-exclusive child nodes? Any experienced data management professional knows that nothing of the sort is going to happen.
As long as the cost of pushing semantic burdens onto end users is never measured, and seldom even noticed, putting the burden on the user will continue to be the easy way out. “Old hand” users, familiar with the quirks of semantically rococo databases like these, may still be able to extract high-quality information from them. They know which numbers to trust, on which reports, and which numbers to be suspicious of. They know which screens have the most accurate demographic data, and which the most accurate current balances. Less experienced users, on the other hand, inevitably obtain lower-quality information from those same databases. They don't know where the semantic skeletons are hidden. They can tell good data from not so good data about as well as a Florida orange grower can tell a healthy reindeer from one with brucellosis.
And so the gap between the quality of information obtained when an experienced user queries a database, and the quality of information obtained when an average or novice user poses what is supposedly the same question to the database, increases over time. Eventually, the experienced user retires. The understanding of the database which she has acquired over the years retires with her. The same reports are run. The same SQL populates the same screens. But the understanding of the business formed on the basis of the data on those reports and screens is sadly diminished.
The taxonomy we will develop in this chapter is a partitioned semantic hierarchy. In general, any reasonably rich subject matter admits of any number of taxonomies. So the taxonomy described here is not the only taxonomy possible for comparing and contrasting different ways of managing temporal data. It is a taxonomy designed to lead us through a range of possible ways of managing temporal data, and to end up with Asserted Versioning as a leaf node. The contrasts that are drawn at each level of the taxonomy are not the only possible contrasts that would lead to Asserted Versioning. They are just the contrasts which we think best bring out what is both unique and valuable about Asserted Versioning.
A Taxonomy of Methods for Managing Temporal Data
In terms of granularity, temporal data can be managed at the level of databases, or tables within databases, or rows within tables, or even columns within rows. And at each of these levels, we could be managing non-temporal, uni-temporal or bi-temporal data. Of course, with two organizing principles—four levels of granularity, and the non/uni/bi distinction—the result would be a matrix rather than a hierarchy. In this case, it would be a matrix of 12 cells. Indeed, in places in Chapter 1, this alternative organization of temporal data management methods seems to peek out from between the lines. However, we believe that the taxonomy we are about to develop will bring out the similarities and differences among various methods of managing temporal data better than that matrix; and so, from this point forward, we will focus on the taxonomy.
The Root Node of the Taxonomy
The root node of a taxonomy defines the scope and limits of what the taxonomy is about. Our root node says that our taxonomy is about methods for managing temporal data. Temporal data is data about, not just how things are right now, but also about how things used to be and how things will become or might become, and also about what we said things were like and when we said it. Our full taxonomy for temporal data management is shown in Figure 2.2.
Figure 2.2 A Taxonomy of Temporal Data Management Methods.
The two nodes which partition temporal data management are reconstructable data and queryable data. Reconstructable data is the node under which we classify all methods of managing temporal data that require manipulation of the data before it can be queried. Queryable data is the node under which we classify all methods that impose no such requirement: the data can be queried just as it sits.
Reconstructable Temporal Data
In Chapter 1, we said that the combination of backup files and logfiles permits us to reconstruct the state of a database as of any point in time. That is the only reconstructable method of managing temporal data that we discussed in that chapter. With that method, we retrieve data about the past by restoring a backup copy of that data and, if necessary, applying logfile transactions from that point in time forward to the point in time we are interested in.
But the defining feature of reconstructable methods is not the movement of data from off-line to on-line storage. The defining feature is the inability of users to access the data until it has been manipulated and transformed in some way. For this reason, among all these temporal data management methods, reconstructable temporal data takes the longest to get to, and has the highest cost of access.
Besides the time and effort involved in preparing the data for querying—either through direct queries or via various tools which generate queries from graphical or other user directives—many queries or reports against reconstructed data are modified from production queries or reports. Production queries or reports point to production databases and production tables; and so before they are used to access reconstructed data, they must be rewritten to point to that reconstructed data. This rewrite of production queries and reports may involve changing database names, and sometimes table names and even column names. Sometimes, a query that accessed a single table in the production database must be modified to join, or even to union, multiple tables when pointed at reconstructed data.
Queryable Temporal Data
Queryable temporal data, in contrast, is data which can be directly queried, without the need to first transform that data in some way. In fact, the principal reason for the success of data warehousing is that it transformed reconstructable historical data into queryable historical data.
Queryable data is obviously less costly to access than reconstructable data, in terms of several different kinds of costs. The most obvious one, as indicated previously, is the cost of the man-hours of time on the part of IT Operations personnel, and perhaps software developers and DBAs as well. Another cost is the opportunity cost of waiting for the data, and the decisions delayed until the data becomes available. In an increasingly fast-paced business world, the opportunity cost of delays in accessing data is increasingly significant.
But in our experience, which combines several decades of work in business IT, the greatest cost is the cost of the business community learning to do without the data they need. In many cases, it simply never crosses their minds to ask for temporal data that isn't already directly queryable. The core of the problem is that satisfying these requests is not the part of the work of computer operators, DBAs and developers that they get evaluated on. If performance reviews, raises, bonuses and promotions depend on meeting other criteria, then it is those other criteria that will be met. Doing a favor for a business user you like, which is what satisfying this kind of one-off request often amounts to, takes a decidedly second place. To paraphrase Samuel Johnson, “The imminent prospect of being passed over for a promotion wonderfully focuses the mind”. 2
2The form in which we knew this quotation is exactly as it is written above, with the word “death” substituted for “being passed over for a promotion”. But in fact, as reported in Boswell's Life of Johnson, what Johnson actually said was: “Depend upon it, sir, when a man knows he is to be hanged in a fortnight, it concentrates his mind wonderfully.” The criteria for annual bonuses do the same thing.
Queryable Temporal Data: Events and States
Having distinguished queryable data from reconstructable data, we move on to a partitioning of the former. We think that the most important distinction among methods of managing queryable data is the distinction between data about things and data about events. Things are what exist; events are what happen. Things are what change; events are the occasions on which they change.
The issue here is change, and the best way to keep track of it. One way is to keep a history of things, of the states that objects take on. As an object changes from one state to the next, we store the before-image of the current state and update a copy of that state, not the original. The update represents the new current state.
Another way to keep track of change is to record the initial state of an object and then keep a history of the events in which the object changed. For example, with insurance policies, we could keep an event-based history of changes to policies by adding a row to the Policy table each time a new policy is created, and after that maintaining a transaction table in which each transaction is an update or delete to the policy. The relevance of transactions to event-based temporal data management is this: transactions are the records of events, the footprints which events leave on the sands of time. 3
3In this book, and in IT in general, transaction has two uses. The first designates a row of data that represents an event. For example, a customer purchase is an event, represented by a row in a sales table; the receipt of a shipment is an event, represented by a row in a receipts table. In this sense, transactions are what are collected in the fact tables of fact-dimension data marts. The second designates any insert, update or delete applied to a database. For example, it is an insert transaction that creates a new customer record, an update transaction that changes a customer's name, and a delete transaction that removes a customer from the database. In general, context will make it clear which sense of the word “transaction” is being used.
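A minimal sketch of this event-based approach for the insurance example might look like the following. The tables and columns are hypothetical; the point is simply that the Policy table holds each policy's initial state, while the event table accumulates the transactions that changed it.

    CREATE TABLE Policy (
        policy_nbr  CHAR(8)      NOT NULL PRIMARY KEY,
        policy_type CHAR(3)      NOT NULL,
        copay_amt   DECIMAL(7,2),
        created_ts  TIMESTAMP    NOT NULL   -- when the policy was created
    );
    CREATE TABLE Policy_Event (
        policy_nbr  CHAR(8)      NOT NULL REFERENCES Policy (policy_nbr),
        event_ts    TIMESTAMP    NOT NULL,  -- when the change event occurred
        event_type  CHAR(1)      NOT NULL,  -- 'U' = update, 'D' = delete
        copay_amt   DECIMAL(7,2),           -- the new value carried by an update event
        PRIMARY KEY (policy_nbr, event_ts)
    );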
Event Temporal Data
Methods for managing event data are most appropriately used to manage changes to metric values of relationships among persistent objects, values such as counts, quantities and amounts. Persistent objects are the things that change, things like policies, clients, organizational structures, and so on. As persistent objects, they have three important features: (i) they exist over time; (ii) they can change over time; and (iii) each is distinguishable from other objects of the same type. In addition, they should be recognizable as the same object when we encounter them at different times (although sometimes the quality of our data is not good enough to guarantee this).
Events are the occasions on which changes happen to persistent objects. As events, they have two important features: (i) they occur at a point in time, or sometimes last for a limited period of time; and (ii) in either case, they do not change. An event happens, and then it's over. Once it's over, that's it; it is frozen in time.
For example, the receipt of a shipment of product alters the on-hand balance of that product at a store. The completion of an MBA degree alters the level of education of an employee. The assignment of an employee to the position of store manager alters the relationship between the employee and the company. Of course, the transactions which record these events may have been written up incorrectly. In that case, adjustments to the data must be made. But those adjustments do not reflect changes in the original events; they just correct mistakes made in recording them.
A Special Relationship: Balance Tables
The event transactions that most businesses are interested in are those that affect relationships that have quantitative measures. A payment is received. This is an event, and a transaction records it. It alters the relationship between the payer and payee by the amount of the payment. That relationship is recorded, for example, in a revolving credit balance, or perhaps in a traditional accounts receivable balance. The payment is recorded as a credit, and the balance due is decreased by that amount.
These records are called balance records because they reflect the changing state of the relationship between the two parties, as if purchases and payments are added to opposite trays of an old-fashioned scale which then tilts back and forth. Each change is triggered by an event and recorded as a transaction, and the net effect of all the transactions, applied to a beginning balance, gives the current balance of the relationship.
But it isn't just the current balance that is valuable information. The transactions themselves are important because they tell us how the current balance got to be what it is. They tell us about the events that account for the balance. In doing so, they support the ability to drill down into the foundations of those balances, to understand how the current state of the relationship came about. They also support the ability to re-create the balance as of any point in time between the starting balance and the current balance by going back to the starting balance and applying transactions, in chronological sequence, up to the desired point.
We no longer need to go back to archives and logfiles, and write one-off code to get to the point in time we are interested in—as we once needed to do quite frequently. Conceptually, starting balances, and the collections of transactions against them, are like single-table backups and their logfiles, respectively, brought on-line. Organized into the structures developed by Dr. Ralph Kimball, they are fact/dimension data marts.
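A sketch of that as-of re-creation, using hypothetical balance and transaction tables, and assuming one starting-balance row per account: the stored starting balance, plus the sum of all the transactions between that balance and the desired point in time.

    SELECT b.acct_nbr,
           b.start_balance + COALESCE(SUM(t.txn_amt), 0) AS balance_as_of
    FROM   Acct_Balance b
    LEFT JOIN Acct_Txn t
           ON  t.acct_nbr = b.acct_nbr
           AND t.txn_ts   > b.balance_ts                     -- after the starting balance
           AND t.txn_ts  <= TIMESTAMP '2009-06-30 23:59:59'  -- the desired (example) point in time
    WHERE  b.acct_nbr = '12345678'
    GROUP BY b.acct_nbr, b.start_balance;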
Of course, balances aren't the only kind of relationship. For example, a Customer to Salesperson cross-reference table—an associative table, in relational terms—represents a relationship between customers and salespersons. This table, among other things, tells us which salespersons a business has assigned to which customers. This table is updated with transactions, but those transactions themselves are not important enough to keep on-line. If we want to keep track of changes to this kind of relationship, we will likely choose to keep a chronological history of states, not of events. A history table of that associative relationship is one way we might keep that chronological history of states.
To summarize: businesses are all about ongoing relationships. Those relationships are affected by events, which are recorded as transactions. Financial account tables are balance tables; each account number uniquely identifies a particular relationship, and the metrical properties of that account tell us the current state of the relationship.
The standard implementation of event time, as we mentioned earlier, is the data mart and the fact/dimension, star or snowflake structures that it uses.
State Temporal Data
Event data, as we have seen, is not the best way of tracking changes to non-metric relationships. It is also not ideal for managing changes to non-metric properties of persistent objects, such as customer names or bill of material hierarchies. Who ever heard of a data mart with customers or bill of material hierarchies as the fact tables? For such relationships and such objects, state-based history is the preferred option. One reason is that, for persistent objects, we are usually more interested in what state they are in at a given point in time than in what changes they have undergone. If we want to know about changes to the status of an insurance policy, for example, we can always reconstruct a history of those changes from the series of states of the policy. With balances, and their rapidly changing metrics, on the other hand, we generally are at least as interested in how they changed over time as we are in what state they are currently in.
So we conclude that, except for keeping track of metric properties of relationships, the best queryable method of managing temporal data about persistent objects is to keep track of the succession of states through which the objects pass. When managing time using state data, what we record are not transactions, but rather the results of transactions, the rows resulting from inserts and (logical) deletes, and the rows representing both a before- and an after-image of every update.
State data describes those things that can have states, which means those things that can change over time. An event, like a withdrawal from a bank account, as we have already pointed out, can't change. Events don't do that. But the customer who owns that account can change. The branch the account is with can change. Balances can also change over time, but as we have just pointed out, it is usually more efficient to keep track of balances by means of periodic snapshots of beginning balances, and then an accumulation of all the transactions from that point forward.
But from a logical point of view, event data and state data are interchangeable. No temporal information is preserved with one method that cannot be preserved with the other. We have these two methods simply because an event data approach is preferable for describing metric-based relationships, while a state data approach is better at tracking changes to persistent objects and to relationships other than metric balances.
State Temporal Data: Uni-Temporal and Bi-Temporal Data
At this point in our discussion, we are concerned with state data rather than with event data, and with state data that is queryable rather than state data that needs to be reconstructed. What then are the various options for managing temporal queryable state data?
First of all, we need to recognize that there are two kinds of states to manage. One is the state of the things we are interested in, the states those things pass through as they change over time. But there is another kind of state, that being the state of the data itself. Data, such as rows in tables, can be in one of two states: correct or incorrect. (As we will see in Chapter 12, it can also be in a third state, one in which it is neither correct nor incorrect.) Version tables and assertion tables record, respectively, the state of objects and the state of our data about those objects.
Uni-Temporal State Data
In a conventional Customer table, each row represents the current state of a customer. Each time the state of a customer changes, i.e. each time a row is updated, the old data is overwritten with the new data. By adding one (or sometimes two) date(s) or timestamp(s) to the primary key of the table, it becomes a uni-temporal table. But since we already know that there are two different temporal dimensions that can be associated with data, we know to ask “What kind of uni-temporal table?”
As we saw in the Preface, there are uni-temporal version tables and uni-temporal assertion tables. Version tables keep track of changes that happen in the real world, changes to the objects represented in those tables. Each change is recorded as a new version of an object. Assertion tables keep track of corrections we have made to data we later discovered to be in error. Each correction is recorded as a new assertion about the object. The versions make up a true history of what happened to those objects. The assertions make up a virtual logfile of corrections to the data in the table.
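A rough sketch of a uni-temporal version table, with column names of our own choosing: the effective begin date is part of the primary key, so one customer can have many versions, each recording a state of that customer over some period of time.

    CREATE TABLE Customer_Version (
        customer_nbr    CHAR(8)     NOT NULL,
        eff_begin_dt    DATE        NOT NULL,  -- when this version of the customer took effect
        eff_end_dt      DATE        NOT NULL,  -- open-ended versions carry a "highest date" here
        customer_name   VARCHAR(64),
        customer_status CHAR(2),
        PRIMARY KEY (customer_nbr, eff_begin_dt)
    );

A uni-temporal assertion table would look structurally identical; the difference is that its pair of dates would record when we began and stopped asserting each row to be correct, not when the state of the customer began and ended.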
Usually, when table-level temporal data is discussed, the tables turn out to be version tables, not assertion tables. In their book describing the alternative temporal model [2002, Date, Darwen, Lorentzos], the authors focus on uni-temporal versioned data. Bi-temporality is not even alluded to until the penultimate chapter, at which point it is suggested that “logged time history” tables be used to manage the other temporal dimension. Since bi-temporality receives only a passing mention in that book, we choose to classify the alternative temporal model as a uni-temporal model.
In IT best practices for managing temporal data—which we will discuss in detail in Chapter 4—once again the temporal tables are version tables, and error correction is an issue that is mostly left to take care of itself. 4 For the most part, it does so by overwriting incorrect data. 5 This is why we classify IT best practices as uni-temporal models.
4Lacking criteria to distinguish the best from the rest, the term “best practices” has come to mean little more than “standard practices”. What we call “best practices”, and which we discuss in Chapter 4, are standard practices we have seen used by many of our clients.
5An even worse solution is to mix up versions and assertions by creating a new row, with a begin date of Now(), both every time there is a real change, and also every time there is an error in the data to correct. When that happens, we no longer have a history of the changes things went through, because we cannot distinguish versions from corrections. And we no longer have a “virtual logfile” of corrections because we don't know how far back the corrections should actually have taken effect.
The Alternative Temporal Model
What we call the alternative temporal model was developed by Chris Date, Hugh Darwen and Dr. Nikos Lorentzos in their book Temporal Data and the Relational Model (Morgan-Kaufmann, 2002). 6 This model is based in large part on techniques developed by Dr. Lorentzos to manage temporal data by breaking temporal durations down into temporally atomic components, applying various transformations to those components, and then re-assembling the components back into those temporal durations—a technique, as the authors note, whose applicability is not restricted to temporal data.
6The word “model”, as used here and also in the phrases “alternative model” and “Asserted Versioning model” obviously doesn't refer to a data model of specific subject matter. It means something like theory, but with an emphasis on its applicability to real-world problems. So “the relational model”, as we use the term, for example, means something like “relational theory as implemented in current relational technology”.
As we said, except for the penultimate chapter in that book, the entire book is a discussion of uni-temporal versioned tables. In that chapter, the authors recommend that if there is a requirement to keep track of the assertion time history of a table (which they call “logged-time history”), it be implemented by means of an auxiliary table which is maintained by the DBMS.
In addition, these authors do not attempt, in their book, to explain how this method of managing temporal data would work with current relational technology. Like much of the computer science research on temporal data, they allude to SQL operators and other constructs that do not yet exist, and so their book is in large part a recommendation to the standards committees to adopt the changes to the SQL language which they describe.
Because our own concern is with how to implement temporal concepts with today's technologies, and also with how to support both kinds of uni-temporal data, as well as fully bi-temporal data, we will have little more to say about the alternative temporal model in this book.
Best Practices
Over several decades, a best practice has emerged in managing temporal queryable state data. It is to manage this kind of data by versioning otherwise conventional tables. The result is versioned tables which, logically speaking, are tables which combine the history tables and current tables described previously. Past, present and future states of customers, for example, are kept in one and the same Customer table. Corrections may or may not be flagged; but if they are not, it will be impossible to distinguish versions created because something about a customer changed from versions created because past customer data was entered incorrectly. On the other hand, if they are flagged, the management and use of these flags will quickly become difficult and confusing.
There are many variations on the theme of versioning, which we have grouped into four major categories. We will discuss them in Chapter 4.
The IT community has always used the term “version” for this kind of uni-temporal data. And this terminology seems to reflect an awareness of an important concept that, as we shall see, is central to the Asserted Versioning approach to temporal data. For the term “version” naturally raises the question “A version of what?”, to which our answer is “A version of anything that can persist and change over time”. This is the concept of a persistent object, and it is, most fundamentally, what Asserted Versioning is about.
Bi-Temporal State Data
We now come to our second option, which is to manage both versions and assertions and, most importantly, their interdependencies. This is bi-temporal data management, the subject of both Dr. Rick Snodgrass's book [2000, Snodgrass] and of our book.
The Standard Temporal Model
What we call the standard temporal model was developed by Dr. Rick Snodgrass in his book Developing Time-Oriented Database Applications in SQL (Morgan-Kaufmann, 2000). Based on the computer science work current at that time, and especially on the work Dr. Snodgrass and others had done on the TSQL (temporal SQL) proposal to the SQL standards committees, it shows how to implement both uni-temporal and bi-temporal data management using then-current DBMSs and then-current SQL.
We emphasize that, as we are writing, Dr. Snodgrass's book is a decade old. We use it as our baseline view of computer science work on bi-temporal data because most of the computer science literature exists in the form of articles in scientific journals that are not readily accessible to many IT professionals. We also emphasize that Dr. Snodgrass did not write that book as a compendium of computer science research for an IT audience. Instead, he wrote it as a description of how some of that research could be adapted to provide a means of managing bi-temporal data with the SQL and the DBMSs available at that time.
One of the greatest strengths of the standard model is that it discusses and illustrates both the maintenance and the querying of temporal data at the level of SQL statements. For example, it shows us the kind of code that is needed to apply the temporal analogues of entity integrity and referential integrity to temporal data. And for any readers who might think that temporal data management is just a small step beyond the versioning they are already familiar with, many of the constraint-checking SQL statements shown in Dr. Snodgrass's book should suffice to disabuse them of that notion.
The Asserted Versioning Temporal Model
What we call the Asserted Versioning temporal model is our own approach to managing temporal data. Like the standard model, it attempts to manage temporal data with current technology and current SQL.
The Asserted Versioning model of uni-temporal and bi-temporal data management supports all of the functionality of the standard model. In addition, it extends the standard model's notion of transaction time by permitting data to be physically added to a table prior to the time when that data will appear in the table as production data, available for use. This is done by means of deferred transactions, which result in deferred assertions, those being the inserted, updated or logically deleted rows resulting from those transactions. 7 Deferred assertions, although physically co-located in the same tables as other data, will not be immediately available to normal queries. But once time in the real world reaches the beginning of their assertion periods, they will, by that very fact, become currently asserted data, part of the production data that makes up the database as it is perceived by its users.
7The term “deferred transaction” was suggested by Dr. Snodgrass during a series of email exchanges which the authors had with him in the summer of 2008.
We emphasize that deferred assertions are not the same thing as rows describing what things will be like at some time in the future. Those latter rows are current claims about what things will be like in the future. They are ontologically post-dated. Deferred assertions are rows describing what things were, are, or will be like, but rows which we are not yet willing to claim make true statements. They are epistemologically post-dated.
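To make the two temporal dimensions concrete, here is a rough sketch of an asserted version table. The column names and the primary key are ours, for illustration only; the actual Asserted Versioning schema is developed in later chapters. Note that a row whose assertion begin date is still in the future is a deferred assertion: it is physically present, but not yet part of the production data.

    CREATE TABLE Policy_AV (
        policy_nbr   CHAR(8)      NOT NULL,
        eff_begin_dt DATE         NOT NULL,  -- effective (version) time: when this state begins ...
        eff_end_dt   DATE         NOT NULL,  -- ... and when it ends
        asr_begin_dt DATE         NOT NULL,  -- assertion time: when we begin claiming the row is true ...
        asr_end_dt   DATE         NOT NULL,  -- ... and when we stop (or a "highest date" if we never have)
        policy_type  CHAR(3),
        copay_amt    DECIMAL(7,2),
        PRIMARY KEY (policy_nbr, eff_begin_dt, asr_begin_dt)
    );

    -- Only rows that are both currently asserted and currently in effect are what
    -- a query about "the policy as it is right now" would see:
    SELECT *
    FROM   Policy_AV
    WHERE  policy_nbr = 'P1234567'
    AND    CURRENT_DATE >= asr_begin_dt AND CURRENT_DATE < asr_end_dt
    AND    CURRENT_DATE >= eff_begin_dt AND CURRENT_DATE < eff_end_dt;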
Another way that Asserted Versioning differs from the standard temporal model is in the encapsulation and simplification of integrity constraints. The encapsulation of integrity constraints is made possible by distinguishing temporal transactions from physical transactions. Temporal transactions are the ones that users write. The corresponding physical transactions are what the DBMS applies to asserted version tables. The Asserted Versioning Framework (AVF) uses an API to accept temporal transactions. Once it validates them, the AVF translates each temporal transaction into one or more physical transactions. By means of triggers generated from a combination of a logical data model together with supplementary metadata, the AVF enforces temporal semantic constraints as it submits physical transactions to the DBMS.
The simplification of these integrity constraints is made possible by introducing the concept of an episode. With non-temporal tables, a row representing an object can be inserted into that table at some point in time, and later deleted from the table. After it is deleted, of course, that table no longer contains the information that the row was ever present. Corresponding to the period of time during which that row existed in that non-temporal table, there would be an episode in an asserted version table, consisting of one or more temporally contiguous rows for the same object. So an episode of an object in an asserted version table is in effect during exactly the period of time that a row for that object would exist in a non-temporal table. And just as a deletion in a conventional table can sometime later be followed by the insertion of a new row with the same primary key, the termination of an episode in an asserted version table can sometime later be followed by the insertion of a new episode for the same object.
In a non-temporal table, each row must conform to entity integrity and referential integrity constraints. In an asserted version table, each version must conform to temporal entity integrity and temporal referential integrity constraints. As we will see, the parallels are in more than name only. Temporal entity integrity really is entity integrity applied to temporal data. Temporal referential integrity really is referential integrity applied to temporal data.
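As one small illustration of what the temporal analogue of entity integrity involves, here is a sketch, against the hypothetical table above, of a query that detects two currently asserted versions of the same policy whose effective periods overlap, which is exactly what temporal entity integrity forbids:

    SELECT a.policy_nbr
    FROM   Policy_AV a
    JOIN   Policy_AV b
      ON   a.policy_nbr = b.policy_nbr
    WHERE  NOT (a.eff_begin_dt = b.eff_begin_dt AND a.asr_begin_dt = b.asr_begin_dt)
                                                     -- exclude a row paired with itself
    AND    CURRENT_DATE >= a.asr_begin_dt AND CURRENT_DATE < a.asr_end_dt
    AND    CURRENT_DATE >= b.asr_begin_dt AND CURRENT_DATE < b.asr_end_dt
    AND    a.eff_begin_dt < b.eff_end_dt             -- the standard interval-overlap test
    AND    b.eff_begin_dt < a.eff_end_dt;

Each violating pair appears twice in the result, once in each order; in Asserted Versioning, it is the AVF's validation of temporal transactions that prevents such rows from being created in the first place.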
Glossary References
Glossary entries whose definitions form strong inter-dependencies are grouped together in the following list. The same glossary entries may be grouped together in different ways at the end of different chapters, each grouping reflecting the semantic perspective of each chapter. There will usually be several other, and often many other, glossary entries that are not included in the list, and we recommend that the Glossary be consulted whenever an unfamiliar term is encountered.
as-is
as-was
Asserted Versioning
Asserted Versioning Framework (AVF)
episode
persistent object
state
thing
physical transaction
temporal transaction
temporal entity integrity (TEI)
temporal referential integrity (TRI)
the alternative temporal model
the Asserted Versioning temporal model
the standard temporal model