Chapter 14.1

Data Models Across the End-State Architecture

Abstract

One of the essentials of data architecture is an overall blueprint of the architecture. The composite architecture, which combines all of the data, contains different layers of data: interactive data, archival data, integrated data, and other forms of data. The composite data architecture recognizes the life cycle of data within the corporation. Another view of the architecture is how corporate data interface with big data.

Keywords

Architectural blueprint; The composite architecture; The lifecycle of data; Big data; Textual disambiguation; Repetitive data; Nonrepetitive data

There are different kinds of data models that are found throughout the end-state architecture. The data models provide an “intellectual road map” as to what data are to be found in the end-state architecture. The value of an intellectual road map can be appreciated by imagining a road trip across the United States. Suppose you set out from the East Coast. You drive to places you have never been before—New Mexico, the Grand Canyon, Yellowstone, Santa Fe, Denver, and other places. How do you navigate from one location to the next? You use a road map. The road map tells you where you are right now and how to get to where you are going next.

The data models of the end-state architecture provide the same function. They tell you what to expect to find and how to get to where you will find something else.

The Different Data Models

The different types of data models found in the end-state architecture are shown in Fig. 14.1.1.

Fig. 14.1.1 Data models across the architecture.

The end-state architecture data models include the following:

  • The application functional decomposition and data flow diagram
  • The corporate data model
  • Taxonomies for text
  • The dimensional data model for data marts
  • The selective subdivision of the data lake

Each of these data models and their relation to each other will be discussed.

Functional Decomposition and Data Flow Diagrams

In the world of applications, there are the functional decomposition and the data flow diagram.

Fig. 14.1.2 depicts these constructs.

Fig. 14.1.2 The application environment.

The functional decomposition is the depiction of the functions that will be achieved by a system. The functional decomposition is laid out in a hierarchical fashion. At the top of the decomposition is the general function of what is to be accomplished by the system. At the second level are the main functions of what is to be accomplished. Then, each second-level function is broken down into its subfunctions, until the point of basic functionality is reached.
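
To make the idea concrete, a functional decomposition can be sketched as a simple hierarchy. The sketch below is purely hypothetical; the function names are illustrative and do not come from any particular system.

    # A hypothetical functional decomposition, laid out as a hierarchy:
    # the general function at the top, main functions at the second level,
    # and basic functions at the leaves.
    functional_decomposition = {
        "process customer orders": {
            "capture order": {
                "validate customer": {},
                "validate items ordered": {},
            },
            "fulfill order": {
                "allocate inventory": {},
                "schedule shipment": {},
            },
            "bill customer": {
                "calculate invoice": {},
                "post to accounts receivable": {},
            },
        }
    }

    def print_decomposition(node, depth=0):
        # Walk the hierarchy from the general function down to the basic functions.
        for function, subfunctions in node.items():
            print("  " * depth + function)
            print_decomposition(subfunctions, depth + 1)

    print_decomposition(functional_decomposition)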

The functional decomposition is useful for seeing what the different activities of a system will be. It is useful for organizing the functions, identifying overlap, and checking to see if anything is left out. When you are setting out on a long trip, it is useful to look at a map of the United States to see which states you will visit and the order in which you will travel through them.

After the functional decomposition is completed, the next step is to create data flow diagrams for each of the functions. The data flow diagram starts with the input to the module and shows how the input data will be processed to achieve the output data. The three major components of a data flow diagram are an identification of the input, a description of the logic that will occur in the module, and a description of the output.
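
Expressed in code, a single module of a data flow diagram pairs an identified input, the logic applied inside the module, and the output that results. The following is a minimal, hypothetical sketch; the field names and the pricing logic are assumptions made for illustration only.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class DataFlowNode:
        name: str
        inputs: list[str]              # identification of the input
        logic: Callable[[dict], dict]  # the processing that occurs in the module
        outputs: list[str]             # description of the output

    def compute_order_total(order: dict) -> dict:
        # Hypothetical module logic: derive the order total from the line items.
        total = sum(item["price"] * item["qty"] for item in order["line_items"])
        return {"order_total": total}

    price_order = DataFlowNode(
        name="price order",
        inputs=["line_items"],
        logic=compute_order_total,
        outputs=["order_total"],
    )

    print(price_order.logic({"line_items": [{"price": 10.0, "qty": 3}]}))  # {'order_total': 30.0}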

If the functional decomposition is like a map of the United States, the data flow diagram is like a detailed map of a state. The data flow diagram tells you how to get across Texas. You start at El Paso, you head east, past McKittrick Canyon, go to Van Horn and Sierra Blanca, go through Pecos, then on to Midland and Odessa, and so forth. The map of Texas shows details that the map of the United States cannot show. By the same token, the map of Texas does not show you how to get from Los Angeles to San Jose or from Chicago to Naperville.

The nature of functional decomposition and data flow diagrams is such that process and data are intimately intertwined. Both process and data are needed in order to build a functional decomposition and data flow diagrams.

Fig. 14.1.3 shows the tight interrelationship of data and process in the functional decomposition.

Fig. 14.1.3 Process and data are in lock step.

Functional decompositions and data flow diagrams are used to define and build applications. As a rule, these constructs can be very complex. One of the tools used to manage the complexity is the definition of the scope of development. At the very beginning, there is an exercise in which the scope of the application is defined. The scope definition is necessary in order to keep the size of the development effort reasonable. If a designer is not careful, the scope becomes so large that the system will never be built. Therefore, it is necessary to define the scope rigorously before the development effort ever begins.

The result of defining the scope is that, over time, the organization ends up with multiple applications, each of which has its own functional decomposition and data flow diagrams.

Fig. 14.1.4 shows that over time, each application has its own set of definitions.

Fig. 14.1.4 Each application has its own functional decomposition and set of data flow diagrams.

While the development process that has been described is normal for almost every shop, there is a problem. Over time, a serious amount of overlap between different applications starts to emerge. Because the scope of each application must be defined and enforced rigorously, the same or similar functionality starts to appear across multiple applications. When this happens, redundant data start to appear. The same or similar data element appears in multiple applications.

The Corporate Data Model

When redundant data start to appear, the very integrity of the data comes into question. It is because of this method of developing systems and the inevitable lack of integrity of data across applications that there is recognition of the need for corporate data, not application data.

Fig. 14.1.5 shows the corporate data model.

Fig. 14.1.5 The corporate data model.

The corporate data model applies to and is useful for everyone at the company. The different organizations that make use of the corporate data model are shown in Fig. 14.1.6.

Fig. 14.1.6 The corporate data model represents all the corporation.

At first glance, the thought of a corporate data model may seem overwhelming. However, the good news is that most corporate data models do not have to be built from scratch. Consider the fact that among companies in the same industry, there is a high degree of commonality among data models. The data model for one bank will be very similar to the data models for other banks. The data model for a public utility will be very similar to the data models for other public utilities. The data model for a manufacturer will be very similar to the data models for other manufacturers, and so forth.

Because of the great similarity of data models within the same industry, there are what are called generic data models. It is easy enough and inexpensive enough to simply buy a generic data model and to customize that data model for a particular company.

Further simplifying the matter is the fact that the data model is built for only the primitive data in the corporation. Summarized, aggregated, or derived data do not belong in the data model.
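
A small, hypothetical illustration of that rule follows: the granular transaction attributes belong in the corporate data model, while a summarized figure such as a monthly total is calculated from them rather than modeled.

    # Only primitive, granular data belong in the corporate data model.
    # The attributes and values below are hypothetical.
    corporate_model = {
        "transaction": ["transaction_id", "account_id", "transaction_date", "amount"],
    }

    transactions = [
        {"transaction_id": 1, "account_id": "A-100", "transaction_date": "2024-01-05", "amount": 250.00},
        {"transaction_id": 2, "account_id": "A-100", "transaction_date": "2024-01-18", "amount": 125.50},
    ]

    # A summarized value such as a monthly total is derived when it is needed;
    # it is not an attribute of the corporate data model.
    monthly_total = sum(t["amount"] for t in transactions)
    print(monthly_total)  # 375.5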

Fig. 14.1.7 shows that only granular data belong in the corporate data model.

Fig. 14.1.7 The corporate data model represents granular data only.

The corporate data model represents the single version of the truth for data in the corporation. Corporate data are the place where everyone turns when they need a reliable, accurate answer.

One of the challenges is the fact that corporate data are usually fed by application data. And application data are decidedly not the single version of the truth in the corporation.

For this reason, the interface between the application data model and the corporate data model is important and needs to be carefully defined. The interface between the application data model and the corporate data model defines an important transformation of data. Once the interface has been rigorously defined, it is easy enough for the programmer to write a program to accomplish the transformation.
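
A minimal sketch of such a transformation, assuming hypothetical application field names and corporate attribute names, might look like the following. The point is only that once the interface is specified, the program that carries it out is straightforward.

    # Hypothetical interface definition: application field -> corporate attribute.
    APPLICATION_TO_CORPORATE = {
        "cust_no": "customer_id",
        "cust_nm": "customer_name",
        "tel":     "telephone_number",
    }

    def to_corporate(application_record: dict) -> dict:
        # Apply the defined interface to transform one application record
        # into the layout of the corporate data model.
        return {
            corporate_attr: application_record.get(app_field)
            for app_field, corporate_attr in APPLICATION_TO_CORPORATE.items()
        }

    print(to_corporate({"cust_no": "C-9144", "cust_nm": "Acme Ltd", "tel": "555-0100"}))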

Fig. 14.1.8 shows that multiple application models connect to the corporate data model.

Fig. 14.1.8 Each application has its own interface to the corporate data model.

There are many uses for the corporate data model. But the primary use of the model is to form the basis of database design for the data warehouse.

Fig. 14.1.9 shows that the corporate data model is the basic specification of the data warehouse.

Fig. 14.1.9 The corporate data model forms the basis for the design of the data warehouse.

The Star Join/Dimensional Data Model

Another type of data model found in the end-state architecture is the dimensional model. The dimensional model consists of a fact table and multiple connected dimensions. The result is what is termed a “star join.”
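
In code, a star join can be pictured as a central fact table whose rows are resolved against the surrounding dimension tables through their keys. The table names, keys, and values below are hypothetical.

    # A hypothetical star join: one fact table surrounded by dimension tables,
    # resolved through the dimension keys carried in each fact row.
    fact_sales = [
        {"date_key": 20240105, "product_key": 7, "store_key": 3, "units_sold": 4, "revenue": 80.0},
    ]
    dim_product = {7: {"product_name": "widget", "category": "hardware"}}
    dim_store = {3: {"store_name": "Downtown", "region": "West"}}

    def star_join(fact_rows, product_dim, store_dim):
        # Attach the dimension attributes to each fact row.
        for row in fact_rows:
            yield {**row, **product_dim[row["product_key"]], **store_dim[row["store_key"]]}

    for joined_row in star_join(fact_sales, dim_product, dim_store):
        print(joined_row)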

Fig. 14.1.10 depicts a star join.

Fig. 14.1.10 A star join.

The star join reflects the needs of the different departments that will be using the data it serves. Stated differently, there will be a star join for marketing, a different star join for sales, another star join for finance, and so forth.

The reason why there will be a different star join is that the different departments look at data differently. The star join for a department reflects the customized view of data for the department.

Fig. 14.1.11 shows that there are different star joins for each department.

Fig. 14.1.11 The star join and data marts.

The source of data for the star join is the corporate data model. Even though the star join reflects a particular customized view of data, the source of the data is uniform. The source of the data is still the single source of truth for the corporation.

It is noteworthy that it is possible to build a data mart whose source of data is not the data warehouse. While such a structure can be built, it is outside the boundaries of the end-state architecture. Building a data mart whose source of data is not the data warehouse is like violating the zoning codes of a city. You could build a hovel next to a large office building. But if you do, you will have a poorly planned city. And there are a whole host of other problems that come with having a poorly planned city.

Fig. 14.1.12 shows that the star join environment is fed from the corporate data model.

Fig. 14.1.12 The corporate data model forms the basis for the design of the different, customized data marts.

Taxonomies/Ontologies

Another important form of a data model is the taxonomy. The taxonomy is the form of a data model used to shape and manage text. Text is free form. When an author sits down to write a document, the author can compose the document however he/she wishes. Writing—for the most part—is free form.

The data models that fit elsewhere in the end-state architecture simply do not fit text. A wholly different approach is needed for examining and using text in the decision-making infrastructure.

Fig. 14.1.13 depicts the taxonomy that is used to integrate text into the end-state architecture.

Fig. 14.1.13 Taxonomies—based on text.

The taxonomy—strictly speaking—can take the form of a taxonomy or an ontology. In its simplest form, the taxonomy is merely a collection of classifications. Each classification consists of a category and a list of words that populate that category. The categories found in the taxonomy reflect the viewpoint of the author.
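
Written down, the simplest form of a taxonomy is nothing more than named categories and the words that populate them. The categories and words below are hypothetical examples.

    # A hypothetical taxonomy in its simplest form: each classification is a
    # category together with the list of words that populate it.
    taxonomy = {
        "vehicle": ["car", "truck", "van", "motorcycle"],
        "currency": ["dollar", "euro", "peso", "yen"],
        "state": ["texas", "colorado", "new mexico"],
    }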

Fig. 14.1.14 shows that taxonomies are made up of categories and words.

Fig. 14.1.14 Categories and words.

The taxonomies used to understand a document are relevant to the business being discussed in the document. The taxonomies are used to determine the context of the words that are being written. In fairness, there is a lot more to understanding context than merely using a taxonomy on a document. However, the taxonomy is the starting point.

Once the taxonomy is used and once contextualization is done, the text is turned into a database. In essence, the text found in the document being processed is “normalized.”
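
A crude sketch of that normalization, reusing the hypothetical taxonomy sketched above, scans the raw text and emits one row for each recognized word, tagged with its category and its position in the document. Real textual disambiguation does considerably more than this, but the sketch shows the basic transformation from free-form text to a database.

    def normalize_text(doc_id, text, taxonomy):
        # Turn free-form text into rows of (document, word, category, position).
        word_to_category = {w: c for c, words in taxonomy.items() for w in words}
        rows = []
        for position, token in enumerate(text.lower().split()):
            word = token.strip(".,;:!?")
            if word in word_to_category:
                rows.append({"doc_id": doc_id, "word": word,
                             "category": word_to_category[word], "position": position})
        return rows

    sample = "A truck and a van were sold in Texas for one dollar."
    for row in normalize_text("doc-001", sample, taxonomy):
        print(row)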

Fig. 14.1.15 shows that a database is created from the document using taxonomies.

Fig. 14.1.15 Taxonomies—used to shape the data base into which the text is normalized.

The categories found in the taxonomy are dependent on the writer of the document. The designer uses the appropriate set of taxonomies to relate the document to the data that are going to be analyzed.

In actuality, there is an almost infinite number of taxonomies. The analyst chooses the taxonomies that are most appropriate to the data that will go into the database.

It is very normal for there to be overlap between words found in different taxonomies.

The categories of the taxonomy are ROUGHLY equivalent to the entities found in the corporate data model. It should be noted that the correlation between the categories of the taxonomy and the entities of the corporate data model is NOT an exact match. There can be many differences between the two, so the correlation is an imperfect one.

Nevertheless, there is a ROUGH approximation between the two types of elements.

Fig. 14.1.16 shows the rough approximation between the two types of data model.

Fig. 14.1.16 Categories and entities.

The Selective Subdivision of Data

The final form of data modeling found in the end-state architecture is the selective subdivision of data found in the data lake. It can be argued that the selective subdivision of data in the data lake is not a data model at all. Indeed, the selective subdivision of data in the data lake amounts to nothing more than the organization of data according to certain characteristics of the data. There may be an archival subdivision of data; a litigation support subdivision of data; an extended, bulk data warehouse subdivision of data; and so forth in the data lake.
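
One way to picture the selective subdivision is as a simple placement rule: a data set is routed to a subdivision of the data lake according to its characteristics, without its content being changed. The subdivision names and routing criteria below are hypothetical.

    # A hypothetical placement rule for the selective subdivision of the data lake.
    # The subdivision names and the routing criteria are illustrative only.
    def place_in_data_lake(data_set: dict) -> str:
        if data_set.get("under_litigation_hold"):
            return "/lake/litigation_support/"
        if data_set.get("age_in_years", 0) > 7:
            return "/lake/archival/"
        if data_set.get("bulk_warehouse_extension"):
            return "/lake/extended_warehouse/"
        return "/lake/general/"

    print(place_in_data_lake({"name": "claims_2009", "age_in_years": 15}))  # /lake/archival/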

Fig. 14.1.17 shows the selective subdivision of data in the data lake.

Fig. 14.1.17 Subdividing the data lake.

The selective subdivision of data really does not affect the content or design of data in the data lake. Instead, the selective subdivision merely influences the placement of data in the data lake.

When it comes to the shaping of data found in the data lake, the single largest factor is the corporate data model, as seen in Fig. 14.1.18.

Fig. 14.1.18 The basic shape of the data model.

The preceding discussion has included all the forms of data modeling found in the end-state architecture. The functional decomposition and the data flow diagrams apply to applications. The taxonomy/ontology applies to text. The corporate data model applies to the data warehouse. The dimensional model applies to data marts. And the selective subdivision of data applies to the data lake.

Each of these forms of data modeling has its own idiosyncrasies. Each form of data modeling has a certain similarity to the other forms. And each form of data modeling is required in order to build an effective end-state architecture.

Proactive/Reactive Data Models

One of the interesting features of the end-state architecture is the ability of the analyst to traverse from one form of data modeling to another. In other words, when an analyst is working on the corporate data model, the analyst can go look at the data flow diagrams. Or when an analyst is working on a taxonomy, the analyst can look at the corporate data model. Or when an analyst is assigning the selective subdivision of data, the analyst can go look at the functional decomposition.

The ability to traverse the network of information formed by the different forms of data modeling in the end-state architecture is a very important feature. By being able to traverse the network formed by the different forms of data modeling, the analyst can find and examine the lineage of the data. By examining the lineage of data, the analyst can understand such things as the following:

  • Where did the data come from?
  • What data were chosen?
  • What data were not chosen?
  • What calculations were made on the data?
  • When were the calculations applied?

In a word, the ability to traverse the network of the different forms of data models is one of the more important features of the end-state architecture.
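
A minimal sketch of that network treats each data model element as a node that records where its data came from, so that lineage questions become a walk backward through the graph. The element names below are hypothetical.

    # A hypothetical metadata network: each element records the elements
    # from which its data were derived.
    lineage = {
        "marketing data mart.monthly_revenue": ["data warehouse.sale.amount"],
        "data warehouse.sale.amount": ["order entry application.order.total_price"],
        "order entry application.order.total_price": [],
    }

    def trace_lineage(element, graph):
        # Walk backward through the metadata network to find where the data came from.
        path = []
        to_visit = [element]
        while to_visit:
            current = to_visit.pop()
            path.append(current)
            to_visit.extend(graph.get(current, []))
        return path

    for step in trace_lineage("marketing data mart.monthly_revenue", lineage):
        print(step)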

Fig. 14.1.19 shows the network.

Fig. 14.1.19 The role of metadata.

There is one important difference between the different forms of data models that must be noted. That difference is that some forms of data models shape the data, while other forms of data models are shaped by the data. Stated differently, some forms of data modeling are proactive and cause the data to be shaped after the model. Other forms of data modeling are reactive and are shaped by the data.

The corporate data model, functional decomposition and data flow diagrams, and the dimensional data model are proactive. The taxonomy/ontology data model and the selective subdivision of data are reactive.

Fig. 14.1.20 shows this property of the different forms of data modeling in the end-state architecture.

Fig. 14.1.20 Fundamental differences in models.