Chapter 2.1

The End-State Architecture—The “World Map”

Abstract

The end-state architecture is the ultimate destination toward which a corporation's data architecture evolves. It encompasses the totality of data in the corporation: personal data, textual data, big data, transaction data, and so forth. An organization arrives at the end-state architecture through evolution. That evolution occurs over time and for many reasons, all of which interact with one another.

Keywords

End state architecture; Textual ETL; Data warehouse; Data vault; ODS; Data mart; Archival; Bulk data warehouse; Data lake; Data pond; Landing zone; Metadata; Data models

There is a data architecture to which corporations are evolving. That data architecture can be called the “end state” data architecture or the “world map” of data.

Architectural Components

Fig. 2.1.1 depicts the “end state” data architecture.

Fig. 2.1.1 The end state architecture. Copyright Bill Inmon, 2018.

The different components of the end-state data architecture are as follows:

  • Text—the text that belongs to the corporation that is worthy of inclusion into the end state
  • Textual ETL—the process that transforms text into a standard database format
  • The data warehouse—the place where the corporate single version of the truth resides
  • The data vault—the component of the data warehouse where rigorous data governance can be done
  • Data marts—the place where individual departments have their customized analytic data
  • Applications—the operational applications where day-to-day transactions are run
  • ETL—the process by which application data are transformed into corporate data
  • ODS—operational data store—a hybrid structure where integrated data can be quickly accessed online
  • The archival facility—the process by which older data are removed from active analysis
  • The refine process—the process by which bulk data are processed and entered into active analysis
  • Bulk data warehouse—the data warehouse where the single version of the truth data with a low probability of access is stored
  • The bulk data vault—the data vault where data with a low probability of access are stored where rigorous data governance can be done
  • The data lake—the place where very large volumes of data are stored
  • The bulk data mart—the data mart built to manage large volumes of data
  • The data pond—the place where selective subsets of data are stored
  • Automated generation of data—the mechanism by which large volumes of data are generated
  • The landing zone—the place where large volumes of data are first touched by the system and are available for processing
  • Data lake transformation—the process by which large volumes of data are edited and manipulated

Each of these components is defined and discussed throughout this book. Each component has its own properties and its own distinct value.

Different Kinds of Data in the End State Architecture

There are many ways to understand the end-state architecture. One of the easiest ways to understand the architecture is to examine the different kinds of data that are found in different places.

Fig. 2.1.2 describes some of the different kinds of data found throughout the architecture.

Fig. 2.1.2 Different kinds of data throughout the end state architecture. Copyright Bill Inmon, 2018.

Text can be either spoken or written. Spoken text can be transformed by voice-to-text transcription. Written text, if it is not already in electronic form, can be captured and transformed by optical character recognition (OCR). However the text originates, it is prepared into the form of electronic text.

Transaction data are data that have been captured as the by-product of the execution of a transaction. There are many kinds of transactions. There are bank teller transactions, ATM transactions, airline reservations, retail purchases, credit card activity, inventory management transactions, payment ledger transactions, and many more. These transactions are usually run by applications. As a rule, applications are developed and built in a “siloed” fashion. This means that when one application is built, it does not take into consideration the other applications with which it must interact. Corporations end up with a whole collection of applications, each one of which acts independently. The result is unintegrated application data.

Corporate data are data that have entered the system and then have been transformed into an integrated corporate state. The transformation moves the data from their application-oriented form into a data warehouse, where the data are integrated into a corporate state. As a simple example of corporate integration, application A has gender as male/female, application B has gender designated as x/y, and application C has gender designated as 1/0. The corporate standard for the designation of gender is m/f. The application data are converted to that standard as they are moved from the application into the data warehouse.
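The gender example can be sketched in a few lines of Python. This is only an illustration, not a description of any particular ETL tool: the names `GENDER_MAPS` and `to_corporate_gender` are hypothetical, and the text does not say which of x/y or 1/0 stands for which gender, so the pairings for applications B and C are assumptions.

```python
# Hypothetical per-application code tables. The text gives only the code pairs
# (male/female, x/y, 1/0); which code means which gender in B and C is assumed.
GENDER_MAPS = {
    "A": {"male": "m", "female": "f"},
    "B": {"x": "m", "y": "f"},
    "C": {"1": "m", "0": "f"},
}


def to_corporate_gender(application, raw_value):
    """Convert an application-specific gender code to the corporate m/f standard."""
    mapping = GENDER_MAPS[application]
    key = str(raw_value).strip().lower()
    if key not in mapping:
        # Refuse to guess: unmapped codes should be reviewed, not silently loaded.
        raise ValueError(f"unmapped gender code {raw_value!r} from application {application}")
    return mapping[key]


# Three records, one from each application, converted on their way to the warehouse.
records = [("A", "Female"), ("B", "x"), ("C", 0)]
standardized = [to_corporate_gender(app, value) for app, value in records]
print(standardized)
```

Raising an error on an unmapped code is a design choice made for this sketch; a real conversion step might instead route such records to a reject file for review.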

The data marts contain data that are customized for the different groups that will be analytically using the data. Typically, there are data marts for marketing, sales, finance, and others. The source of data for the data marts is the data warehouse.

The data lake contains a variety of data. Some of the data found in the data lake are archival data. Other data in the data lake are simply bulk data. And it is possible to build a bulk data warehouse in the data lake. In addition, the bulk data warehouse may contain a bulk data vault. The bulk data warehouse is the single version of the truth for bulk amounts of data.

The data ponds are the subsets of the data lake that are set aside for different purposes. There may be an archival data pond, a litigation support data pond, a general purpose data pond, a manufacturing data pond, an analog data pond, and so forth.

Shaping the Data Through Models

Each of the different types of data in the end-state architecture is shaped by a different type of data model, and each kind of data model is suited to its own environment. Wherever data are shaped in the architecture, the data model serves as the intellectual paradigm for the building of applications, data warehouses, data marts, and so on.

Fig. 2.1.3 shows the different kinds of data models that are found in the end-state architecture.

Fig. 2.1.3 Different data modelling techniques throughout the end state architecture. Copyright Bill Inmon, 2018.

Applications are typically shaped by functional decompositions and data flow diagrams. The data found in text are shaped by taxonomies. The data warehouse is shaped by the corporate data model, usually consisting of an entity relationship diagram (ERD), a data item set (DIS), and a physical model. The data marts are shaped by the dimensional model, consisting of star joins, fact tables, and dimensions. The data vault is shaped by the data vault data model.

The data lake is shaped by the selective subdivision of data.
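As one illustration of the models named above, the dimensional model that shapes a data mart can be sketched with Python's built-in `sqlite3` module. The table names (`fact_deposit`, `dim_customer`, `dim_month`), the columns, and the sample rows are all hypothetical; the sketch only shows the general shape of a fact table joined to its dimensions in a star join.

```python
import sqlite3

# An in-memory database stands in for a departmental data mart.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE dim_month (month_id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE fact_deposit (
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        month_id INTEGER REFERENCES dim_month(month_id),
        amount REAL
    );
    INSERT INTO dim_customer VALUES (1, 'retail'), (2, 'business');
    INSERT INTO dim_month VALUES (1, '2018-01'), (2, '2018-02');
    INSERT INTO fact_deposit VALUES
        (1, 1, 500.0), (1, 2, 700.0), (2, 1, 4000.0), (2, 2, 3500.0);
""")

# The star join: the fact table in the center, joined to each dimension.
star_join = """
    SELECT c.segment, m.label, SUM(f.amount)
    FROM fact_deposit AS f
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    JOIN dim_month AS m ON m.month_id = f.month_id
    GROUP BY c.segment, m.label
    ORDER BY c.segment, m.label
"""
rows = conn.execute(star_join).fetchall()
for row in rows:
    print(row)
```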

Where Is the Data Warehouse?

One of the important questions that quickly arises is whether there are two data warehouses: a standard data warehouse and a bulk data warehouse. Fig. 2.1.4 outlines this question.

Fig. 2.1.4 A physical data warehouse and a logical data warehouse. Copyright Bill Inmon, 2018.

The answer to that question is a little less than straightforward. From a physical standpoint, there are indeed two data warehouses—a standard data warehouse and a bulk data warehouse. But from a logical standpoint, there is only one data warehouse. The physical possibilities for a data warehouse are the following:

  • A standard data warehouse
  • A bulk data warehouse
  • A standard data warehouse and a bulk data warehouse

The confusion arises when a data warehouse is built inside a data lake, as is certainly a possibility. The data lake resides on physically different technology (i.e., big data) than the standard data warehouse (which typically resides on relational technology).

However, even though there are physically two different data warehouses, there should never be any overlap of data between the standard data warehouse and the bulk data warehouse. Therefore, there is logically one data warehouse that is physically implemented over two environments.

There are several advantages to this “duplexed” approach. One advantage is that the data warehouse can grow to any size. Another advantage is that the data warehouse infrastructure cost is minimized. Both of these advantages are quite attractive to most organizations.
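The idea of one logical data warehouse implemented over two physical environments can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the class name `LogicalWarehouse`, the use of plain dictionaries in place of relational and big data technology, and the age-based cutoff for moving records. The text does not prescribe any particular mechanism.

```python
from datetime import date


class LogicalWarehouse:
    """One logical warehouse over two physical stores (plain dicts in this sketch)."""

    def __init__(self):
        self.standard = {}  # stand-in for the standard (relational) data warehouse
        self.bulk = {}      # stand-in for the bulk data warehouse in the data lake

    def load(self, key, record_date, payload):
        # A key may live in exactly one physical store, never both.
        if key in self.standard or key in self.bulk:
            raise KeyError(f"{key!r} already exists in the logical warehouse")
        self.standard[key] = (record_date, payload)

    def move_aged_to_bulk(self, cutoff):
        """Move (not copy) records older than the cutoff, so the stores never overlap."""
        aged = [k for k, (d, _) in self.standard.items() if d < cutoff]
        for k in aged:
            self.bulk[k] = self.standard.pop(k)
        return len(aged)

    def get(self, key):
        # Callers see one warehouse; the physical location is hidden from them.
        if key in self.standard:
            return self.standard[key]
        return self.bulk[key]

    def overlap(self):
        return set(self.standard) & set(self.bulk)


wh = LogicalWarehouse()
wh.load("chk-1001", date(2008, 3, 14), {"amount": 250.0})
wh.load("chk-2002", date(2018, 1, 5), {"amount": 90.0})
moved = wh.move_aged_to_bulk(cutoff=date(2013, 1, 1))
print(moved, wh.overlap(), wh.get("chk-1001"))
```

Because records are moved rather than copied, the set returned by `overlap()` stays empty, which is the property the "duplexed" approach depends on.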

Where Different Types of Questions Are Answered Across the End State Architecture

Yet another way to understand the end-state architecture is to look at the different types of questions that are answered in different places in the end-state architecture.

Fig. 2.1.5 shows the possibilities.

Fig. 2.1.5 Different information across the end state architecture.

The raw text is captured and analyzed when someone asks the question—“can I get a loan?” The question itself becomes the basic data that go into the database.

Operational transaction questions relate to specific instances and values of data. When you say “what is my account balance?” you want to know exactly how much money you have in your account right now. You want the correct answer, and you want the answer very quickly.

Now, suppose you want to know your average monthly account balance for the past 5 years. Those data will not be in the online application database. Instead, you need to look for specific data over time. The place to find those data is the data warehouse.
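The difference between a current-value question and a question over time can be shown with a small Python sketch. The month-end snapshots, the `YYYY-MM` month format, and the function name `average_monthly_balance` are hypothetical; they simply assume that the data warehouse keeps one balance per month for an account.

```python
from statistics import mean

# Hypothetical month-end balance snapshots, as a data warehouse might hold them.
# An operational application would typically hold only the current balance.
snapshots = [
    ("2017-11", 1200.0),
    ("2017-12", 1500.0),
    ("2018-01", 900.0),
]


def average_monthly_balance(history, first_month, last_month):
    """Average the month-end balances that fall within the requested period."""
    # YYYY-MM strings sort chronologically, so plain string comparison works here.
    selected = [balance for month, balance in history
                if first_month <= month <= last_month]
    return mean(selected)


average = average_monthly_balance(snapshots, "2017-11", "2018-01")
print(average)
```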

Suppose you want to examine the spending habits of all customers who deposit more than $1000 a month in their account. You need to look at all of those data in order to satisfy a special study. You might look for these data in a data mart. The processing you do here is of an analytic nature.

Now, suppose you are being audited by the IRS. You need to go back 10 years to show that a check was written a decade ago. You would go to your bulk data warehouse in the data lake.

The factors that determine where data are placed include the following:

  • How much data are there?
  • How old are the data?
  • How quickly do the data have to be retrieved?
  • Can the data be updated?

Data in different places have different properties. And those properties affect their usage.
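The four placement factors listed above can be combined into a simple decision rule. The Python sketch below is only an illustration: the `DataProfile` fields, the function name `suggest_placement`, and every threshold in it are assumptions, since the text names the factors but gives no numeric cutoffs.

```python
from dataclasses import dataclass


@dataclass
class DataProfile:
    volume_gb: float              # how much data are there?
    age_years: float              # how old are the data?
    needs_immediate_access: bool  # how quickly do the data have to be retrieved?
    updatable: bool               # can the data be updated?


def suggest_placement(profile):
    """Suggest a component for the data. All thresholds are illustrative assumptions."""
    if profile.updatable and profile.needs_immediate_access:
        return "operational application"
    if profile.age_years > 5 or profile.volume_gb > 10_000:
        return "bulk data warehouse in the data lake"
    return "data warehouse"


current_balance = DataProfile(volume_gb=1, age_years=0,
                              needs_immediate_access=True, updatable=True)
five_year_history = DataProfile(volume_gb=50, age_years=2,
                                needs_immediate_access=False, updatable=False)
decade_old_check = DataProfile(volume_gb=50, age_years=10,
                               needs_immediate_access=False, updatable=False)

for profile in (current_balance, five_year_history, decade_old_check):
    print(suggest_placement(profile))
```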

Data in the Data Lake

There can exist different kinds of data in the data lake. There are several reasons why data are placed in a data lake:

  • The probability of access of the data has dropped significantly.
  • There are such large volumes of data that there is no better place to put them.
  • The data have aged.
  • The usage of the data does not warrant being placed elsewhere.

Accordingly, data are placed in the data lake.

However, just because data have been placed in a data lake does not mean that the data are (or are not) in a data warehouse. It is entirely possible that the data warehouse has been extended into the data lake.

Fig. 2.1.6 shows the data in the data lake.

Fig. 2.1.6 Data in the data lake. Copyright Bill Inmon, 2018.

Metadata in the End State Architecture

There is another important part of the end-state architecture that is not obvious when looking at the architecture itself. That part is the metadata infrastructure that overlays each component of the end-state architecture.

The metadata are descriptive of the data that lie within the end-state architecture. The metadata are useful to designers, programmers, and end users. In a word, anyone who must find their way around the architecture needs to use the metadata.

It is noteworthy that each component has its own metadata and that both the component and the metadata for that component are different from one component to the next. In other words, the metadata for text look different than the metadata for applications, which also are different from the metadata for the data warehouse, and so on.

Fig. 2.1.7 shows the metadata infrastructure associated with the end-state architecture.

Fig. 2.1.7 The metadata infrastructure. Copyright Bill Inmon, 2018.

Networked Metadata

Another feature of the metadata infrastructure is that it is networked. A person looking at one collection of metadata can easily traverse to another collection of metadata. If desired, the analyst can also exchange metadata between one metadata collection and the next.
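Traversing from one metadata collection to another can be sketched as a walk over a small graph. In the Python sketch below, the collection names, the `describes` and `links` fields, and the function `reachable_collections` are all hypothetical; the sketch assumes only that each component keeps its own metadata along with links to neighboring collections.

```python
from collections import deque

# Hypothetical metadata collections, one per component, with links between them.
metadata = {
    "applications": {"describes": ["account_table"], "links": ["data warehouse"]},
    "text": {"describes": ["loan_taxonomy"], "links": ["data warehouse"]},
    "data warehouse": {"describes": ["customer_entity"],
                       "links": ["data marts", "data lake"]},
    "data marts": {"describes": ["finance_star_join"], "links": []},
    "data lake": {"describes": ["archived_checks"], "links": []},
}


def reachable_collections(start):
    """Breadth-first traversal from one metadata collection to every linked one."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        order.append(current)
        for neighbor in metadata[current]["links"]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


path = reachable_collections("applications")
print(path)
```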

Fig. 2.1.8 shows the ability to network the metadata across the architecture.

Fig. 2.1.8 Networked metadata. Copyright Bill Inmon, 2018.

An Evolutionary Experience

Another typical question that arises with the end-state architecture is how it is built. In a word, arriving at the end-state architecture is an evolutionary experience. No one sits down and builds all of the end-state architecture at once. Such an undertaking is too large, too complex, and too expensive. Instead, the end-state architecture grows over time.

Some organizations start building in one direction. Other organizations start building in another direction. Some organizations build part of the end-state architecture and never build other parts.

There is no one “correct” path to the evolution of the architecture.

Fig. 2.1.9 shows that there are many paths to the building of the end-state architecture.

Fig. 2.1.9 The evolving architecture.

The Data Lake Architecture

The architectural component surrounding the data lake deserves a deeper explanation. In front of the data lake is a mechanism for capturing and preparing the data about to enter the data lake from external sources. The primary reasons for such an elaborate ingestion interface are the following:

  • Data arrive so fast that the data lake cannot ingest the data as rapidly as they are generated.
  • There are such large volumes of data that some sort of landing zone is appropriate.
  • Raw editing of data needs to be employed before the data arrive in the data lake. In some cases, data are discarded. In other cases, data are categorized. In yet other cases, data are refurbished before their entry into the data lake.
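The three kinds of raw editing named above (discarding, categorizing, and refurbishing) can be sketched as a single pass over records waiting in the landing zone. The record layout, the `reading` and `source` fields, the threshold of 100, and the function name `raw_edit` are hypothetical and serve only to make the three steps concrete.

```python
def raw_edit(records):
    """Apply discard, refurbish, and categorize steps to raw landing-zone records."""
    accepted = []
    for record in records:
        reading = record.get("reading")
        if reading is None:
            continue  # discard: the record carries nothing usable
        cleaned = dict(record)  # copy so the raw record is left untouched
        # refurbish: normalize the source name (an assumed cleanup rule)
        cleaned["source"] = str(record.get("source", "unknown")).strip().lower()
        # categorize: an assumed threshold separates alerts from routine readings
        cleaned["category"] = "alert" if reading > 100 else "routine"
        accepted.append(cleaned)
    return accepted


landing_zone = [
    {"source": " Sensor-A ", "reading": 120},
    {"source": "sensor-b", "reading": None},
    {"source": "SENSOR-C", "reading": 42},
]
edited = raw_edit(landing_zone)
for record in edited:
    print(record)
```

Only the edited records would then move on into the data lake; the raw records stay in the landing zone until they have been processed.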

Fig. 2.1.10 shows the data lake infrastructure.

Fig. 2.1.10 The data lake infrastructure. Copyright Bill Inmon, 2018.