End-state architecture is the ultimate destination toward which an organization's architecture evolves. The end-state architecture encompasses the TOTALITY of data in the corporation: personal data, textual data, big data, transaction data, and so forth. Organizations do not arrive at the end-state architecture all at once; they evolve into it. This evolution occurs over time and for many reasons, all of which interact with each other over time.
End state architecture; Textual ETL; Data warehouse; Data vault; ODS; Data mart; Archival; Bulk data warehouse; Data lake; Data pond; Landing zone; Metadata; Data models
There is a data architecture to which corporations are evolving. That data architecture can be called the “end state” data architecture or the “world map” of data.
Fig. 2.1.1 depicts the “end state” data architecture.
The different components of the end-state data architecture are as follows:
Each of these components will be defined and discussed throughout this book. Each of these components has its own properties. There is a distinct value to each of these components.
There are many ways to understand the end-state architecture. One of the easiest ways to understand the architecture is to examine the different kinds of data that are found in different places.
Fig. 2.1.2 describes some of the different kinds of data found throughout the architecture.
Text can be either spoken or written. Spoken text can be transformed through voice-to-text transcription. Written text, if it is not already in the form of electronic text, can be captured and transformed by optical character recognition (OCR). However the text originates, it is prepared into the form of electronic text.
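The capture-and-prepare step just described can be sketched as a simple dispatcher. This is a minimal illustration only: the `transcribe` and `ocr` functions are hypothetical placeholders standing in for a real speech-to-text or OCR engine.

```python
def transcribe(audio: bytes) -> str:
    """Hypothetical placeholder for a voice-to-text transcription engine."""
    raise NotImplementedError("plug in a real speech-to-text engine")

def ocr(image: bytes) -> str:
    """Hypothetical placeholder for an optical character recognition engine."""
    raise NotImplementedError("plug in a real OCR engine")

def to_electronic_text(source: object, kind: str) -> str:
    """Route a captured text source into plain electronic text."""
    if kind == "electronic":   # already machine-readable; pass through
        return str(source)
    if kind == "spoken":       # voice-to-text transcription
        return transcribe(source)
    if kind == "written":      # optical character recognition
        return ocr(source)
    raise ValueError(f"unknown source kind: {kind!r}")
```

Whatever the path taken, the output of this step is always the same thing: electronic text ready for further processing.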
Transaction data are data that have been captured as the by-product of the execution of a transaction. There are many kinds of transactions. There are bank teller transactions, ATM transactions, airline reservations, retail purchases, credit card activity, inventory management transactions, payment ledger transactions, and many more. These transactions are usually run by applications. As a rule, applications are developed and built in a “siloed” fashion. This means that when one application is built, it does not take into consideration the other applications with which it must interact. Corporations end up with a whole collection of applications, each one of which acts independently. The result is unintegrated application data.
Corporate data are data that have entered the system and then have been transformed into an integrated corporate state. The transformation moves the data from an application orientation into a data warehouse, where the data are integrated into a corporate state. As a simple example of corporate integration, application A designates gender as male/female, application B designates gender as x/y, and application C designates gender as 1/0. The corporate standard for the designation of gender is m/f. The application data are converted as they are moved from the applications into the data warehouse.
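The gender example above can be sketched as a small conversion table. The three application encodings and the m/f corporate standard come directly from the example; the function and mapping names are illustrative.

```python
# Per-application gender encodings mapped to the corporate standard (m/f).
GENDER_MAPPINGS = {
    "application_a": {"male": "m", "female": "f"},
    "application_b": {"x": "m", "y": "f"},
    "application_c": {"1": "m", "0": "f"},
}

def to_corporate_gender(app: str, value: str) -> str:
    """Convert an application's gender code to the corporate standard
    as the record moves into the data warehouse."""
    return GENDER_MAPPINGS[app][str(value).lower()]
```

In a real transformation, a table like this exists for every attribute whose encoding differs across applications; integration is the sum of many such conversions.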
The data marts contain data that are customized for the different groups that will be analytically using the data. Typically, there are data marts for marketing, sales, finance, and others. The source of data for the data marts is the data warehouse.
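A minimal sketch of sourcing data marts from the warehouse follows: each mart projects out just the columns its group needs. The column and row values are invented for illustration; only the warehouse-to-mart relationship comes from the text.

```python
# Illustrative warehouse rows (invented data).
warehouse_rows = [
    {"customer": "C1", "region": "west", "revenue": 1200, "campaign": "spring"},
    {"customer": "C2", "region": "east", "revenue": 800,  "campaign": "fall"},
]

def build_mart(rows, columns):
    """Project warehouse rows onto the subset of columns a mart needs."""
    return [{c: r[c] for c in columns} for r in rows]

# Each departmental mart is a customized view of the same warehouse data.
marketing_mart = build_mart(warehouse_rows, ["customer", "campaign"])
finance_mart = build_mart(warehouse_rows, ["customer", "revenue"])
```

The key point the sketch preserves is that every mart draws from the single warehouse, so the marts can disagree in shape but not in underlying values.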
The data lake contains a variety of data. Some of the data found in the data lake are archival data. Other data in the data lake are simply bulk data. And it is possible to build a bulk data warehouse in the data lake. In addition, the bulk data warehouse may contain a bulk data vault. The bulk data warehouse is the single version of the truth for bulk amounts of data.
The data ponds are the subsets of the data lake that are set aside for different purposes. There may be an archival data pond, a litigation support data pond, a general purpose data pond, a manufacturing data pond, an analog data pond, and so forth.
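The subdivision of the lake into ponds can be sketched as routing by declared purpose. The pond names follow the examples above; the idea that each object carries a "purpose" tag is an assumption made for illustration.

```python
from collections import defaultdict

def route_to_ponds(lake_objects):
    """Partition data lake objects into ponds keyed by their purpose."""
    ponds = defaultdict(list)
    for obj in lake_objects:
        # Untagged objects fall into the general purpose pond (assumed rule).
        ponds[obj.get("purpose", "general_purpose")].append(obj)
    return ponds

ponds = route_to_ponds([
    {"id": 1, "purpose": "archival"},
    {"id": 2, "purpose": "litigation_support"},
    {"id": 3},  # no declared purpose
])
```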
Each of the different types of data in the end-state architecture is shaped by a different type of data model. Different kinds of data models are suited to the different environments. Wherever data are shaped throughout the architecture, the data model serves as the intellectual paradigm for the building of applications, data warehouses, data marts, etc.
Fig. 2.1.3 shows the different kinds of data models that are found in the end-state architecture.
Applications are typically shaped by functional decompositions and data flow diagrams. The data found in text are shaped by taxonomies. The data warehouse is shaped by the corporate data model, usually consisting of an entity relationship diagram (ERD), a data item set (dis), and a physical model. The data marts are shaped by the dimensional model, consisting of star joins, fact tables, and dimensions. The data vault is shaped by the data vault data model.
The data lake is shaped by the selective subdivision of data.
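Of the models listed above, the dimensional model is the easiest to show concretely: a fact table whose keys resolve against dimension tables in a star join. The table contents below are invented; only the star shape comes from the text.

```python
# Dimension tables (invented rows), keyed by surrogate key.
customer_dim = {1: {"name": "Ann", "segment": "retail"}}
product_dim = {10: {"name": "widget", "category": "hardware"}}

# Fact table: one row per measured event, holding keys and measures.
sales_fact = [
    {"customer_id": 1, "product_id": 10, "amount": 25.0},
]

def star_join(facts, customers, products):
    """Resolve each fact row's dimension keys -- a star join."""
    return [
        {**f,
         "customer": customers[f["customer_id"]]["name"],
         "product": products[f["product_id"]]["name"]}
        for f in facts
    ]
```

The ERD and data vault models used elsewhere in the architecture have different shapes, but the same division of labor applies: the model dictates how the data are physically laid out and joined.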
One of the important questions that quickly arises is whether there are two data warehouses: a standard data warehouse and a bulk data warehouse. Fig. 2.1.4 outlines this question.
The answer to that question is a little less than straightforward. From a physical standpoint, there are indeed two data warehouses—a standard data warehouse and a bulk data warehouse. But from a logical standpoint, there is only one data warehouse. The physical possibilities for a data warehouse are the following:
The confusion arises when a data warehouse is built inside a data lake, as is certainly a possibility. The data lake resides on physically different technology (i.e., big data) than the standard data warehouse (which typically resides on relational technology).
However, even though there are physically two different data warehouses, there should never be any overlap of data from the standard data warehouse to the bulk data warehouse. Therefore, there is logically one data warehouse that is physically implemented over two environments.
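The "logically one, physically two" arrangement can be sketched as a single query interface that routes each request to exactly one physical store. The routing rule here, age of the data, is an assumption chosen for illustration; the text establishes only that the two stores never overlap.

```python
# Assumed cutoff separating the two non-overlapping physical stores.
STANDARD_RETENTION_YEARS = 5

def route_query(years_back: int) -> str:
    """Pick the single physical store that holds data this old."""
    if years_back <= STANDARD_RETENTION_YEARS:
        return "standard_warehouse"      # relational technology
    return "bulk_warehouse_in_lake"      # big-data technology
```

Because every unit of data lives in exactly one of the two stores, the caller sees one logical warehouse regardless of which physical environment answers.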
There are several advantages to this “duplexed” approach. One advantage is that the data warehouse can grow to any size. Another advantage is that the data warehouse infrastructure cost is minimized. Both of these advantages are quite attractive to most organizations.
Yet another way to understand the end-state architecture is to look at the different types of questions that are answered in different places within it.
Fig. 2.1.5 shows the possibilities.
The raw text is captured and analyzed when someone asks a question such as "Can I get a loan?" The question itself becomes the basic data that go into the database.
Operational transaction questions relate to specific instances and values of data. When you say “what is my account balance?” you want to know exactly how much money you have in your account right now. You want the correct answer, and you want the answer very quickly.
Now, suppose you want to know your average monthly account balance for the past 5 years. That data will not be in the online application database. Instead, you need to look for specific data over time. The place to find that data is the data warehouse.
Suppose you want to examine the spending habits of all customers who deposit more than $1000 a month in their account. You need to look at all of that data in order to satisfy a special study. You might look for these data in a data mart. The processing you do here is of an analytic nature.
Now, suppose you are being audited by the IRS. You need to go back 10 years to show that a check was written a decade ago. You would go to your bulk data warehouse in the data lake.
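The four examples above pair each type of question with the component that answers it. A sketch of that pairing as a lookup table follows; the category names are paraphrases of the examples, not a fixed taxonomy.

```python
# Question type -> architecture component, paraphrasing the four examples.
QUESTION_ROUTING = {
    "current_value":        "operational application database",  # balance right now
    "history_over_time":    "data warehouse",                    # 5-year monthly average
    "analytic_study":       "data mart",                         # spending-habit study
    "deep_archival_lookup": "bulk data warehouse in the data lake",  # decade-old check
}

def where_to_ask(question_type: str) -> str:
    """Name the component of the end-state architecture that answers
    this type of question."""
    return QUESTION_ROUTING[question_type]
```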
The factors that determine where data are placed include the following:
Data in different places have different properties. And those properties affect their usage.
There can exist different kinds of data in the data lake. There are several reasons why data are placed in a data lake:
Accordingly, data are placed in the data lake.
However, just because data have been placed in a data lake does not mean that the data are (or are not) in a data warehouse. It is entirely possible that the data warehouse has been extended into the data lake.
Fig. 2.1.6 shows the data in the data lake.
It is not obvious when looking at the end-state architecture, but there is another important part of the architecture that is effectively invisible. That part of the architecture is the metadata infrastructure that overlays each component of the end-state architecture.
The metadata are descriptive of the data that lie within the end-state architecture. The metadata are useful to designers, programmers, and end users. In a word, anyone who must find their way around the architecture needs to use the metadata.
It is noteworthy that each component has its own metadata and that the metadata differ from one component to the next. In other words, the metadata for text look different from the metadata for applications, which in turn look different from the metadata for the data warehouse, and so on.
Fig. 2.1.7 shows the metadata infrastructure associated with the end-state architecture.
Another feature of the metadata infrastructure is that the metadata infrastructure is networked. A person looking at one collection of metadata can easily traverse to another collection of metadata. And—if desired—the analyst can exchange metadata from one metadata collection to the next.
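The networked metadata infrastructure can be sketched as a registry in which each component keeps its own metadata collection and holds links to neighboring collections. The component names follow the architecture; the link structure and fields are assumptions made for illustration.

```python
# Each component's metadata collection, with links to neighboring collections
# (illustrative structure; real metadata would be far richer).
metadata = {
    "data_warehouse": {"describes": "integrated corporate data",
                       "links": ["data_marts", "data_lake"]},
    "data_marts":     {"describes": "departmental analytic data",
                       "links": ["data_warehouse"]},
    "data_lake":      {"describes": "bulk and archival data",
                       "links": ["data_warehouse"]},
}

def traverse(start: str) -> set:
    """Visit every metadata collection reachable from a starting component."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(metadata[node]["links"])
    return seen
```

The traversal is the point: an analyst who starts in any one collection can reach the others, which is what makes the infrastructure a network rather than a set of islands.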
Fig. 2.1.8 shows the ability to network the metadata across the architecture.
Another typical question that arises with the end-state architecture is how it is built. In a word, arriving at the end-state architecture is an evolutionary experience. No one sits down and builds all of the end-state architecture at once. Such an undertaking is too large, too complex, and too expensive. Instead, the end-state architecture grows over time.
Some organizations start building in one direction. Other organizations start building in another direction. Some organizations build part of the end-state architecture and never build other parts.
There is no one “correct” path to the evolution of the architecture.
Fig. 2.1.9 shows that there are many paths to the building of the end-state architecture.
The architecture component surrounding the data lake deserves a deeper explanation. In front of the data lake is a mechanism for capturing and prepping the data about to enter the lake from external sources. The primary reasons why such an elaborate ingestion interface is needed are the following:
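The capture-and-prep mechanism can be sketched as a chain of steps that every external record passes through before entering the lake. The specific steps shown (validate, standardize, tag with metadata) are assumptions for illustration; the text establishes only that an elaborate interface sits in front of the lake.

```python
import datetime

def validate(record: dict) -> dict:
    """Reject records that lack the minimum required fields (assumed rule)."""
    if "id" not in record:
        raise ValueError("record missing id")
    return record

def standardize(record: dict) -> dict:
    """Normalize field names to a common form (here, lowercase keys)."""
    return {k.lower(): v for k, v in record.items()}

def tag(record: dict) -> dict:
    """Attach ingestion metadata before the record enters the lake."""
    return {**record, "ingested_at": datetime.date.today().isoformat()}

def ingest(record: dict) -> dict:
    """Run one external record through the prepping chain into the lake."""
    for step in (validate, standardize, tag):
        record = step(record)
    return record
```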
Fig. 2.1.10 shows the data lake infrastructure.