Chapter 8.1

A Brief History of Data Architecture

Abstract

Data architecture began with simple storage devices. But soon, the need to store lots of data and to access the data quickly caused these early devices to disappear. In their place came disk storage. With disk storage, data could be accessed directly. But the volumes of data that needed to be managed soon surpassed what disk storage could handle. One day, there appeared big data. And with big data came the ability to store effectively unlimited amounts of data. But as big data grew, the older day-to-day systems did not go away. There arose a need for a rational way to interface legacy systems to big data.

Keywords

Storage device; Paper tape; Punched cards; Disk storage; Direct access of data; Big data; Interfacing corporate data and big data

Data have been around since the first computer program was written. In many ways, data are the gasoline that fuels the engine of the computer. The way that data are used, the way data are shaped, and the way that data are stored have progressed to the point that there is now an area of study that can be called data architecture.

There are many facets to data architecture because—as we shall see—data are complex. The four most interesting aspects of data architecture are the following:

  • The physical manifestation of data
  • The logical linkage of data
  • The internal format of data
  • The file structure of data

Each of these aspects of data has evolved interdependently over time. Data architecture can best be explained in terms of the evolution of each of these aspects of data architecture.

The evolution of data architecture is seen in Fig. 8.1.1.

Fig. 8.1.1
Fig. 8.1.1 The world of data architecture.

The simplest evolution that has occurred (and has been described in many places) is that of the physical evolution of the media on which data have been stored. Fig. 8.1.2 shows this well-documented evolution.

Fig. 8.1.2
Fig. 8.1.2 The physical dimension of data architecture.

The beginning of the computer industry harks back to paper tape and punched cards. In the very earliest days, data were stored by means of paper tape and punched cards. The value of paper tape and punched cards was that they made it easy to create storage. But there were many problems with paper tape and cards. Hollerith punched cards were fixed format only (everything was stored in 80 columns). Cards were dropped and soiled. Cards could not be repunched. And, all things considered, cards were expensive.

Only so much data could be stored on cards. Very quickly, an alternative to punched cards was needed.

Fig. 8.1.3 shows that punched cards and paper tape were early storage mechanisms for data.

Fig. 8.1.3
Fig. 8.1.3 Punched cards and paper tape.

Next came magnetic tape. Magnetic tape could store much more data than could ever be stored on punched cards. And magnetic tape was not limited to the single format of a punched card. But there were some major limitations to magnetic tape. In order to find data on a magnetic tape file, you had to scan the entire file. And the oxide on magnetic tape files was notoriously unstable.

Magnetic tape files represented a major step forward from punched cards. But magnetic tape files had their own serious limitations.

Fig. 8.1.4 shows the symbol for magnetic tape files.

Fig. 8.1.4
Fig. 8.1.4 Magnetic tape file.

After magnetic tape files came disk storage. With disk storage, data could be accessed directly. No longer was it necessary to search the entire file to find a single record.

The early forms of disk storage were expensive and slow. And there was relatively little capacity in the early forms of disk storage. But quickly, the costs of manufacturing dropped significantly, the capacity increased, and the time required to access data decreased. Disk storage was a superior alternative to magnetic tape files.
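The difference between tape and disk access can be sketched in a few lines. This is a minimal illustration, not any vendor's actual format: the file layout, record length, and function names below are all hypothetical, but the contrast is real. With fixed-length records, disk storage can seek straight to record n, whereas tape forces a sequential scan from the front of the file.

```python
# Hypothetical fixed-length-record file, simulated in memory.
import io

RECLEN = 80  # one fixed-length record, echoing the 80-column card

def sequential_find(f, n):
    """Tape-style access: pass over every record before the nth one."""
    f.seek(0)
    for _ in range(n):
        f.read(RECLEN)          # must read all earlier records first
    return f.read(RECLEN)

def direct_find(f, n):
    """Disk-style access: jump straight to record n."""
    f.seek(n * RECLEN)          # one seek replaces the whole scan
    return f.read(RECLEN)

# Build an in-memory "file" of 1,000 records for the comparison.
data = b"".join(str(i).encode().ljust(RECLEN) for i in range(1000))
f = io.BytesIO(data)

assert sequential_find(f, 742) == direct_find(f, 742)
```

Both functions return the same record; the difference is that the sequential version touches every preceding record, which on real tape means physically winding past them.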

Fig. 8.1.5 shows the symbol for disk storage.

Fig. 8.1.5
Fig. 8.1.5 Disk storage.

The demand for volumes of data increased dramatically. In short order, it was necessary to manage disk storage in a parallel manner. By managing disk storage in a parallel manner, the total amount of data that could be controlled increased significantly. Parallel management of storage did not increase the volume of data that could be managed on a single disk. Instead, parallel storage of data decreased the total elapsed time that was required to access and to manage storage.

Fig. 8.1.6 shows the symbol for parallel management of storage.

Fig. 8.1.6
Fig. 8.1.6 Parallel disk storage.
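The point that parallelism shortens elapsed time without enlarging any single disk can be sketched as follows. The "disks" here are simulated as in-memory lists, and the scan function is hypothetical; with real disk I/O, the elapsed time of the whole job approaches that of the slowest single scan rather than the sum of all the scans.

```python
# Hypothetical parallel scan over four simulated "disks".
from concurrent.futures import ThreadPoolExecutor

# Four disks, each holding its own slice of the data; no disk got bigger.
disks = [list(range(i, i + 100)) for i in (0, 100, 200, 300)]

def scan(disk):
    """Scan one disk; each scan is independent of the others."""
    return sum(disk)

# The four scans are issued concurrently; the partial results are
# combined afterward to answer the overall question.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(scan, disks))

total = sum(partials)
```

The total volume managed is the sum across disks, but the work is divided so that each unit of storage is scanned at the same time as the others.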

Yet another increase in the volume of data that could be managed on disk arrived in the form of big data. Big data was really just another form of parallelism. But with big data, even more data could be managed at an ever-lower unit cost.

Fig. 8.1.7 shows the symbol for big data.

Fig. 8.1.7
Fig. 8.1.7 Big data.

Over the years, then, the total amount of data that could be managed has grown enormously, with an amazing decrease in the unit cost of storage and an ever-increasing speed of access.

But the physical storage of data was hardly the only evolution that was occurring. Another concurrent evolution was that of the way data were logically organized. It is one thing to physically store data. It is another thing to logically organize data so that they can be easily and rationally accessed.

Fig. 8.1.8 shows the evolutionary progression of the logical organization of data.

Fig. 8.1.8
Fig. 8.1.8 The logical linkage of data.

In the very earliest days, data were logically organized in almost a random fashion. Every programmer and every designer “did his/her own thing.” To say that the world was in chaos when it came to logical organization of data was an understatement.

Into this world of chaos came Ed Yourdon and Tom DeMarco. Yourdon espoused a concept called the "structured" approach. (Note that "structured" as Yourdon used it is quite different from the same term as used in describing the internal formatting of data. When Yourdon used the term "structured," he was referring to a logical and organized way of arranging information systems: programming practices, the design of systems, and many other aspects of information systems. Even though the terms are the same, they mean quite different things.)

In Yourdon's approach to structured systems, one aspect concerned how data elements should be logically organized in order to create a disciplined approach to building information systems. Prior to Yourdon, there were many competing schemes for the logical organization of data.

Fig. 8.1.9 is a symbol depicting the Yourdon approach to structured programming and development.

Fig. 8.1.9
Fig. 8.1.9 A networked data structure.

A while later came the idea of database management systems as a means of logically organizing data. With the DBMS came the idea of organizing data hierarchically and in a network. An early hierarchical organization of data was used by IBM's IMS. An early form of network organization of data was Cullinet's IDMS.

In the hierarchical organization of data was the notion of a parent/child relationship. A parent could have zero or more children. And a child had to have a parent.

Fig. 8.1.10 depicts a diagram that has a parent-child relationship and a networked relationship.

Fig. 8.1.10
Fig. 8.1.10 Different relationships.
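The parent/child rules of the hierarchical model can be made concrete with a short sketch. The classes and names below are hypothetical illustrations, not IMS itself: a parent may have zero or more children, and a child cannot exist without a parent.

```python
# Hypothetical classes enforcing the hierarchical parent/child rules.
class Parent:
    def __init__(self, name):
        self.name = name
        self.children = []      # a parent may have zero or more children

class Child:
    def __init__(self, name, parent):
        if parent is None:
            raise ValueError("a child must have a parent")
        self.name = name
        self.parent = parent    # a child has exactly one parent
        parent.children.append(self)

# A parent record with two dependent child records.
acct = Parent("account-1234")
Child("transaction-1", acct)
Child("transaction-2", acct)
```

Trying to create a `Child` with no parent raises an error, which is exactly the constraint the hierarchical model imposed: the child's existence depends on its parent.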

DBMSs were useful for organizing data both for batch processing and for online transaction processing. Many systems were built running transactions under a DBMS.

Soon, there came another notion about the way data should be logically organized: what was termed the relational database management system.

In the relational database management system, data were "normalized." Normalization meant that there was a primary key for each table and that the attributes in the table depended on that key for their existence. Tables could be related to each other by means of a key/foreign-key relationship. Upon access, tables could be "joined" by pairing up the appropriate key and foreign key.

Fig. 8.1.11 shows a relational table.

Fig. 8.1.11
Fig. 8.1.11 Some related relational tables.

As interesting and as important as the logical organization of data is, it is not the only aspect of data architecture.

Another aspect of data architecture is the internal formatting of data. The DBMS approaches to the logical organization of data described above all apply to what is called "structured" data. "Structured" implies that there is some way for the computer to comprehend how the data are organized. The structured way of organizing data applies to many aspects of the corporation. The structured approach is used for organizing customer information, product information, sales information, accounting information, and so forth. The structured approach is used for capturing transaction information.

The unstructured approach is for data that are not organized in a manner that is intelligible to the computer. The unstructured approach applies to images, audio information, downloads from satellites, and so forth. But far and away, the biggest use of the unstructured approach is for textual data.

Fig. 8.1.12 shows the evolution of the internal formatting of data.

Fig. 8.1.12
Fig. 8.1.12 Internal formatting of data architecture.

The structured approach implies that the data are organized enough to be able to be defined to a database management system. Typically, the DBMS has attributes of data, keys, indexes, and records of data. The “schema” of the data is determined as the data are loaded. Indeed, the content of the data and its place in the schema dictate where and how the data are loaded.

Fig. 8.1.13 illustrates data loaded in a structured format.

Fig. 8.1.13
Fig. 8.1.13 A classical index.

The unstructured internal organization of data contains a wide variety of data: e-mail, documents, spreadsheets, analog data, log tape data, and many other varieties.

Fig. 8.1.14 shows unstructured internally organized data.

Fig. 8.1.14
Fig. 8.1.14 Unstructured data.

The world of unstructured data is a world where there is a basic division between repetitive and nonrepetitive data.

Repetitive unstructured data consist of many records whose structure and content are very similar or even identical.

Fig. 8.1.15 shows repetitive unstructured data.

Fig. 8.1.15
Fig. 8.1.15 Repetitive unstructured data.

The other kind of data found in the unstructured environment is that of nonrepetitive data. With nonrepetitive data, there is no correlation between one record of data and any other. If there is a similarity of data between any two records of data in the nonrepetitive environment, it is purely a random event.

Fig. 8.1.16 depicts the nonrepetitive unstructured environment.

Fig. 8.1.16
Fig. 8.1.16 Nonrepetitive unstructured data.
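The division between repetitive and nonrepetitive unstructured data can be illustrated with a short sketch. The records and the pattern below are hypothetical examples: repetitive records (log-style entries) all fit one recurring shape, while nonrepetitive records (free-form text) share no structure at all.

```python
# Hypothetical records illustrating repetitive vs. nonrepetitive data.
import re

# Repetitive: many records, same structure, only the values differ.
repetitive = [
    "2024-01-01 08:00 login  user=a",
    "2024-01-01 08:05 logout user=a",
    "2024-01-01 09:12 login  user=b",
]
pattern = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2} \w+\s+user=\w+$")
assert all(pattern.match(r) for r in repetitive)

# Nonrepetitive: any resemblance between two records is pure accident.
nonrepetitive = [
    "Please reschedule the meeting to Thursday.",
    "Q3 numbers attached; note the footnote on returns.",
]
assert not any(pattern.match(r) for r in nonrepetitive)
```

One regular expression captures every repetitive record; no single pattern could be written in advance for the nonrepetitive records, which is exactly what makes them harder to process.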

Yet another aspect of data is the file organization of the data. Starting with a very simple file organization, the world has progressed to a very elaborate and sophisticated organization of data.

Fig. 8.1.17 shows the evolution of file structures of data.

Fig. 8.1.17
Fig. 8.1.17 Data architecture file structures.

In the early days, file organization was very crude and simple. Soon, technology vendors recognized that a more formal approach was needed. Thus were born simple files, as seen in Fig. 8.1.18.

Fig. 8.1.18
Fig. 8.1.18 A file being written by an application.

In Fig. 8.1.18 are found very simple files. These files were simple collections of data, organized however the designer thought they needed to be organized. In almost every case, the files were designed to be optimized around the needs of a single application.

But soon, it was recognized that the same or very similar information was being collected by more than one application. It was recognized that this overlap of effort was both wasteful and resulted in redundant data being collected and managed. The solution was the creation of a master file.

The master file was a place where data could be gathered in a nonredundant manner.

Fig. 8.1.19 shows a master file.

Fig. 8.1.19
Fig. 8.1.19 A magnetic tape file being written by an application.

The master file was a good idea and worked well. The only problem was that a master file resided on magnetic tape, and tape files were clumsy to use. Soon, the idea of a master file evolved into the idea of a database. Thus was born the database concept, as seen in Fig. 8.1.20.

Fig. 8.1.20
Fig. 8.1.20 Disk storage being written by an application.

The early notion of a database was as a “place where all data resided for a subject area.” In a day and age where lots of data were still lying in files and master files, the idea of a database was an appealing approach. And given that data could be accessed on disk storage in a database, the idea of a database was especially appealing.

Soon, however, the database concept morphed into the online database concept. In the online database concept, data could not only be accessed directly but could also be accessed in a real-time, online mode. In the online, real-time mode, data could be added, deleted, and changed as business needs changed.

Fig. 8.1.21 depicts the online, real-time environment.

Fig. 8.1.21
Fig. 8.1.21 Online transaction processing, OLTP.

The online database environment opened up computing to parts of the business where never before had any interaction been possible. Soon, there were applications everywhere. And in short order, the applications spawned what was known as the spider's web environment.

With the spider's web environment came the need for integrity of data. Soon, the concept of the data warehouse arose.

Fig. 8.1.22 shows the advent of the data warehouse.

Fig. 8.1.22
Fig. 8.1.22 Transforming application data to corporate data.

The data warehouse provided the organization with the “single version of the truth.” Now, there was a basis for reconciliation of data. With the data warehouse—for the first time—came a place where historical data could be stored and used. The data warehouse represented a fundamental step forward for the information processing systems of the organization.

As important as the data warehouse was, other elements of architecture were needed. It was soon recognized that something between a data warehouse and a transactional system was needed. Thus was born the ODS, or operational data store.

Fig. 8.1.23 shows the ODS.

Fig. 8.1.23
Fig. 8.1.23 Sometimes there is a need for an operational data store (ODS).

The ODS was a place where online high-performance processing could be done on corporate data. Not every organization had need of an ODS, but many did.

At about the time that the data warehouse was born, it was recognized that organizations needed a place where individual departments could go to find data for their own individual analytic needs. Into this analytic environment came the data mart, or the dimensional model.

Fig. 8.1.24 shows the star join, the foundation of the data mart.

Fig. 8.1.24
Fig. 8.1.24 A star join.
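The shape of a star join can be sketched in a few lines. The tables below are hypothetical: a central fact table holds the measures, and each of its foreign keys radiates out to a small dimension table that supplies descriptive context.

```python
# Hypothetical star schema: one fact table, two dimension tables.
date_dim    = {1: "2024-Q1", 2: "2024-Q2"}
product_dim = {10: "widget", 11: "gadget"}

# Each fact row holds a measure plus one foreign key per dimension.
fact_sales = [
    {"date_id": 1, "product_id": 10, "revenue": 100.0},
    {"date_id": 1, "product_id": 11, "revenue": 250.0},
    {"date_id": 2, "product_id": 10, "revenue": 175.0},
]

# "Joining" the star: resolve each foreign key against its dimension
# to turn bare keys into readable analytic rows.
report = [
    (date_dim[f["date_id"]], product_dim[f["product_id"]], f["revenue"])
    for f in fact_sales
]
```

The fact table stays lean because descriptive values such as the quarter name and product name live once in the dimensions and are picked up only at query time.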

It was then recognized that a more formal treatment of data marts was needed than just having dimensional models. The notion of the dependent data mart was then created.

Fig. 8.1.25 shows dependent data marts.

Fig. 8.1.25
Fig. 8.1.25 The integrated analytical environment.

The evolution that has been described did not happen independently. The evolution happened concurrently. Indeed, the evolution to some levels of development depended on evolutionary developments that occurred in other arenas. For example, the evolution to online databases could not have occurred until the technology that supported online processing had been developed.

Or the movement to data warehouses could not have occurred until the cost of storage dropped to an affordable level.
