Chapter 1.7

A Brief History of Data

Abstract

Corporate data include everything found in the corporation in the way of data. The most basic division of corporate data is into structured data and unstructured data. As a rule, there are far more unstructured data than structured data. Unstructured data have two basic divisions: repetitive data and nonrepetitive data. Big data is made up of unstructured data. Nonrepetitive big data has a fundamentally different form than repetitive big data. In fact, the differences between nonrepetitive big data and repetitive big data are so large that they can be called the boundaries of the “great divide.” The divide is so large that many professionals are not even aware it exists. As a rule, nonrepetitive big data has much greater business value than repetitive big data.

Keywords

Structured data; Unstructured data; Corporate data; Repetitive data; Nonrepetitive data; Business value; The great divide of data; Big data

No book on data architecture would be complete without a narrative regarding the advances made in the technology of data.

In the beginning were wired boards. These hand-wired boards were “plug-ins” to an early rendition of the computer. The hardwired connections directed the computer as to how data were to be treated.

Paper Tape and Punch Cards

But wired boards were clumsy and error prone and could handle only small volumes of data (very small volumes of data!). Soon, an alternative appeared: paper tape and punched cards. Paper tape and punched cards were able to handle larger volumes of data. And a greater range of functions could be handled with punched cards and paper tape. But there were problems with paper tape and punched cards. When a programmer dropped a deck of cards, reconstructing the sequence of the cards was a very laborious activity. And once a card was punched, it was next to impossible to make a change to the card (although in theory it could be done).

Another shortcoming was that only a relatively small amount of data could be held in these media.

Fig. 1.7.1 depicts the media of cards and paper tape.

Fig. 1.7.1 Punched cards and paper tape.

Magnetic Tapes

Quickly replacing paper tape and punched cards was the magnetic tape. The magnetic tape was an improvement over paper tape and punched cards. With a magnetic tape, a much larger volume of data could be stored. And the record size that could be stored on a magnetic tape was variable. (Previously, the record size stored on a punched card was fixed.) So magnetic tape brought some important improvements.

But there were limitations that came with magnetic tapes. One limitation was that the magnetic tape file had to be accessed sequentially. This meant that the analyst had to sequentially search through the entire file when looking for a single record. Another limitation of the magnetic tape file was that over time, the oxide on the tape stripped away. And once the oxide was gone, the data on the tape were irretrievable.
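
To make the sequential-access limitation concrete, the sketch below (in Python, purely for illustration; the "key,value" record layout is hypothetical) shows what finding a single record on a sequential medium entails: every record ahead of the target must be read first.

# A minimal sketch of sequential access, as on a magnetic tape file.
# The record layout (one "key,value" record per line) is hypothetical.
def find_record_sequential(path, target_key):
    with open(path, "r") as tape:
        for line in tape:                     # records can only be read in order
            key, _, value = line.partition(",")
            if key == target_key:             # the target turns up only after
                return value.strip()          # everything before it has been read
    return None                               # end of tape reached: no such record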

Despite the limitations of the magnetic tape file, the magnetic tape file was an improvement over punched cards and paper tape.

Fig. 1.7.2 shows a magnetic tape file.

Fig. 1.7.2 Magnetic tape.

Disk Storage

The limitations of the magnetic tape file were such that soon there was an alternative medium. That alternative medium was called disk storage (or direct access storage). Direct access storage held the great advantage that data could be accessed directly. No longer was it necessary to read an entire file in order to access just one record. With disk storage, it was possible to go directly to a unit of data.
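
By contrast, a minimal sketch of direct access (again in Python, assuming a hypothetical file of fixed-length 80-byte records) computes the byte offset of the desired record and jumps straight to it.

# A minimal sketch of direct access, as on disk storage.
# The fixed record length of 80 bytes is assumed purely for illustration.
RECORD_SIZE = 80

def read_record_direct(path, record_number):
    with open(path, "rb") as disk:
        disk.seek(record_number * RECORD_SIZE)   # go directly to the record
        return disk.read(RECORD_SIZE)            # no scan of the file is needed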

Fig. 1.7.3 shows disk storage.

Fig. 1.7.3 Disk storage.

At first, disk storage was expensive, and not much capacity was available. But the hardware vendors quickly improved the speed, the capacity, and the cost of disk storage. And the improvements have continued to this day.

Database Management System (DBMS)

Along with the advent of disk storage came the appearance of the database management system (DBMS). The database management system controlled the placement, access, update, and deletion of data on disk storage. The DBMS saved the programmer from doing repetitive and complex work.
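
A minimal sketch of the work a DBMS takes over is shown below. SQLite is used purely for illustration (it is not a DBMS of that era), and the table is hypothetical; the point is that placement, access, update, and deletion are each expressed in a single declarative statement rather than hand-coded against the disk.

import sqlite3

# Placement, access, update, and deletion, each handled by the DBMS.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE part (part_no TEXT PRIMARY KEY, qty INTEGER)")
con.execute("INSERT INTO part VALUES ('A-100', 25)")              # placement
row = con.execute("SELECT qty FROM part "
                  "WHERE part_no = 'A-100'").fetchone()           # access: row == (25,)
con.execute("UPDATE part SET qty = 30 WHERE part_no = 'A-100'")   # update
con.execute("DELETE FROM part WHERE part_no = 'A-100'")           # deletion
con.commit()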

With the appearance of the DBMS came the ability to tie processors to the database (and disk). Fig. 1.7.4 shows the advent of the DBMS and the close coupling of the database with the computer.

Fig. 1.7.4 Uniprocessor architecture.

At first, a simple uniprocessor architecture sufficed. In a uniprocessor architecture, there was an operating system, the DBMS, and an application. The early computers managed all of these components. But in short order, the capacity of the processor was stretched. At this point, the gains in capacity switched from improvements in the storage technology itself to improvements in the management of the storage technology. Prior to this point in time, the great leaps forward in data had been made by improving the storage media. But after this point, the great leaps forward were made architecturally, at the processor level.

Soon, the uniprocessor simply ran out of capacity. The consumer could always buy a bigger, faster processor, but soon, the consumer was surpassing the capacity of even the largest uniprocessor.

Coupled Processors

The next major advance was the tight coupling together of multiple processors. Fig. 1.7.5 shows the coupling together of multiple processors.

Fig. 1.7.5 Multiplexed architecture.

Coupling multiple processors together increased the processing capacity. The ability to couple the processors together was made possible by the sharing of memory across the different processors.
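
A minimal sketch of the shared-memory idea, using Python's multiprocessing module purely as an analogy for tightly coupled processors: several workers update a single counter that lives in memory visible to all of them. The workload itself is hypothetical.

from multiprocessing import Process, Value, Lock

def worker(total, lock, n):
    for _ in range(n):
        with lock:                 # coordinate access to the shared memory
            total.value += 1

if __name__ == "__main__":
    total = Value("i", 0)          # an integer living in shared memory
    lock = Lock()
    procs = [Process(target=worker, args=(total, lock, 1000)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(total.value)             # 4000: every worker saw the same memory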

Online Transaction Processing

With the advent of greater processing power and the control of a DBMS, it was now possible to create a new kind of system. The new kind of system was called the online real-time system. The processing done by this type of system was called OLTP, or online transaction processing.

Fig. 1.7.6 shows an online real-time system.

Fig. 1.7.6 Online real-time architecture.

With online real-time processing, the computer could now be used interactively, in a manner that had not been possible before. The business could now be engaged with the computer directly. Suddenly, there were airline reservation systems, bank teller systems, ATM systems, inventory management systems, car reservation systems, and many, many more. Once real-time online processing became a reality, the computer was used in business as never before.
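
At the heart of OLTP is the transaction: a unit of work that either completes entirely or not at all. Below is a minimal sketch, with SQLite standing in for an OLTP DBMS and a hypothetical account table; a funds transfer is committed as a whole or rolled back as a whole.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO account VALUES (?, ?)",
                [("checking", 500), ("savings", 200)])

try:
    with con:  # one atomic transaction: committed on success, rolled back on error
        con.execute("UPDATE account SET balance = balance - 100 "
                    "WHERE id = 'checking'")
        con.execute("UPDATE account SET balance = balance + 100 "
                    "WHERE id = 'savings'")
except sqlite3.Error:
    pass       # on failure, neither update takes effect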

And with the explosive growth in the usage of the computer came an explosive growth in the amount and types of data being created. With the flood of data came the desire for integrated data. No longer was it sufficient merely to have data from a single application. With the flood of data came the need to look at data in a cohesive manner.

Data Warehouse

Thus was born the data warehouse, as seen in Fig. 1.7.7.

Fig. 1.7.7 Data warehouse architecture.

With the data warehouse came what was called the single version of the truth or the system of record. With the single version of the truth, the organization now had a foundation of data that it could turn to with confidence.

The volumes of data continued to explode with the advent of the data warehouse. Prior to the data warehouse, there was no convenient place to store historical data. But with the data warehouse, for the first time, there was a convenient and natural place for historical data.

Parallel Data Management

It was normal and natural that, with the ability to store large amounts of data, the demand for data management products and technology skyrocketed. Soon, there emerged an architectural approach called the parallel approach to data management.

Fig. 1.7.8 illustrates the parallel approach to data management.

Fig. 1.7.8 Parallel architecture.

With the parallel approach to data management, a huge amount of data could be accommodated. Far more data could be managed in parallel than was ever possible with nonparallel techniques. With the parallel approach, the limiting factor as to how much data could be managed was an economic limitation, not a technical limitation.
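
A minimal sketch of the parallel idea, in Python: the data are split into partitions, each partition is handed to its own worker, and the partial results are combined at the end. The partitioning and the workload are purely illustrative.

from multiprocessing import Pool

def count_rows(partition):
    return len(partition)           # stand-in for real per-partition work

if __name__ == "__main__":
    data = list(range(1_000_000))
    partitions = [data[i::4] for i in range(4)]      # split the data four ways
    with Pool(processes=4) as pool:
        partials = pool.map(count_rows, partitions)  # one worker per partition
    print(sum(partials))            # combine the partial results: 1000000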

Data Vault

As data warehouses grew, it was realized that there needed to be more flexibility in the design of the data warehouse and improvements in the integrity of the data. Thus was born the data vault, as seen in Fig. 1.7.9.

Fig. 1.7.9 Data vault architecture.

With the data vault, the data warehouse gained a far more flexible design and stronger data integrity.

Big Data

But volumes of data continued to increase. Soon, there were systems that went beyond the capacity of even the largest parallel database. A new technology known as big data evolved, in which the data management software was optimized for the volume of data to be managed rather than for online access to the data.

Fig. 1.7.10 depicts the arrival of big data.

Fig. 1.7.10 Big data architecture.

With big data came the ability to capture and store an almost unlimited amount of data. And the ability to handle massive amounts of data brought with it the need for a completely new infrastructure.
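
The change in optimization can be sketched in a few lines. Where an OLTP system fetches one record through an index, a big data job typically scans an entire (potentially huge) file end to end, in the map-reduce style these platforms popularized. Plain Python, no framework, purely for illustration:

from collections import Counter

def word_count(path):
    totals = Counter()
    with open(path) as f:
        for line in f:                    # the whole file is scanned, end to end
            for word in line.split():     # "map": emit a key for every word
                totals[word] += 1         # "reduce": sum the counts per key
    return totals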

The Great Divide

And with the recognition of the need for a new infrastructure came the recognition that there are two distinctly different types of big data. There is repetitive big data, and there is nonrepetitive big data. And repetitive big data and nonrepetitive big data require dramatically different infrastructures.
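
Purely illustrative examples of the two types: repetitive big data consists of records that all share one structure, while nonrepetitive big data (e-mails, call center notes, warranty claims) shares no structure at all.

# Illustrative only: the same notion of a "record" in repetitive
# and nonrepetitive form.
repetitive = [
    ("2024-01-01", "meter-7", 71.2),   # every record: same fields, same types
    ("2024-01-02", "meter-7", 70.9),
    ("2024-01-03", "meter-7", 71.4),
]
nonrepetitive = [
    "Customer called about a late shipment and asked for a refund.",
    "RE: contract renewal - see attached redlines.",
    "Warranty claim: unit overheats after about two hours of use.",
]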

Fig. 1.7.11 shows the recognition of the difference between repetitive big data and nonrepetitive big data.

Fig. 1.7.11 The great divide.