Chapter 1.2

The Data Infrastructure


Corporate data include everything found in the corporation in the way of data. The most basic division of corporate data is into structured data and unstructured data. As a rule, there are far more unstructured data than structured data. Unstructured data have two basic divisions: repetitive data and nonrepetitive data. Big data is made up of unstructured data. Nonrepetitive big data has a fundamentally different form from repetitive big data. In fact, the differences between nonrepetitive big data and repetitive big data are so large that they can be called the boundaries of the “great divide.” Even so, many professionals are not aware that this divide exists. As a rule, nonrepetitive big data has much greater business value than repetitive big data.


If there is any secret to data management and data architecture, it is understanding data in terms of its infrastructure. Stated differently, it is almost impossible to understand the larger architecture under which data are managed and operate without understanding the infrastructure that surrounds the data. Therefore, we shall spend some time understanding infrastructure.

Two Types of Repetitive Data

A good starting point for understanding infrastructure is the observation that there are two types of repetitive data in corporate data. Repetitive data are found on the structured side of corporate data, and repetitive data are also found on the unstructured big data side. Although the two types sound the same, there are significant differences between them. With structured repetitive data, it is normal for transactions to make up the repetitive data. There are sales transactions, SKU stocking transactions, inventory replenishment transactions, payment transactions, and so forth. Many such transactions find their way into the repetitive structured world.

The other kind of repetitive data is the repetitive data found in the unstructured big data world. In the unstructured big data world, we might have metering data, analog data, manufacturing data, clickstream data, and so forth.

The question then is whether these types of repetitive data are the same. Both are certainly repetitive, but they are not the same. What then is the difference between them? Fig. 1.2.1 shows (symbolically) these two types of repetitive data.

Fig. 1.2.1 Two types of repetitive data.

Repetitive Structured Data

To understand the differences between these two types of repetitive data, it is necessary to understand each type individually. Let's start with repetitive structured data. Fig. 1.2.2 shows that repetitive structured data are broken into records and blocks.

Fig. 1.2.2 Repetitive data broken into blocks.

The most basic unit of information in the repetitive structured environment is a block of data. Inside each block of data are records of data.

Fig. 1.2.3 shows a simple record of data.

Fig. 1.2.3 Records inside a block.

Each record of data normally represents a transaction. For example, there are records of data representing the sale of a product. Each record represents a single sale.

Inside each record are keys, attributes, and indexes. Fig. 1.2.4 shows the anatomy of a record.

Fig. 1.2.4 Attributes, keys, and indexes.

If a record represents a sale, the attributes might be the date of the sale, the item sold, the cost of the item, any tax on the item, who bought the item, and so forth. The key of the record is one or more attributes that uniquely identify the record. The key for a sale might be the date of sale, item sold, and location of the sale.

The indexes attached to the record are built on the attributes through which quick access to the record is needed.
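As a minimal sketch of the anatomy just described, the following Python models a sale record with attributes, a key, and one index. The field names (`sale_date`, `item`, `location`, `cost`, `tax`, `buyer`) and the `SalesBlock` class are hypothetical illustrations chosen for this example; they are not taken from any particular DBMS.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SaleRecord:
    # Attributes of a single sale (all names are illustrative assumptions).
    sale_date: str
    item: str
    location: str
    cost: float
    tax: float
    buyer: str

    @property
    def key(self):
        # The key: the attributes that together uniquely identify the record.
        return (self.sale_date, self.item, self.location)


class SalesBlock:
    """A toy block of records with a key lookup and one index on `buyer`."""

    def __init__(self):
        self.by_key = {}                      # key -> record
        self.buyer_index = defaultdict(list)  # buyer -> list of keys

    def insert(self, record):
        if record.key in self.by_key:
            raise ValueError(f"duplicate key: {record.key}")
        self.by_key[record.key] = record
        self.buyer_index[record.buyer].append(record.key)

    def find_by_buyer(self, buyer):
        # Quick access through the index instead of scanning every record.
        return [self.by_key[k] for k in self.buyer_index.get(buyer, [])]


block = SalesBlock()
block.insert(SaleRecord("2024-01-05", "widget", "Denver", 9.99, 0.80, "alice"))
block.insert(SaleRecord("2024-01-05", "gadget", "Denver", 24.50, 1.96, "bob"))
block.insert(SaleRecord("2024-01-06", "widget", "Austin", 9.99, 0.82, "alice"))
alice_sales = block.find_by_buyer("alice")
```

Here the index on `buyer` exists only because quick access by buyer is assumed to be needed; a different access pattern would call for an index on a different attribute.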

The infrastructure that is attached to structured repetitive data managed under a DBMS is seen in Fig. 1.2.5.

Fig. 1.2.5 A standard DBMS.

Repetitive Big Data

The other type of repetitive data is repetitive data found in big data. Fig. 1.2.6 depicts the repetitive data found in big data.

Fig. 1.2.6 Repetitive big data.

At first glance, there are just a lot of repetitive records seen in Fig. 1.2.6. But upon closer examination, it is seen that all of those repetitive big data records are packed away into a string of data and that string of data is stored inside a block of data, as seen in Fig. 1.2.7.

Fig. 1.2.7 A block of data.

The structured infrastructure seen earlier in Fig. 1.2.5 is typical of an infrastructure managed under one of several DBMSs such as Oracle, SQL Server, and DB2.

The infrastructure for big data is quite different from the infrastructure found in a standard DBMS. In the infrastructure for big data, there is a block, and in the block are found many repetitive records. Each record is merely concatenated to the record before it. Fig. 1.2.8 is representative of the records that might be found in big data.

Fig. 1.2.8 Records inside the block.

In Fig. 1.2.8, it is seen that there is merely a long string of data, with records stacked one against the other. The system only sees the block and the long string of data. In order to find a record, the system needs to “parse” the string, as seen in Fig. 1.2.9.

Fig. 1.2.9 Parsing records inside the block.

Suppose the system wants to find a given record, say record “B.” The system needs to sequentially read the string of data until it recognizes that there is a record. Then, the system needs to go into the record and determine whether it is record “B.” This is how a search is conducted in the most primitive state in big data.
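The sequential search described above can be sketched in Python as follows. The block layout is an assumption made purely for illustration: records are separated by `;`, fields by `,`, and the first field is taken to be the record identifier. Real big data stores use their own formats, and the function name `find_record` is hypothetical.

```python
def find_record(block, wanted_id, record_sep=";", field_sep=","):
    """Scan a concatenated block left to right.

    Returns (fields, records_examined); fields is None if nothing matched.
    """
    examined = 0
    start = 0
    while start < len(block):
        # Recognize where the next record ends.
        end = block.find(record_sep, start)
        if end == -1:              # last record has no trailing separator
            end = len(block)
        raw = block[start:end]
        start = end + 1
        if not raw:                # skip empty fragments
            continue
        examined += 1
        # Go into the record and check whether it is the one wanted.
        fields = raw.split(field_sep)
        if fields[0] == wanted_id:
            return fields, examined
    return None, examined


block = "A,meter-17,120.5;B,meter-17,121.0;C,meter-18,98.2;"
fields, examined = find_record(block, "B")
missing, examined_all = find_record(block, "Z")
```

Finding record “B” requires parsing every record that precedes it, and a search for a record that is not present parses the whole block.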

It doesn’t take much imagination to see that a lot of machine cycles are consumed looking for data in big data. To deal with this, the big data environment employs a means of processing referred to as the “Roman census” approach. The Roman census approach is described further in the chapter on big data.

The Two Infrastructures

The two different infrastructures are contrasted in Fig. 1.2.10.

Fig. 1.2.10 Two different infrastructures.

Without much effort, it is seen that the infrastructures surrounding big data and structured data are quite different. The infrastructure surrounding big data is quite simple and streamlined. The infrastructure surrounding structured DBMS data is elaborate and anything but streamlined.

There is then no argument that there are significant differences between the infrastructure of repetitive structured data and that of repetitive big data.

What's Being Optimized?

When looking at the two infrastructures, it is natural to ask what each infrastructure is optimizing. In the case of big data, the infrastructure is optimized for the ability of the system to manage almost unlimited amounts of data. Fig. 1.2.11 shows that with the infrastructure of big data, adding new data is a very easy and streamlined thing to do.

Fig. 1.2.11 Optimal for storing massive amounts of data.

But the infrastructure behind a structured DBMS is optimized for something quite different from managing huge amounts of data. In the structured DBMS environment, the optimization is for the ability to find any one given unit of data quickly and efficiently.

Fig. 1.2.12 shows the optimization of the infrastructure of a standard structured DBMS.

Fig. 1.2.12 Optimal for direct online access of data.

Comparing the Two Infrastructures

Another way to think of the different infrastructures is in terms of the amount of data and overhead required to find a given unit of data. To find a given unit of data, the big data environment has to search through a whole host of data. Many input/output operations (I/Os) must be done to find a given item. To find that same item in a structured DBMS environment, only a few I/Os need to be done. So if you want to optimize for speed of access to data, the standard structured DBMS is the way to go.
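A rough way to see the difference in I/O is to count simulated block reads. The sketch below is a simplified model under stated assumptions, not a measurement of any real system: each block read counts as one I/O, the index is a plain in-memory dictionary from key to block number, and the function names are hypothetical. A real DBMS index is typically a multilevel structure that itself costs a few I/Os to traverse.

```python
def scan_lookup(blocks, wanted):
    """Read blocks in order until the key is found; return (found, block_reads)."""
    reads = 0
    for block in blocks:
        reads += 1                 # one simulated I/O per block read
        if wanted in block:
            return True, reads
    return False, reads


def indexed_lookup(blocks, index, wanted):
    """Use a key -> block-number index; return (found, block_reads)."""
    block_no = index.get(wanted)   # index probe (assumed to be in memory)
    if block_no is None:
        return False, 0
    return wanted in blocks[block_no], 1


# 1,000 blocks of 100 keys each; the key values are arbitrary integers.
blocks = [list(range(b * 100, (b + 1) * 100)) for b in range(1000)]
index = {key: b for b, block in enumerate(blocks) for key in block}

scan_result = scan_lookup(blocks, 99_950)
index_result = indexed_lookup(blocks, index, 99_950)
```

In this model, a key stored in the last block costs 1,000 block reads by scanning and a single block read through the index.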

On the other hand, to achieve that speed of access, the standard structured DBMS requires an elaborate infrastructure for data. That infrastructure must be both built and maintained over time, as data change. A considerable amount of system resources is required for the building and maintenance of this infrastructure. But when it comes to big data, almost no such infrastructure has to be built and maintained. What little infrastructure big data has is built and maintained very easily.
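The maintenance trade-off can be illustrated with a small Python sketch. Both classes are hypothetical toys: `AppendOnlyBlock` stands in for the big data style of simply concatenating records, and `IndexedStore` stands in for a DBMS that keeps an index current on every insert. The sorted list maintained with the standard `bisect` module is an assumption chosen for brevity, not how any particular DBMS builds its indexes.

```python
import bisect


class AppendOnlyBlock:
    """Big data style: new records are simply concatenated to the end."""

    def __init__(self):
        self.data = []

    def add(self, key, value):
        self.data.append((key, value))   # no other structure to update


class IndexedStore:
    """DBMS style: each insert also keeps a sorted index up to date."""

    def __init__(self):
        self.rows = []
        self.index = []                  # sorted list of (key, row_number)

    def add(self, key, value):
        self.rows.append((key, value))
        # Extra work on every insert: place the key in sorted position.
        bisect.insort(self.index, (key, len(self.rows) - 1))

    def get(self, key):
        # Binary search of the index, then one direct row access.
        i = bisect.bisect_left(self.index, (key, -1))
        if i < len(self.index) and self.index[i][0] == key:
            return self.rows[self.index[i][1]][1]
        return None


log = AppendOnlyBlock()
store = IndexedStore()
for key, value in [(42, "x"), (7, "y"), (19, "z")]:
    log.add(key, value)
    store.add(key, value)
```

The append-only block does nothing beyond storing the record, while the indexed store pays on every insert so that `get` can later locate a record without scanning.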

This section began with the proposition that repetitive data can be found in both the structured and big data environments. At first glance, the repetitive data are the same or very similar. But when you look at the infrastructure and the mechanics implied by the infrastructure, it is seen that the repetitive data in each of the environments are indeed very different.

