Chapter 4.2

What Is Big Data?

Abstract

There are different definitions of big data. The definition used here is that big data encompasses a lot of data, is based on inexpensive storage, manages data by the “Roman census” method, and stores data in an unstructured format. There are two major types of big data—repetitive big data and nonrepetitive big data. Only a small fraction of repetitive big data has business value, whereas almost all of nonrepetitive big data has business value. In order to achieve business value, the context of data in big data must be determined. Contextualization of repetitive big data is easily achieved. But contextualization of nonrepetitive data is done by means of textual disambiguation.

Keywords

Big data; Roman census method; Unstructured data; Repetitive data; Nonrepetitive data; Contextualization; Textual disambiguation

The definition of big data as given by the Gartner Group is

  • volume,
  • velocity,
  • variety.

While this definition is widely quoted and used, it is not really a definition at all. The load hauled by a semitruck going down the highway fits this definition, and so does the cargo of an ocean liner. In fact, many things other than big data fit this definition.

Another Definition

The problem with the Gartner definition is that it describes some of the characteristics of big data, but it does not disclose the identifying characteristics.

The definition of big data that we will use for this book is as follows:

  • Big data is
    • data that is stored in very large volumes,
    • data that is stored on inexpensive storage,
    • data that is managed by the “Roman census” method,
    • data that is stored and managed in an unstructured format.

These then are the defining characteristics of big data that will be used in this book.

Each of these characteristics deserves a fuller explanation.

Large Volumes

Most organizations already have an adequate amount of data to run their day-to-day business. But some organizations have an extraordinary amount of data and a need to look at such things as the following:

  • All the data on the Internet
  • Meteorologic data sent down by a satellite
  • All of the e-mails in the world
  • Manufacturing data generated by an analog computer
  • Railroad cars as they traverse tracks
  • Many more applications

For these organizations, conventional technology offers no good and inexpensive way to store and manage such data. Even if the data could be stored in a standard DBMS, the cost of storage would be exorbitantly high. So for some organizations, there is a need to store and manage very large amounts of data.

When facing the challenge of managing very large amounts of data, the issue of business value arises. The fundamental question of “what business value is there in being able to look at massive volumes of data?” needs to be addressed. The old saw of “build it and they will come” does not apply to large amounts of data. Before an organization sets out to store massive amounts of data, there needs to be a good understanding of what business value lies in the data itself.

Inexpensive Storage

Even if big data can store and manage massive amounts of data, it would not be practical to create huge stores if the storage medium used were expensive. Stated another way, if big data resided only on expensive high-performance storage, the cost of big data would be prohibitive. In order to be a practical and useful solution, big data, of necessity, must be able to use inexpensive storage.
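
As a rough illustration of the economics, consider the arithmetic below. The prices used are purely hypothetical placeholders, not vendor quotes, but they show why the choice of storage medium dominates the cost of big data at scale.

# Back-of-envelope comparison of storage costs at big data scale.
# All prices are hypothetical placeholders, not vendor quotes.

PETABYTE_GB = 1_000_000                  # 1 PB expressed in gigabytes
volume_gb = 10 * PETABYTE_GB             # an assumed 10 PB corpus

cost_per_gb_high_performance = 1.00      # assumed $/GB, high-performance DBMS storage
cost_per_gb_commodity = 0.02             # assumed $/GB, commodity storage

print(f"High-performance storage: ${volume_gb * cost_per_gb_high_performance:,.0f}")
print(f"Commodity storage:        ${volume_gb * cost_per_gb_commodity:,.0f}")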

The Roman Census Approach

One of the cornerstones of big data architecture is a style of processing referred to as the “Roman census approach.” By using the Roman census approach, a big data architecture can accommodate the processing of almost unlimited amounts of data.

When people first hear of the “Roman census approach,” it seems counterintuitive and unfamiliar. The reaction most people have is “just what exactly is a Roman census approach?” Yet, the approach—architecturally—is at the core of the functioning of big data. And—surprisingly—it turns out that many people are much more familiar with the Roman census approach than they ever realized.

Once upon a time—about 2000 years ago—the Romans decided that they wanted to tax everyone in the Roman empire. But in order to tax the citizens of the Roman empire, the Romans first had to have a census. The Romans quickly figured out that trying to get every person in the Roman empire to march through the gates of Rome in order to be counted was an impossibility. There were people in North Africa, in Spain, in Germany, in Greece, in Persia, in Israel, in England, and so forth. Not only were there a lot of people in faraway places, but trying to transport everyone to and from the city of Rome on ships and carts and donkeys was simply not feasible.

So, the Romans realized that creating a census where the processing (i.e., the counting and the taking of the census) was done centrally was not going to work. The Romans solved the problem by creating a body of “census takers.” The census takers were organized in Rome and then were sent all over the Roman empire, and on the appointed day, a census was taken. Then, after taking the census, the census takers headed back to Rome where the results were tabulated centrally.

In such a fashion, the work being done was sent to the data, rather than trying to send the data to a central location and doing the work in one place. By distributing the processing, the Romans solved the problem of creating a census over a large diverse population.

Many people don't realize that they are very familiar with the Roman census method without knowing it. You see, there once was a story about two people—Mary and Joseph—who had to travel to a small city, Bethlehem, for the taking of a Roman census. There, Mary had a baby boy—named Jesus—who was laid in a manger. And the shepherds flocked to see the baby boy. And the Magi came and delivered gifts. Thus was born the religion many people are familiar with—Christianity. The Roman census approach is intimately entwined with the birth of Christianity.

The Roman census method, then, says that you don't centralize processing when you have a large amount of data to process. Instead, you send the processing to the data. You distribute the processing. In doing so, you can process an effectively unlimited amount of data.
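
A minimal sketch of the pattern, in Python, follows. The “provinces” here are illustrative local partitions of data, and the loop stands in for remote execution on the nodes that hold each partition; nothing in the sketch is a prescribed implementation.

# A minimal sketch of the Roman census approach: ship a small counting
# function to each partition of the data, and send back only the tallies.

# Illustrative partitions standing in for data held on distant nodes.
provinces = {
    "hispania": ["person"] * 120_000,
    "gallia":   ["person"] * 340_000,
    "aegyptus": ["person"] * 275_000,
}

def take_census(records):
    """The 'census taker': runs where the records live and returns only
    a tiny summary, never the raw records themselves."""
    return len(records)

# Each call stands in for remote execution on the node holding the data.
local_tallies = {name: take_census(records) for name, records in provinces.items()}

# Central tabulation touches only the tallies, not the underlying data.
print(local_tallies, "total:", sum(local_tallies.values()))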

Unstructured Data

Another issue related to big data is that of whether big data is structured or unstructured. In many circles, it is said that all big data is unstructured. In other circles, it is said that big data is structured.

So who is right? As we shall see, the answer lies entirely in how you define the terms “structured” and “unstructured.”

So what does “structured” mean? One widely used definition of structured is that anything managed by a standard DBMS is structured. Fig. 4.2.1 shows some data managed by a standard database management system.

Fig. 4.2.1 A standard database structure.

In order to load the data into the DBMS, there needs to be a careful definition of the logical and the physical characteristics of the system. All data elements—attributes, keys, indexes, etc.—need to be defined before the data can be loaded into the system.
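
A small sketch using Python's built-in sqlite3 module illustrates the point; the table, column, and index names are invented for illustration. The structure must be declared before a single row can be loaded.

# Before any data can be loaded into a standard DBMS, the structure
# (attributes, keys, indexes) must be defined up front.
import sqlite3

conn = sqlite3.connect(":memory:")

# The logical and physical definition comes first (names are illustrative).
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- key declared before loading
        name        TEXT NOT NULL,         -- attribute with a declared type
        balance     REAL
    )
""")
conn.execute("CREATE INDEX idx_customer_name ON customer (name)")

# Only now can rows be loaded, and they must conform to the definition.
conn.execute("INSERT INTO customer VALUES (1, 'Jones', 250.00)")
print(conn.execute("SELECT * FROM customer").fetchall())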

The notion that structured means “able to be managed by a standard DBMS” is a very widely used understanding of the term. The meaning has been around for a long time and is widely understood by a large body of people.

Data in Big Data

Now, consider what data looks like when it is stored in big data. There is none of the definitional infrastructure that is found in a standard DBMS. All sorts of data are stored in big data, with no notion of what the structure of the data looks like.
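
By contrast, a minimal sketch of big data storage, assuming nothing more than a flat file, shows the difference: records of any shape are simply appended, and any structure is imposed only when the data is read back.

# In big data storage, there is no definitional infrastructure: records of
# any shape are appended as-is, and structure is imposed only at read time.
raw_records = [
    '{"click": "/home", "ts": 1710000000}',         # a click-stream record
    "2024-03-09,meter-17,42.7",                     # a metered reading
    "Subject: warranty claim - the unit failed",    # free-form text
]

with open("bigdata_block.dat", "w", encoding="utf-8") as block:
    for record in raw_records:
        # No attributes, keys, or indexes are declared; bytes are just stored.
        block.write(record + "\n")

# Any interpretation happens later, at read time ("schema on read").
with open("bigdata_block.dat", encoding="utf-8") as block:
    print(block.read())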

Fig. 4.2.2 shows data stored in big data.

Fig. 4.2.2 Big data.

If the definition of structured is taken to mean “managed by a standard DBMS,” then the data stored in big data is definitely unstructured.

However, there are different interpretations of what is meant by the term “structured.” Consider the (very normal) circumstance of big data consisting of many repetitive records. Fig. 4.2.3 shows that big data can certainly contain blocks of data that are made up of many repetitive records. There are many instances where big data contains just this sort of information. Some of the many instances include the following:

  • Click stream data
  • Metered data
  • Telephone call record data
  • Analog data
  • Many more types of repetitive records

Fig. 4.2.3 Different types of data.

When there are repetitive records, the same structure of data is repeated over and over, from one record to the next. And oftentimes, the same values of data are repeated as well.

When repetitive records are found in big data, there is no index facility as there is in a standard DBMS. But there still is indicative data in big data even if it is not managed by an index.

Context in Repetitive Data

Fig. 4.2.4 shows that inside each repetitive record in big data, there is information that can be used to identify the record. Sometimes, this information is known as context.

Fig. 4.2.4 Context.

To find this information, the record must be parsed to determine its value. But the fact remains that the information is there, inside the record.
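
A short sketch shows what such parsing can look like. The record layout here, a comma-separated telephone call record, is invented for illustration; the point is that the identifying context sits at a predictable position in every record.

# Repetitive records carry their context at predictable positions, so a
# single parsing rule recovers the identifying data from every record.
call_records = [
    "2024-03-09T10:15:00,303-555-0101,720-555-0199,00:04:31",
    "2024-03-09T10:16:12,303-555-0101,415-555-0142,00:00:58",
    "2024-03-09T10:19:40,212-555-0177,720-555-0199,00:12:03",
]

for record in call_records:
    # The same split applies to every record because the structure repeats.
    timestamp, caller, callee, duration = record.split(",")
    print(f"context: caller={caller}, callee={callee}, at {timestamp}")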

And when you look at all of the repetitive records inside the big data storage blocks, the same type of data is in each record, in precisely the same format. Fig. 4.2.5 shows that the repetitive records have the same identifying information in exactly the same structure.

Fig. 4.2.5 Repetitive records have the same context.

From the standpoint of repetitiveness and predictability, big data indeed has very structured data inside it.

So, in answer to the question “does big data have structure?”: if structure is taken to mean a structured DBMS infrastructure, then big data does not contain structured data. But if big data is looked at from the standpoint of containing repetitive data with predictable context, then big data can be said to be structured.

The answer to the question then is neither yes nor no. The answer to the question depends entirely on the definition of what is meant by structured and unstructured.

Nonrepetitive Data

Even though big data can contain structured data, it can also contain what is called “nonrepetitive” data. Nonrepetitive records are records whose structure and content are entirely independent of one another. Where there is nonrepetitive data, it is entirely an accident if any two records resemble each other, either in content or in structure.

There are many examples of nonrepetitive data. Some examples of nonrepetitive data include the following:

  • E-mails
  • Call center information
  • Health-care records
  • Insurance claim information
  • Warranty claim information

Nonrepetitive information contains indicative information. But the indicative information found in nonrepetitive records is very irregular. There simply is no pattern to the contextual information found in nonrepetitive data.

Context in Nonrepetitive Data

Fig. 4.2.6 shows that the blocks of nonrepetitive data found in the big data environment are very irregular in shape and structure.

Fig. 4.2.6 Nonrepetitive data.

There is contextual data found in the nonrepetitive records of data. But the contextual data must be extracted in a very customized manner (Fig. 4.2.7).

Fig. 4.2.7 Context—found in different places and in different ways.

Context is found in nonrepetitive data. However, it is not found there in the same manner as it is found in repetitive data or in the classical structured data of a standard DBMS.

In later chapters, the subject of textual disambiguation will be addressed. It is through textual disambiguation that the context of nonrepetitive data is derived.
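
As a foretaste, the sketch below hints at the kind of customized, rule-by-rule extraction involved. The sample e-mail and the patterns are invented for illustration and stand in for the far richer processing that textual disambiguation actually performs.

# A toy hint at textual disambiguation: context in nonrepetitive data must
# be teased out with customized rules, one handcrafted pattern at a time.
import re

email = """From: jane.doe@example.com
Subject: Claim 48-A112
My policy number is POL-99321 and the repair estimate was $1,480."""

# Each rule targets one kind of context (all patterns are illustrative).
rules = {
    "sender": re.compile(r"^From:\s*(\S+@\S+)", re.MULTILINE),
    "claim":  re.compile(r"Claim\s+([\w-]+)"),
    "policy": re.compile(r"\b(POL-\d+)\b"),
    "amount": re.compile(r"\$([\d,]+)"),
}

context = {}
for name, rule in rules.items():
    match = rule.search(email)
    context[name] = match.group(1) if match else None

print(context)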

There is another way to look at the repetitive and the nonrepetitive data found in big data. That perspective is shown in Fig. 4.2.8.

Fig. 4.2.8 Repetitive data and nonrepetitive data in big data.

Fig. 4.2.8 shows that, measured by volume, the vast majority of the data found in big data is typically repetitive data. Nonrepetitive data makes up only a small fraction of the data found in big data when examined from the perspective of volume.

However, Fig. 4.2.9 shows a very different perspective.

Fig. 4.2.9 A different perspective.

Fig. 4.2.9 shows that, from the perspective of business value, the vast majority of the value found in big data lies in nonrepetitive data.

There is, then, a real mismatch between the volume of data and the business value of data. People who examine repetitive data hoping to find massive business value there most likely have disappointment in their future. But people looking for business value in nonrepetitive data have a lot to look forward to.

When comparing the search for business value in repetitive and nonrepetitive data, an old adage applies: “90% of the fishermen fish where 10% of the fish are.” The converse of the adage is that “10% of the fishermen fish where 90% of the fish are.”
