Chapter 8.2

Big Data/Existing System Interface

Abstract

Data architecture began with simple storage devices. But soon, the need to store lots of data and to access the data quickly caused these early devices to disappear. In their place came disk storage. With disk storage, data could be accessed directly. But the need for managing volumes of data surpassed the capacity of disk storage. One day, there appeared big data. And with big data came the ability to store effectively unlimited amounts of data. But as big data grew, the older day-to-day systems did not go away. There began to be a need for a rational way to interface legacy systems to big data.

Keywords

Storage device; Paper tape; Punched cards; Disk storage; Direct access of data; Big data; Interfacing corporate data and big data

One of the challenges of information systems is determining how they all fit together. In particular, how does big data fit with the existing system environment? There is no question that big data brings new opportunities for information and decision-making to the organization. And there is no question that big data has great promise. But big data is not a replacement for the existing system environment. In fact, big data accomplishes one task, and the existing system environment accomplishes another task. They are (or should be!) complementary to each other.

So exactly how does big data need to interface and interact with the existing system environment?

The Big Data/Existing Systems Interface

Fig. 8.2.1 shows the recommended way in which big data and existing systems interface with each other.

Fig. 8.2.1
Fig. 8.2.1 The big data/data warehouse interface.

Fig. 8.2.1 shows the overall system flow between big data and the existing system environment.

Each of the interfaces will be discussed in detail.

Raw big data is divided into two distinct sections (see the “great divide”). There is repetitive raw big data and nonrepetitive raw big data. Repetitive raw big data is handled entirely differently than nonrepetitive raw big data.

The Repetitive Raw Big Data/Existing Systems Interface

The interface from repetitive raw big data to existing system environment in some ways is the simplest interface. In many ways, this interface is like a distillation process. The mass of data found in raw repetitive big data is winnowed down—distilled—into the few records that are of interest.

The repetitive raw big data is processed by parsing each record. And when the records that are of interest are located, the records of interest are then edited and passed to the existing system environment. In such a fashion, the data that are of interest are distilled from the mass of records found in the raw repetitive big data environment. One assumption made by this interface is that the vast majority of records found in the repetitive component of raw big data will not be passed to the existing system environment. The assumption is that only a few records of interest are to be found.
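The parse-and-filter pass described above can be pictured as a short sketch. The record layout and the exception condition below are hypothetical, chosen only to illustrate the distillation idea:

```python
# Minimal sketch of the distillation process: parse each repetitive
# record, keep only the few records of interest, and pass those on.
# The record format and the "defective" flag are invented examples.

def parse(raw_line):
    """Parse one raw repetitive record into a dict (hypothetical layout)."""
    product_id, status = raw_line.split(",")
    return {"product_id": product_id, "status": status}

def distill(raw_records):
    """Keep only the exceptional records destined for existing systems."""
    for line in raw_records:
        record = parse(line)
        if record["status"] == "defective":   # the exception condition
            yield record                      # edited and passed downstream

raw = ["p1,ok", "p2,ok", "p3,defective", "p4,ok"]
selected = list(distill(raw))
# Only the single defective record survives the distillation.
```

The key property of the interface is visible here: the output is a tiny fraction of the input.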

In order to explain this assumption, consider a few cases.

Manufacturing—a manufacturer makes a product. The quality of the product is quite high. On the average, only one out of 10,000 products is defective. However, the defective products are still a bother. All the product manufacturing information is stored in big data. But only the information about the defective products is brought to the existing systems environment for further analysis. In this case, based on a percentage basis, very little data are brought to the existing system environment.

Telephone calls (call record details)—on a daily basis, millions of telephone calls are made. But of those millions of telephone calls, only a handful—maybe three or four—are of interest. Only the phone calls that are of interest are brought from the big data environment to the existing system environment.

Log tape analysis—a log tape of transactions is created. In a day, tens of thousands of log tape entries are created. But only a few hundred entries on the log tape are of interest. Those few hundred log tape entries that are of interest are the only entries that find their way back into the existing system environment for further analysis.

Metering—an organization collects metering data. The vast majority of the metering activity is normal and not of particular interest. But on a few days of the year, certain metering data react in an unexpected manner. Only those readings that have reacted abnormally are brought to the existing system environment for further analysis.

And there are many more examples of repetitive raw big data being examined for exceptional data.

As a rule, when data go from the big data environment to the existing system environment, it is convenient to place the data in a data warehouse. However, if there is a need, data can be sent elsewhere in the existing system environment.

Exception Based Data

Once the data in the raw repetitive big data environment are selected (usually chosen on an “exception basis”) and are then moved to the existing system environment, the exception-based data can undergo all sorts of analysis, such as the following:

  • Pattern analysis. Why are the records that have been chosen exceptional? Is there a pattern of activity external to the records that matches the behavior of the records?
  • Comparative analysis. Is the number of exceptional records increasing? Decreasing? What other events are happening concurrent to the collection of the exceptional records?
  • Growth analysis of exceptional records over time. Over time, what is happening to the exceptional records that have been collected from big data?

And there are MANY more ways to analyze the data that have been collected.

Fig. 8.2.2 shows the interface from big data to the existing system environment.

Fig. 8.2.2
Fig. 8.2.2 Big data contains repetitive data and nonrepetitive data.

The Nonrepetitive Raw Big Data/Existing Systems Interface

The interface from the nonrepetitive raw big data environment is very different from the repetitive raw big data interface. The first major difference is in the percentage of data that are collected. Whereas in the repetitive raw big data interface, only a small percentage of the data are selected, in the nonrepetitive raw big data interface, the majority of the data are selected. This is because there is business value in the majority of the data found in the nonrepetitive raw big data environment, whereas there is little business value in the majority of the data found in the repetitive raw big data environment.

But there are other major differences as well.

The second major difference in the environments is in terms of context. In the repetitive raw big data environment, context is usually obvious and easy to find. In the nonrepetitive raw big data environment, context is not obvious at all and is not easy to find. It is noted that context is in fact there in the nonrepetitive big data environment; it just is not easy to find and is anything but obvious.

In order to find context, the technology of textual disambiguation is needed. Textual disambiguation reads the nonrepetitive data in big data and derives context from the data. (See the chapter on textual disambiguation and taxonomies for a more complete discussion of deriving context from nonrepetitive raw big data.)
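The idea of deriving context can be suggested with a toy sketch. Real textual disambiguation is far more elaborate than this; the taxonomy, the words in it, and the sample text below are all invented for illustration:

```python
# Toy illustration of deriving context from nonrepetitive text by
# matching words against a taxonomy. The taxonomy entries and the
# sample sentence are hypothetical.

taxonomy = {
    "broken": ("condition", "damaged"),
    "cracked": ("condition", "damaged"),
    "happy": ("sentiment", "positive"),
}

def derive_context(text):
    """Return (word, context_class, context_value) triples found in text."""
    hits = []
    for word in text.lower().split():
        if word in taxonomy:
            cls, val = taxonomy[word]
            hits.append((word, cls, val))
    return hits

context = derive_context("The casing arrived cracked and the customer is not happy")
```

The point of the sketch is that raw text goes in and text plus explicitly attached context comes out, which is what makes the output usable downstream.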

While most of the nonrepetitive raw big data is useful, some percentage of data are not useful and are edited out by the process of textual disambiguation.

Once the context is derived, the output can then be sent either to the existing system environment or back into the big data environment.

Fig. 8.2.3 shows the interface from nonrepetitive raw big data to textual disambiguation.

Fig. 8.2.3
Fig. 8.2.3 Textual ETL is used for nonrepetitive data.

Into the Existing Systems Environment

Once data have come from nonrepetitive raw big data and have passed through textual disambiguation, the data can be passed to the existing system environment.

As the data are passed through textual disambiguation, they are greatly simplified. Context is derived, and each unit of text that passes the filtering process is turned into a flat file record. The flat file record is very reminiscent of a standard relational record. There are key and dependent data, as is found in a relational format.

The output can be sent to a load utility so that the output data can be placed in whatever DBMS is desired. Typical output DBMS include Oracle, Teradata, UDB/DB2, and SQL Server.
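The flat file record with key and dependent data, and the loading of that record into a DBMS, can be sketched as follows. SQLite is used here only as a stand-in for the commercial DBMS named above, and the table layout is hypothetical:

```python
import sqlite3

# Sketch of loading textual-ETL output into a relational DBMS.
# SQLite stands in for Oracle, Teradata, UDB/DB2, or SQL Server;
# the flat record layout (key plus dependent columns) is invented.

flat_records = [
    ("doc1", "sentiment", "negative"),
    ("doc2", "condition", "damaged"),
]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE textual_output (
                    doc_id        TEXT,   -- key data
                    context_class TEXT,   -- dependent data
                    context_value TEXT)""")
conn.executemany("INSERT INTO textual_output VALUES (?, ?, ?)", flat_records)
conn.commit()
loaded = conn.execute("SELECT COUNT(*) FROM textual_output").fetchone()[0]
```

Because the output is an ordinary relational record, any standard load utility for the target DBMS can carry it the rest of the way.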

Fig. 8.2.4 shows the movement of data into the existing system environment in the form of a standard DBMS.

Fig. 8.2.4
Fig. 8.2.4 Among other things, textual ETL adds context to nonrepetitive data.

The “Context Enriched” Big Data Environment

The other route that data can take after they pass through textual disambiguation is that the output of data can be placed back into big data. There may be several reasons for wanting to send output back into big data. Some of the reasons include the following:

  • The volume of data. There may be a lot of output from textual disambiguation. The sheer volume of data may dictate that the output data be placed back into the big data environment.
  • The nature of the data. In some cases, the output data may have a natural fit with the other data placed in the big data environment. Future analytic processing may be greatly enhanced by placing output data back into big data.

In any case, after data pass through textual disambiguation and are placed back into big data, they enter big data in a very different state. When data have passed through textual disambiguation and are placed back into the big data environment, they are placed into the environment with the context of the data clearly identified and prominently attached as part of the data in big data.

By placing the output of textual disambiguation back into big data, there now is a section of big data that can be called the context-enriched section of big data. From a structural standpoint, the context-enriched component of big data looks very similar to repetitive raw big data. The only difference is that the context-enriched component of big data has context open, obvious, and attached to the data in this component of big data.

Fig. 8.2.5 shows that output data from textual disambiguation can be placed back into big data.

Fig. 8.2.5
Fig. 8.2.5 Textual ETL can place its results back into big data.

Another perspective of the big data environment is shown in Fig. 8.2.6.

Fig. 8.2.6
Fig. 8.2.6 Nonrepetitive data can contain raw data and context enriched data.

In Fig. 8.2.6, it is seen that there is the division of big data into the repetitive and nonrepetitive sections. However, in the repetitive section, it is seen that when context-enriched big data is added to the big data environment, those context-enriched data simply become another type of repetitive data. Stated differently, there are two types of repetitive data in big data: simple repetitive data and context-enriched repetitive data.

This division becomes important when doing analytic processing. Simple repetitive data are analyzed in a completely different fashion than context-enriched repetitive data.

Analyzing Structured Data/Unstructured Data Together

The final interface of interest involves data that have come out of big data either through the distillation process or through textual disambiguation. The data that arrive here can be placed into a standard DBMS.

Fig. 8.2.7 shows the database that has been created from unstructured data being placed in the same environment as the classical data warehouse. Of course, the data in the classical data warehouse have been created from structured data entirely.

Fig. 8.2.7
Fig. 8.2.7 The analytical environment can encompass both classical structured data and data whose origins are unstructured.

Fig. 8.2.7 shows that data whose origins are quite different can be placed in the same analytic environment. The DBMS may be Oracle or Teradata. The operating system may be Windows or Linux. In any case, doing analytic processing against the two databases is as easy as doing a relational join.

In such a manner, it is easy and natural to do analytic processing against data from the two different environments. This means that structured data and unstructured data can be used together analytically.
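The relational join across the two origins can be sketched as follows. SQLite again stands in for the production DBMS, and the table and column names are invented for illustration:

```python
import sqlite3

# Sketch of analyzing structured and unstructured-origin data together:
# a classical warehouse table joined with a table built from textual
# ETL output. Table and column names are hypothetical.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_sales   (customer_id TEXT, revenue REAL);
    CREATE TABLE textual_feedback  (customer_id TEXT, sentiment TEXT);
    INSERT INTO warehouse_sales  VALUES ('c1', 100.0), ('c2', 250.0);
    INSERT INTO textual_feedback VALUES ('c1', 'negative'), ('c2', 'positive');
""")

# A plain relational join brings both origins into one analysis.
rows = conn.execute("""
    SELECT s.customer_id, s.revenue, f.sentiment
    FROM warehouse_sales s
    JOIN textual_feedback f ON s.customer_id = f.customer_id
    ORDER BY s.customer_id
""").fetchall()
```

Once both databases sit side by side in the same environment, revenue figures born as structured data and sentiment born as unstructured text can be analyzed in a single query.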

By combining these two types of data together, entirely new vistas of analytic processing open up.
