Chapter 4.3

Parallel Processing

Abstract

There are different definitions of big data. The definition used here is that big data encompasses a lot of data, is based on inexpensive storage, manages data by the “Roman census” method, and stores data in an unstructured format. There are two major types of big data—repetitive big data and nonrepetitive big data. Only a small fraction of repetitive big data has business value, whereas almost all of nonrepetitive big data has business value. In order to achieve business value, the context of data in big data must be determined. Contextualization of repetitive big data is easily achieved. But contextualization of nonrepetitive data is done by means of textual disambiguation.

Keywords

Big data; Roman census method; Unstructured data; Repetitive data; Nonrepetitive data; Contextualization; Textual disambiguation

The very essence of big data is the ability to handle very large volumes of data. Fig. 4.3.1 symbolically depicts a lot of data.

Fig. 4.3.1
Fig. 4.3.1 A lot of data.

The volume of data that must be handled in the big data environment is so large that merely loading, accessing, and manipulating the data is a real challenge. It is safe to say that no single computer is capable of handling all the data that can be accumulated in the big data environment.

The only possible strategy is to use multiple processors to handle the volume of data found in big data. In order to understand why it is mandatory to use multiple processors, consider the (old) story about the farmer who drives his crop to the marketplace in a wagon. When the farmer is first starting out, he doesn’t have much of a crop. He uses a donkey to pull the wagon. But as the years pass by, the farmer raises bigger crops. Soon, he needs a bigger wagon. And he needs a horse to pull the wagon. Then, one day, the crop that is put in the wagon becomes immense, and the farmer doesn’t just need a horse. The farmer needs a large Clydesdale horse.

Time passes, and the farmer prospers even more, and the crop continues to grow. One day, even a Clydesdale horse is not large enough to pull the wagon. The day comes when multiple horses are required to pull the wagon. Now, the farmer has a whole new set of problems. A new rigging is required. A trained driver is required to coordinate the team of horses that pull the wagon.

The same phenomenon occurs when there are lots of data. Multiple processors are required to load and manipulate the volumes of data found in big data.

In a previous chapter, there was a discussion of the “Roman census” method. The Roman census method is one of the ways in which parallelization of processing for the management of large amounts of data can occur.

Fig. 4.3.2 depicts the parallelization that occurs in the Roman census approach.

Fig. 4.3.2
Fig. 4.3.2 Processors linked together to provide parallel processing.

Fig. 4.3.2 shows that multiple processors are linked together to operate in a coordinated manner. Each processor controls and manages its own data. Collectively, the data that are managed constitute the volumes of data known as “big data.”

Note that the network is irregular in terms of its shape. Note that new nodes can be easily added to the network. Also note that the processing that occurs in one node is entirely independent of the processing that occurs in another node. Fig. 4.3.3 shows that several nodes can process data at the same time as other nodes.

Fig. 4.3.3
Fig. 4.3.3 Processors executing independently.
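To make the independence of the nodes concrete, the following is a minimal sketch (in Python) that simulates each node as a separate worker process operating only on its own shard of data. The shard contents and the process_shard function are illustrative assumptions, not features of any particular big data product.

```python
# A minimal sketch of independent, node-by-node processing. Each "node" is
# simulated by a worker process, and each shard is simply the list of records
# that node owns. The data and function names are illustrative only.
from multiprocessing import Pool

SHARDS = [
    ["rec-a1", "rec-a2", "rec-a3"],   # data owned by node A
    ["rec-b1", "rec-b2"],             # data owned by node B
    ["rec-c1", "rec-c2", "rec-c3"],   # data owned by node C
]

def process_shard(shard):
    # Each node works only on its own data; no node waits on another.
    return [record.upper() for record in shard]

if __name__ == "__main__":
    with Pool(processes=len(SHARDS)) as pool:
        # The shards are processed at the same time, independently of one another.
        results = pool.map(process_shard, SHARDS)
    print(results)
```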

An interesting thing about parallelization is that the total number of machine cycles required to process big data is not reduced by parallelization. In fact, the total number of machine cycles actually increases with parallelization, because processing must now be coordinated across the different nodes. What parallelization reduces is the total elapsed time. The more parallelization there is, the less elapsed time it takes to manage the data found in big data.
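A back-of-the-envelope calculation makes the trade-off concrete. The numbers below (one billion records, one microsecond of work per record, one hundred nodes, and a 5% coordination overhead) are assumed purely for illustration.

```python
# Illustrative arithmetic only: total machine time goes up slightly with
# parallelization, while elapsed time drops dramatically.
records = 1_000_000_000
seconds_per_record = 1e-6
nodes = 100
coordination_overhead = 1.05   # assumed 5% extra work to coordinate the nodes

serial_elapsed = records * seconds_per_record                                  # ~1,000 s elapsed and of machine time
parallel_machine_time = records * seconds_per_record * coordination_overhead   # ~1,050 s of machine time in total
parallel_elapsed = parallel_machine_time / nodes                               # ~10.5 s elapsed

print(serial_elapsed, parallel_machine_time, parallel_elapsed)
```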

There are different forms of parallelization. The Roman census method is not the only form of parallelization. Another classical form of parallelization is that seen in Fig. 4.3.4.

Fig. 4.3.4
Fig. 4.3.4 An MPP—massively parallel processor.

The form of parallelization seen in Fig. 4.3.4 is called the “massively parallel processing” (MPP) approach to the management of data. In the MPP form of parallelization, each processor controls its own data (as is the case where the Roman census approach is used). But in the MPP approach, there is tight coordination of processing across the nodes. This tight control of the nodes is possible because, before the data are loaded, they are parsed and defined to fit the MPP data structure. Fig. 4.3.5 shows the parsing and fitting of the data to the MPP structure.

Fig. 4.3.5
Fig. 4.3.5 Text is parsed then placed in the appropriate processor.

Fig. 4.3.5 shows that in the MPP architecture, the parsing of the data greatly affects the placement of the data. One record is placed on one node. Another record is placed on another node.

The great benefit of parsing the data and using the parsing information as the basis for the placement of data is that the data are efficient to locate. When an analyst wishes to locate a unit of data, the analyst specifies the value of interest to the system. The system applies the same algorithm that was used to place the data into the database (typically a hashing algorithm) and locates the data very efficiently.
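A minimal sketch of this hash-based placement and lookup is shown below. The node count, key format, and helper names are hypothetical, and a real MPP system would use a fixed, persistent hash function rather than Python's built-in hash (which is only stable within a single run).

```python
# A sketch of hash-based placement: the same hash function decides where a
# record is stored at load time and where to look for it at query time.
NUM_NODES = 4
nodes = [dict() for _ in range(NUM_NODES)]   # each dict stands in for one node's data

def node_for(key):
    # Placement algorithm; stable within one run of this script.
    return hash(key) % NUM_NODES

def load(key, record):
    nodes[node_for(key)][key] = record

def lookup(key):
    # Only one node has to be touched to find the record.
    return nodes[node_for(key)].get(key)

load("cust-1001", {"name": "Jones", "balance": 250})
load("cust-2002", {"name": "Smith", "balance": 975})
print(lookup("cust-2002"))
```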

In the Roman census approach to parallelization, the sequence of events is different from the MPP approach. In the Roman census approach, a query is sent to the system to search for some data. The data managed by each node are searched and then parsed. Upon parsing, the system knows whether it has found the data being sought.

Fig. 4.3.6 shows the parsing that occurs.

Fig. 4.3.6
Fig. 4.3.6 Parsing is done in parallel.

It is seen from Fig. 4.3.6 that in order to find a single instance of data, quite a bit of work has to be done by the system. But, given that there are lots of processors, the elapsed time to do the search can be cut to a reasonable amount. If it were not for parallelism, the amount of time required to do a search would be prohibitive.
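As an illustration of the scan-then-parse sequence, the sketch below broadcasts a search to every node's raw, unparsed records and parses each record only at read time. The record layout, node names, and search value are assumptions made for the example.

```python
# A sketch of the Roman census search pattern: every node scans and parses its
# own raw data; nothing was parsed in advance of the query.
node_data = {
    "node1": ["INV 1001|Jones|250.00", "INV 1002|Smith|975.00"],
    "node2": ["INV 2001|Chan|412.50"],
    "node3": ["INV 3001|Ortiz|88.00", "INV 3002|Jones|130.00"],
}

def search_node(lines, wanted_name):
    hits = []
    for line in lines:
        _, name, amount = line.split("|")   # parsing happens at read time
        if name == wanted_name:
            hits.append((name, float(amount)))
    return hits

# Every node does the work; with real parallelism the nodes would run at once.
results = [hit for lines in node_data.values() for hit in search_node(lines, "Jones")]
print(results)
```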

There is some good news, however. The good news is that parsing repetitive data is a fairly straightforward exercise. Fig. 4.3.7 shows the parsing of repetitive data.

Fig. 4.3.7
Fig. 4.3.7 Parsing repetitive data.

Fig. 4.3.7 shows that in the case of repetitive data in big data, the parsing algorithm is fairly straightforward. Relative to the other data found in the repetitive record, there is very little contextual information, and where there is contextual information, it is found easily. This means that the work done by the parser is fairly simple. (Note: the term “simple” here is entirely relative to the work that must be done by the parser elsewhere.)
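Because every repetitive record carries the same fields in the same order, the parser can be very small. The delimited layout and field names below are illustrative assumptions, not a standard format.

```python
# A sketch of parsing one repetitive record: the structure repeats, so a single
# split and zip recover whatever context the record carries.
FIELDS = ["call_id", "date", "duration_sec", "calling_number"]

def parse_repetitive(record):
    return dict(zip(FIELDS, record.split(",")))

print(parse_repetitive("88231,2015-06-01,342,303-555-0147"))
```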

Contrast the parsing of repetitive data with the parsing of nonrepetitive data.

Fig. 4.3.8 shows the parsing of nonrepetitive data.

Fig. 4.3.8
Fig. 4.3.8 Parsing nonrepetitive data.

The parsing of nonrepetitive data is an entirely different matter than the parsing of repetitive data. In fact, the parsing of nonrepetitive data is often referred to as textual disambiguation. There is much more to the reading of nonrepetitive data than merely parsing it.

However it is done, nonrepetitive data are read and turned into a form that can be managed by a database management system.

There is a very good reason why nonrepetitive data require much more than a parsing algorithm. The reason is that context in nonrepetitive data hides in many and complex forms. For that reason, textual disambiguation is usually done external to the nonrepetitive data in big data. (In other words, because of the inherent complexity of nonrepetitive data, textual disambiguation is done outside of the database system that manages big data.)
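The sketch below illustrates only the general pattern, namely that free text is processed outside the data store and reduced to structured rows before loading. It is not a stand-in for real textual disambiguation, which involves taxonomies, homograph resolution, and much more; the sample text, pattern, and field names are assumptions.

```python
# A deliberately simplified sketch: nonrepetitive text is processed externally
# and reduced to structured rows, and it is the rows that a DBMS ends up managing.
import re

comment = "Patient reports chest pain for 3 days; prescribed aspirin 81 mg."

rows = []
for match in re.finditer(r"(\d+)\s*(days|mg)", comment):
    value, unit = match.groups()
    rows.append({"source": "doctor_note", "value": int(value), "unit": unit})

print(rows)
```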

An issue related to parallel processing in the big data environment is the efficiency of queries. As seen in Fig. 4.3.6, when a simple query is run against big data, the entire set of data contained in big data must be parsed. Even though the data are managed in parallel, such a full database scan consumes many machine resources.

An alternate approach is to scan the data once and create a separate index. This approach works only for repetitive data, not for nonrepetitive data. Once the index for the repetitive data is created, it can be scanned much more efficiently than the underlying data, and there no longer is a need to do a full table scan every time big data needs to be searched.

Of course, the index must be maintained. Every time data are added to the big data collection of repetitive data, an update to the index is required.
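A minimal sketch of such an index is shown below, assuming the designer has decided in advance that queries will be made by customer id (the choice that makes the index application-specific). The record layout and helper names are illustrative.

```python
# A sketch of a separate index over repetitive records: every insert must also
# update the index, and lookups use the index instead of a full scan.
from collections import defaultdict

records = []                            # the growing collection of repetitive data
index_by_customer = defaultdict(list)   # customer id -> positions of matching records

def add_record(record):
    records.append(record)
    index_by_customer[record["customer_id"]].append(len(records) - 1)

def find_by_customer(customer_id):
    return [records[i] for i in index_by_customer.get(customer_id, [])]

add_record({"customer_id": "C-10", "amount": 19.99})
add_record({"customer_id": "C-11", "amount": 5.00})
add_record({"customer_id": "C-10", "amount": 42.00})
print(find_by_customer("C-10"))
```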

In addition, the designer must know what contextual information is available at the moment the index is built.

Fig. 4.3.9 shows the building of an index from the contextual data found in repetitive data.

Fig. 4.3.9
Fig. 4.3.9 Building an index on repetitive data.

One of the issues with creating a separate index on repetitive data is that the index that is created is application-specific. The designer must know what data to look for before the index is built.

Fig. 4.3.10 displays the application-specific nature of building an index for repetitive data in big data.

Fig. 4.3.10
Fig. 4.3.10 The application-specific nature of building an index on repetitive data.