New computing paradigm with Spark

In this section, we present a brief chronology of Spark that shows how it evolved and emerged as a revolution in big data processing and cluster computing. We will also briefly describe the Spark ecosystem to understand Spark's features and facilities in more detail.

Traditional distributed computing

The traditional data processing paradigm is commonly referred to as the client-server model, in which data is moved to the code. The database server (or simply the server) was mainly responsible for performing the data operations and then returning the results to the client-side (or simply the client) program. However, as the number of tasks to be computed grew, the variety of operations and client devices also started to increase exponentially. As a result, an increasingly complex array of computing endpoints began to accumulate behind the servers. To sustain this computing model, the application (client) servers and the database servers have to be scaled in balance to store and process the growing number of operations. Consequently, the data propagated between nodes and transferred back and forth across the network also increases drastically, so the network itself becomes a performance bottleneck, and the performance of this kind of computing paradigm (in terms of both scalability and throughput) inevitably degrades. This is shown in the following diagram:


Figure 1: Traditional distributed processing in action.

After the successful completion of the Human Genome Project in the life sciences, real-time IoT data, sensor data, data from mobile devices, and data from the Web have been creating a data deluge and contributing to big data, which in turn has driven the evolution of data-intensive computing. Data-intensive computing is now growing rapidly, and it requires an integrated infrastructure or computing paradigm so that computational resources and data can be brought onto a common platform and analytics performed on top of it. The reasons are diverse: big data is truly huge in terms of complexity (volume, variety, and velocity) and, from the operational perspective, the four Ms (that is, move, manage, merge, and munge).

In addition, since we will be talking about large-scale machine learning applications in this book, we also need to consider some additional and critical assessment parameters, such as validity, veracity, value, and visibility, to grow the business. Visibility is important because, suppose you have a big dataset with a size of 1 PB; if there is no visibility, everything is a black hole. We will explain more about big data values in upcoming chapters.

It may not be feasible to store and process these large-scale and complex big datasets on a single system; therefore, they need to be partitioned and stored across multiple physical machines. Even when big datasets are partitioned or distributed, rigorously processing and analyzing them still requires both the database servers and the application servers to be scaled up to intensify the processing power at large scale. Again, the same performance bottleneck issues arise, at their worst and in multiple dimensions, which calls for a new, more data-intensive big data processing and computing paradigm.

Moving code to the data

To overcome the issues mentioned previously, a new computing paradigm is desperately needed so that, instead of moving the data to the code/application, we move the code or application to the data and perform the data manipulation, processing, and associated computing at home (that is, where the data is stored). Now that you understand the motivation and objective, this reversed programming model can be called move code to data and do parallel processing on a distributed system, which can be visualized in the following diagram:


Figure 2: New computing (move code to data and do parallel processing on distributed system).

To understand the workflows illustrated in Figure 2, we can envisage a new programming model described as follows:

  • A big data processing job is initiated by your application on your personal computer (let's call it the Driver Program), which coordinates the execution remotely across multiple computing nodes in a cluster or grid or, more broadly speaking, a cloud.
  • What you need to do is transfer your developed application/algorithm/code segments (which could be invoked using the command line or shell scripting, or written in a simple programming language notation) to the computing/worker nodes (which have large storage, main memory, and processing capability). We can simply assume that the data to be computed or manipulated is already stored on those computing nodes as partitions or blocks.
  • It is also understandable that the bulk data no longer needs to be transferred (uploaded/downloaded) to your driver program, which would hit the network or computing bottleneck; instead, the driver program only holds a data reference in its variables, which is basically an address (hostname/IP address with a port) used to locate the physical data stored on the computing nodes in a cluster (of course, a bulk upload could be performed using other solutions, such as scalable provisioning, to be discussed in later chapters).
  • So what do the remote computing nodes have? They have both the data and the code to perform the data computations and necessary processing, materializing the output or modified data without leaving their home (more technically, the computing nodes).
  • Finally, upon your request, only the results are transferred across the network to your driver program for validation or further analytics, since they are typically only small subsets of the original datasets.

It's worth noting that by moving the code to the data, the computing structure has changed drastically. Most interestingly, the volume of data transferred across the network is significantly reduced. The justification is that you transfer only a small chunk of software code to the computing nodes and receive only a small subset of the original data as results in return. This is the most important paradigm shift for big data processing that Spark brought to us with the concept of RDDs, datasets, DataFrames, and other lucrative features, which imply a great revolution in the history of big data engineering and cluster computing. However, for brevity, in the next section we will only discuss the concept of RDDs; the other computing features will be discussed in upcoming sections.

RDD – a new computing paradigm

To understand the new computing paradigm, we need to understand the concept of Resilient Distributed Datasets (RDDs), which is how Spark has implemented the data reference concept and, as a result, has been able to shift data processing to scale with ease. The basic thing about an RDD is that it helps you treat your input datasets almost like any other data object. In other words, it brings a diversity of input data types that you greatly missed in the Hadoop-based MapReduce framework.

An RDD provides fault tolerance in a resilient way, in the sense that it cannot be changed once created and the Spark engine will try to re-execute an operation if it fails. It is distributed because, once created and partitioned, an RDD is automatically distributed across the cluster by means of its partitions. RDDs let you play more with your input datasets, since RDDs can be transformed into other forms rapidly and robustly. In parallel, RDDs can also be dumped through an action and shared across applications that are logically co-related or computationally homogeneous. This is achievable because an RDD is part of Spark's general-purpose execution engine, which gains massive parallelism, so it can virtually be applied to any type of dataset.

However, to build an RDD and perform related operations on your inputs, the Spark engine requires you to draw a clear borderline between the data pointer (that is, the reference) and the input data itself. Basically, your driver program does not hold the data, but only a reference to the data, which is actually located on the remote computing nodes in a cluster.

To make data processing faster and easier, Spark supports two types of operations that can be performed on RDDs: transformations and actions (please refer to Figure 3). A transformation operation basically creates a new dataset from an existing one. An action, on the other hand, materializes a value to the driver program after a successful computation on the input datasets on the remote servers (the computing nodes, to be more exact).
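The following is a minimal sketch of these two kinds of operations in the Spark shell (Scala). The SparkContext sc is assumed to be provided by the shell, and the HDFS path is purely hypothetical; the transformations only build up a description of the computation, and nothing runs on the cluster until the action is called:

    // Assumption: Spark shell (Scala) with the SparkContext available as sc; the path is hypothetical.
    val lines  = sc.textFile("hdfs:///data/access.log")        // transformation: only a data reference, nothing is read yet
    val errors = lines.filter(line => line.contains("ERROR"))  // transformation: creates a new RDD from the existing one
    val count  = errors.count()                                // action: the computation runs on the computing nodes
    println(s"Number of ERROR lines: $count")                  // only a single number travels back to the driver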

The execution plan initiated by the driver program is built up as a Directed Acyclic Graph (DAG), where the nodes represent RDDs and the edges represent the transformation operations. However, the execution itself does not start on the computing nodes of a Spark cluster until an action operation is performed. Nevertheless, before the operation starts, the driver program sends the execution graph (which signifies the pipeline or workflow of the data computation) and the code block (as a domain-specific script or file) to the cluster, and each worker/computing node receives a copy from the cluster manager node:


Figure 3: RDD in action (transformation and action operation).

Before proceeding to the next section, we urge you to learn about the action and transformation operations in more detail, although we will discuss these two operations thoroughly in Chapter 3, Understanding the Problem by Understanding the Data. There are two types of transformation operations currently supported by Spark. The first is the narrow transformation, where no data shuffling is necessary. Typical Spark narrow transformations are performed using the filter(), sample(), map(), flatMap(), mapPartitions(), and other methods. A wide transformation, on the other hand, is essential when a wider change to your input datasets is needed, so that data from multiple partitions is shuffled and brought together on a common node. Wide transformation operations include groupByKey(), reduceByKey(), union(), intersection(), join(), and so on.
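As a rough illustration, the following word-count-style sketch (Scala, Spark shell with sc assumed, hypothetical input path) combines narrow and wide transformations; only the reduceByKey() step forces records with the same key to be shuffled onto a common partition:

    // Narrow transformations: each output partition depends on a single input partition.
    val words = sc.textFile("hdfs:///data/docs.txt")
      .flatMap(line => line.split("\\s+"))   // narrow
      .filter(word => word.nonEmpty)         // narrow

    // Wide transformation: records sharing a key are shuffled to the same partition.
    val counts = words
      .map(word => (word, 1))                // narrow
      .reduceByKey(_ + _)                    // wide

    counts.take(5).foreach(println)          // action: triggers the whole pipeline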

An action operation triggers the execution of the Directed Acyclic Graph (DAG) built from the transformations and returns the final result of the RDD computation to the Driver Program. The materialized results, including the intermediate transformation results of the data objects, are actually written to storage, and the final results are returned. Common actions include first(), take(), reduce(), collect(), count(), saveAsTextFile(), saveAsSequenceFile(), and so on. At this point, we believe that you have gained a basic understanding of the operations on top of RDDs, so we can now define an RDD and related programs in a natural way. A typical RDD programming model that Spark provides can be described as follows (a short code sketch follows the list):

  • From an environment variable, the SparkContext (the Spark shell or PySpark provides you with a SparkContext, or you can create your own; this will be described later in this chapter) creates an initial data reference, an RDD object.
  • Transform the initial RDD to create more RDD objects, following the functional programming style (to be discussed later on).
  • Send the code/algorithms/applications from the driver program to the cluster manager node. Then the cluster manager provides a copy to each computing node.
  • Computing nodes hold a reference to the RDD partitions they own (the driver program also holds a data reference). However, the computing nodes could also have the input dataset provided by the cluster manager.
  • After a transformation (either narrow or wide), the result generated is a brand-new RDD, since the original one is never mutated. Finally, the RDD object (or, more specifically, the data reference) is materialized through an action to dump the RDD into storage.
  • The Driver Program can request a chunk of the results from the computing nodes for analysis or visualization.
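The following is a minimal sketch of this programming model as a small standalone Scala application; it assumes Spark is available on the classpath, and the application name and HDFS paths are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddModelSketch {
      def main(args: Array[String]): Unit = {
        // The driver program creates its own SparkContext (the Spark shell would provide sc for you).
        val sc = new SparkContext(new SparkConf().setAppName("RDDProgrammingModel"))

        // Initial RDD: only a data reference; the blocks stay on the computing nodes.
        val input = sc.textFile("hdfs:///data/input.txt")

        // Transformations create new RDDs; the original ones are never mutated.
        val counts = input.flatMap(_.split("\\s+"))
          .map(word => (word.toLowerCase, 1))
          .reduceByKey(_ + _)                       // wide transformation, data is shuffled

        // Actions materialize the results: write to storage and pull back only a small chunk.
        counts.saveAsTextFile("hdfs:///data/word-counts")
        counts.take(10).foreach(println)

        sc.stop()
      }
    }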

Wait! So far we have moved along smoothly. We assume you will ship your application code to the computing nodes in the cluster, but you will still have to upload or send the input datasets to the cluster to be distributed among the computing nodes, and even during a bulk upload you will have to transfer the data across the network. We also assume that the size of the application code and of the results is negligible or trivial. Another obstacle is that if you want Spark to process the data at scale, it might require data objects to be merged from multiple partitions first. That means we will need to shuffle data among the worker/computing nodes, which is usually done by transformation operations such as repartition(), intersection(), and join().

So, frankly speaking, the data transfer has not been eliminated completely. As you can now appreciate, these operations, especially the bulk upload/download, contribute overheads whose corresponding outcomes are shown in the following diagram:


Figure 4: RDD in action (the caching mechanism).

Well, it's true that we have to live with these burdens. However, the situation can be tackled, or at least its cost reduced significantly, using Spark's caching mechanism. Imagine you are going to perform an action multiple times on the same RDD object, which has a long lineage; this will cause an increase in execution time as well as data movement inside a computing node. You can remove (or at least reduce) this redundancy with the caching mechanism of Spark (Figure 4), which stores the computed result of the RDD in memory and thereby eliminates the recurrent computation every time. When you cache an RDD, its partitions are loaded into the main memory of the nodes that hold it, instead of staying on disk (however, if there is not enough space in memory, the disk will be used instead). This technique enables big data applications on Spark clusters to outperform MapReduce significantly for each round of parallel processing. We will discuss Spark data manipulation and other techniques in more detail in Chapter 3, Understanding the Problem by Understanding the Data.
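As a minimal sketch of the caching mechanism (Scala, Spark shell with sc assumed, hypothetical log file), the cache() call below keeps the filtered partitions in executor memory, so the second and third actions reuse them instead of re-reading and re-filtering the input:

    val errors = sc.textFile("hdfs:///data/app.log")
      .filter(line => line.contains("ERROR"))
      .cache()                                                   // keep the computed partitions in memory

    val total    = errors.count()                                // first action: computes and caches the RDD
    val timeouts = errors.filter(_.contains("timeout")).count()  // reuses the cached partitions
    println(s"errors=$total, timeouts=$timeouts")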
