Chapter 1. Introduction to Data Analytics with Spark

This chapter provides an overview of Apache Spark, its computing paradigm, and its installation so that you can get started. It briefly describes the main components of Spark and focuses on its new computing advancements. Resilient Distributed Datasets (RDDs) and Datasets are described as base knowledge for the rest of this book. The chapter then focuses on the Spark machine learning libraries, and finally demonstrates how to install Spark and how to package and run a simple machine learning application with Spark and Maven. In a nutshell, the following topics will be covered in this chapter:

  • Spark overview
  • New computing paradigm with Spark
  • Spark ecosystem
  • Spark machine learning libraries
  • Installing and getting started with Spark
  • Packaging your application with dependencies
  • Running a simple machine learning application

Spark overview

This section describes the basics of Spark (https://spark.apache.org/), the issues with traditional parallel and distributed computing, how Spark evolved, and the new computing paradigm it brings to big data processing and analytics. We also present some of the exciting features of Spark that attract big data engineers, data scientists, and researchers, including:

  • Simplicity of data processing and computation
  • Speed of computation
  • Scalability and throughput across large-scale datasets
  • Sophistication across diverse data types
  • Ease of cluster computing and deployment with different cluster managers
  • Working capabilities and supports with various big data storage and sources
  • Diverse APIs written in widely used and emerging programming languages

Spark basics

Before praising Spark and its many virtues, a short overview is in order. Apache Spark is a fast, in-memory, general-purpose cluster computing framework for big data processing, with a rich set of sophisticated APIs for advanced data analytics. Unlike Hadoop-based MapReduce, which is suited only for batch jobs, Spark offers both speed and ease of use, and can be considered a general execution engine suitable for applying advanced analytics to both static (batch) and real-time data:

  • Spark was originally developed at the University of California, Berkeley's AMPLab and is built around Resilient Distributed Datasets (RDDs), which provide a fault-tolerant abstraction for in-memory cluster computing. Later, Spark's code base was donated to the Apache Software Foundation, making it open source, and open source communities have maintained it ever since. Spark provides an interface for performing data analytics on entire clusters at scale, with implicit data parallelism and fault tolerance, through its high-level APIs written in Java, Scala, Python, and R.

Spark 2.0.0 ships with higher-level libraries implementing the most widely used data analysis algorithms, including the following (a minimal Spark SQL usage sketch follows the list):

  • Spark SQL for querying and processing large-scale structured data
  • SparkR for statistical computing that provides distributed computing using programming language R at scale
  • MLlib for machine learning (ML) applications, which is internally divided into two parts: MLlib for RDD-based machine learning application development, and Spark ML, a high-level abstraction for developing complete data science and machine learning workflows
  • GraphX for large-scale graph data processing
  • Spark Streaming for handling large-scale real-time streaming data, which also brings a dynamic working environment to otherwise static machine learning
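
To make these libraries a little more concrete, the following is a minimal, self-contained Scala sketch of the Spark 2.0 entry point, SparkSession, used to run a Spark SQL query over a small in-memory DataFrame. The object name, the local[*] master, and the column names are illustrative assumptions, not details from the text:

```scala
import org.apache.spark.sql.SparkSession

object SparkSQLExample {
  def main(args: Array[String]): Unit = {
    // Spark 2.0 unified entry point; local[*] uses all local cores (an assumption for local testing)
    val spark = SparkSession.builder()
      .appName("SparkSQLExample")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // A tiny in-memory DataFrame with illustrative column names
    val people = Seq(("Alice", 29), ("Bob", 41), ("Carol", 35)).toDF("name", "age")

    // Register the DataFrame as a temporary view and query it with Spark SQL
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```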

Since its first stable release, Spark has experienced dramatic and rapid development, as well as wide adoption through active initiatives from a broad range of IT solution providers, open source communities, and researchers around the world. Recently, it has emerged as one of the most active and largest open source projects in big data processing and cluster computing, not only in terms of adoption, but also in deployments and surveys by IT professionals, data scientists, and big data engineers worldwide. As Matei Zaharia, the creator of Spark and CTO of Databricks, was quoted on the Big Data Analytics News website (http://bigdataanalyticsnews.com/apache-spark-3-real-world-use-cases/):

It's an interesting thing. There hasn't been as much noise about it commercially, but the actual developer community votes with its feet and people are actually getting things done and working with the project.

Many tech giants such as Yahoo, Baidu, Conviva, ClearStory, Hortonworks, Gartner, and Tencent are already using Spark in production, while IBM, DataStax, Cloudera, and BlueData provide commercial Spark distributions for the enterprise. These companies have enthusiastically deployed Spark applications at massive scale, collectively processing multiple petabytes of data on clusters of up to 8,000 nodes, the largest known Spark cluster.

Beauties of Spark

Are you planning to develop a machine learning (ML) application? If so, you probably already have some data that must be preprocessed before you train a model on it, and you will then use the trained model to make predictions on new data to gauge how well it generalizes. Is that all you need? Probably not, since there are other factors to consider: you will want your ML models to perform well in terms of accuracy, execution time, memory usage, throughput, tuning, and adaptability. And there is more: what happens when your application has to handle large training and new datasets at scale? Or what if, as a data scientist, you could build your ML models as a multi-step journey from data ingestion through trial and error to production, running the same machine learning code on a large cluster and on a personal computer without anything breaking? You can simply rely on Spark and relax.

Spark has several advantages over other big data technologies such as MapReduce (refer to https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html for MapReduce tutorials and to the research paper MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean et al., in Proc. of OSDI, 2004, to learn more) and Storm, a free and open source distributed real-time computation system (refer to http://storm.apache.org/ for more on Storm-based distributed computing). First of all, Spark provides a comprehensive, unified engine for managing big data processing requirements across a variety of datasets, from text and tabular data to graph data, and across data sources that are diverse in nature (batch and real-time streaming data). As a user (data science engineer, academic, or developer), you are likely to benefit from Spark's rapid application development through simple, easy-to-understand APIs for batch, interactive, and real-time streaming applications.

Working and programming with Spark is easy and simple. Let us show you an example. Yahoo, a Spark contributor and early adopter, implemented an ML algorithm in 120 lines of Scala code; with just 30 minutes of training on a large dataset of a hundred million records, the Scala algorithm was ready for business. Remarkably, the same algorithm had previously been written in 15,000 lines of C++ (refer to https://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/ for more). You can develop your applications in Java, Scala, R, or Python with a built-in set of over 100 high-level operators (mostly supported since Spark release 1.6.1) for transforming datasets, and use the DataFrame APIs for manipulating semi-structured, structured, and streaming data. In addition to Map and Reduce operations, Spark supports SQL queries, streaming data, machine learning, and graph data processing. Moreover, Spark provides interactive shells in Scala and Python for executing your code sequentially (in a SQL- or R-like fashion).
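
As a flavor of how concise those high-level operators are, here is a hypothetical word-count session as it might be typed into the Scala spark-shell; the input file name words.txt is an assumption used purely for illustration:

```scala
// Inside the Scala spark-shell, the SparkContext is already available as sc.
// words.txt is a hypothetical input file used only for illustration.
val lines = sc.textFile("words.txt")

// Split each line into words, map every word to a count of 1, and sum the counts per word
val counts = lines.flatMap(line => line.split("\\s+"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

// Print the ten most frequent words
counts.sortBy(_._2, ascending = false).take(10).foreach(println)
```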

The main reason Spark has been adopted so quickly comes down to two factors: speed and sophistication. Spark provides order-of-magnitude performance gains for many applications using coarse-grained, immutable, distributed collections of data called Resilient Distributed Datasets, which are spread across the cluster and can be stored in memory or on disk. An RDD is immutable in the sense that it cannot be changed once created, and it is resilient, and therefore fault-tolerant, because a lost partition can be recreated from its lineage in the middle of a computation. Furthermore, an RDD is distributed automatically across the cluster by means of partitions that hold your data. You can also keep your data in memory using Spark's caching mechanism, which enables applications to execute up to 100 times faster than Hadoop-based MapReduce for iterative in-memory workloads, and up to 10 times faster for disk-based operations.
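
The following short Scala sketch (assumed to run in the spark-shell, where the SparkContext sc is predefined; the numbers are purely illustrative) shows how transformations build up an RDD's lineage and how caching keeps the result in memory for repeated actions:

```scala
// Parallelize a local collection into an RDD partitioned across the cluster
val numbers = sc.parallelize(1 to 1000000, numSlices = 8)

// Transformations are only recorded in the RDD's lineage; nothing is computed yet
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// Cache the result in memory so that iterative jobs reuse it instead of recomputing it;
// if a cached partition is lost, Spark rebuilds just that partition from the lineage above
squares.cache()

println(squares.count())  // first action materializes and caches the RDD
println(squares.sum())    // second action reads the cached partitions
```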

Let's look at a surprising statistic about Spark and its computational power. Spark recently overtook Hadoop-based MapReduce by winning the 2014 Gray Sort Benchmark in the 100 TB category, an industry benchmark measuring how fast a system can sort 100 TB of data (1 trillion records) (refer to http://spark.apache.org/news/spark-wins-daytona-gray-sort-100tb-benchmark.html and http://sortbenchmark.org/). It has since become the open source engine for sorting at petabyte scale (refer to https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html for more information). For comparison, the previous world record set by Hadoop MapReduce used 2,100 machines and took 72 minutes, which means Spark sorted the same data three times faster using 10 times fewer machines. Moreover, you can seamlessly combine multiple libraries to develop large-scale machine learning and data analytics pipelines, and execute the job on various cluster managers such as Hadoop YARN or Mesos, or in the cloud, accessing data storage and sources such as HDFS, Cassandra, HBase, Amazon S3, or even an RDBMS. The job can also be executed in standalone mode on a local PC or cluster, or even on AWS EC2. Deployment of a Spark application on a cluster is therefore easy (we will show how to deploy a Spark application on a cluster later in this chapter).
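
As a hedged sketch of that composability (the HDFS path, column names, and model choice below are assumptions for illustration, not details from the text), a few lines of Scala can read data from a distributed store into a Spark SQL DataFrame and feed it to a Spark ML algorithm; identical code runs locally or on a YARN or Mesos cluster depending only on how it is submitted:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

val spark = SparkSession.builder().appName("PipelineSketch").getOrCreate()

// Hypothetical CSV on HDFS; an S3 bucket, a Cassandra table, or a local path would work the same way
val housing = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/housing.csv")

// Assemble the assumed numeric columns into a feature vector and fit a simple linear model
val assembler = new VectorAssembler()
  .setInputCols(Array("rooms", "area"))
  .setOutputCol("features")

val model = new LinearRegression()
  .setLabelCol("price")
  .fit(assembler.transform(housing))

println(s"Learned coefficients: ${model.coefficients}")
```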

The other beauties of Spark are that it is open source and platform independent. These are also among its greatest advantages: it is free to use, distribute, and modify, and you can develop applications on any platform. An open source project is also more secure, as the code is accessible to everyone and anyone can fix bugs as they are found. Consequently, Spark has evolved so rapidly that it has become the largest open source project in big data solutions, with 750+ contributors from 200+ organizations.
