132 | Big Data Simplified
val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _).collect()
Points to Ponder
❐ A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. RDDs are
immutable, fault-tolerant, distributed collections of objects.
❐ In an RDD, the dataset is divided into logical partitions, which are computed on different
nodes of the cluster.
❐ Spark operations are categorized into two types: transformations and actions.
❐ A pair RDD is a type of RDD in which each element is a <key, value> pair.
❐ The SparkContext object is created by the Spark driver.
❐ Spark is much faster than conventional MapReduce due to its in-memory execution.
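The distinction between transformations and actions noted above can be sketched in a few lines. This is a minimal sketch, assuming a SparkContext named sc is already available, as in the word-count example at the top of the page.

```scala
// Create an RDD from a local collection.
val numbers = sc.parallelize(1 to 10)

// Transformations are lazy: no computation happens here,
// Spark only records the lineage of operations.
val evens   = numbers.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// An action triggers the actual distributed computation.
val result = doubled.collect()   // Array(4, 8, 12, 16, 20)
```

Because transformations are lazy, Spark can optimize the whole chain before executing it, and recompute lost partitions from the lineage if a node fails.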
6.1.4 Spark Libraries: Spark SQL
The first library that we are going to study is Spark SQL. It is named as such because it lets
you work with your data much as you would with SQL. It allows developers to write declarative
code, letting the engine exploit as much of the data and storage structure as it can to optimize
the distributed query behind the scenes. The goal is to free the user from worrying about the
distributed nature of the data and let them focus on the business use-case.
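The declarative style described above can be illustrated with a short sketch. It assumes a SparkSession named spark; the file name people.json and the columns name and age are hypothetical.

```scala
// Load a JSON file into a DataFrame and register it as a SQL view.
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

// Declarative query: the user states *what* they want; Spark's
// optimizer plans *how* to execute it across the cluster.
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()
```

The same query could be written with DataFrame operations (people.filter($"age" >= 18).select("name")); both forms go through the same optimizer.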
How does Spark SQL compare to its key competitors?
Let us consider Hive. Hive is slower, and it often requires complex custom user-defined
functions (UDFs) simply to extend its functionality. Unit testing in Hive can also present its own
challenges. However, Hive is more mature. If you already have an existing Hive database, there
are mechanisms in Spark to take advantage of the existing Hive table structures, even running
10 to 100 times faster than pure Hive.
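The Hive integration mentioned above is enabled when building the SparkSession. This is a minimal sketch; the table name sales and its columns are hypothetical, and it assumes Spark is configured to reach an existing Hive metastore.

```scala
import org.apache.spark.sql.SparkSession

// Enable Hive support so Spark can see tables defined in the
// existing Hive metastore.
val spark = SparkSession.builder()
  .appName("HiveExample")
  .enableHiveSupport()
  .getOrCreate()

// Query an existing Hive table directly with Spark's engine.
val totals = spark.sql(
  "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
totals.show()
```

The query text is unchanged from what you would run in Hive; only the execution engine differs.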
Impala, on the other hand, is an established C++ tool which tends to beat Spark in direct
performance benchmarks. It was built from scratch for more specific use cases than Spark's
general engine, so it is especially optimized for those tasks.
Besides Hive, there are other natively supported data sources, such as JSON and Parquet.
If a source is not native, it is still fairly simple to pull in an external library through Spark
packages and gain support for Avro, Redshift, CSV and a number of other data sources.
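Reading from these sources looks the same regardless of format. This is a minimal sketch, assuming a SparkSession named spark and the spark-avro package on the classpath; the file paths are hypothetical.

```scala
// Natively supported formats have dedicated reader methods.
val jsonDF    = spark.read.json("events.json")
val parquetDF = spark.read.parquet("events.parquet")

// Formats added via Spark packages are addressed by name
// through the generic format(...).load(...) API.
val avroDF = spark.read.format("avro").load("events.avro")
```

Each call returns a DataFrame, so downstream code is independent of the storage format.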
Spark SQL has three main components, which are briefly explained below.
• DataFrame: A Spark DataFrame is a distributed collection of data organized into named
columns that provides operations to filter, group or compute aggregates, and it can be used with