Before we start this chapter, it is important that we discuss some trends that directly affect how we develop applications.
Big data applications can be divided into the following three categories:
- Batch
- Interactive
- Streaming or continuous applications
When Hadoop was designed, the primary focus was to provide cost-effective storage for large amounts of data. This remained the main show until it was upended by S3 and other cheaper and more reliable cloud storage alternatives. Compute on these large amounts of data in the Hadoop environment came primarily in the form of MapReduce jobs. When Spark took the ball from Hadoop (OK, snatched it!) and started running with it, it too reflected a batch-oriented focus in its initial phase, though it did a far better job than Hadoop of exploiting in-memory storage.
All these trends were good, but end users still treated these platforms as a sideshow (siloed applications somewhere in the dark alleys of the enterprise) while keeping their low-latency BI/query platforms, running on traditional databases, as the main show. Databases have a three-decade head start, so it is understandable that their stacks have been optimized to the last bit for performance.
During the same time, analytics itself was going through its own transformation. Analytics, which was mostly descriptive (also known as business intelligence (BI)/reporting), has evolved into predictive and prescriptive stages. This means that even if Spark had evolved into an engine for traditional BI/dashboarding, it would not have been enough. In the Spark ecosystem, notebooks fill this gap.
Notebooks provide an interactive playground where you can run queries in multiple languages (SQL, Python, and Scala), run machine learning jobs, and schedule jobs to run at a certain time. There are two types of notebooks on the market:
- Open source offerings: Zeppelin and Jupyter
- Commercial XaaS offerings: Databricks Cloud
In this chapter, we will start with the Spark shell, the lightweight interactive shell that this book focuses on. We will also cover combinations of Maven/SBT for builds and Eclipse/IntelliJ IDEA for IDE purposes. In the end, we will explore the notebook offerings.
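As a small taste of what is to come, here is a minimal sketch of a Spark shell session. It assumes Spark 2.x or later is installed and `spark-shell` has been launched, which pre-creates a `SparkSession` named `spark`; the specific computation is illustrative only.

```scala
// Inside spark-shell, a SparkSession is already available as `spark`.
val nums = spark.range(1, 101)      // Dataset[Long] with ids 1 through 100
nums.selectExpr("sum(id)").show()   // displays the sum, 5050, as a one-row table
```

The same few lines could be pasted into a Zeppelin or Jupyter notebook cell, which is precisely the interactive workflow this chapter builds toward.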