Before we start this chapter, it is important that we discuss some trends that directly affect how we develop applications.
Big data applications can be divided into the following three categories:
- Batch
- Interactive
- Streaming or continuous applications
When Hadoop was designed, the primary focus was to provide cost-effective storage for large amounts of data. This remained the main show until it was upended by S3 and other cheaper and more reliable cloud storage alternatives. Compute on these large amounts of data in the Hadoop environment came primarily in the form of MapReduce jobs. When Spark took the ball from Hadoop (OK, snatched it!) and started running with it, it too reflected a batch-oriented focus in its initial phase, though it did a far better job than Hadoop of exploiting in-memory storage.
All these trends were good, but end users still treated these platforms as a sideshow (siloed applications somewhere in the dark alleys of the enterprise) while keeping their low-latency BI/query platforms, running on traditional databases, as the main show. Databases have a three-decade head start, so it is understandable that their stacks have been optimized to the last bit for performance.
During the same time, analytics itself was going through its own transformation. Analytics, which was mostly descriptive (also known as business intelligence (BI)/reporting), has evolved into predictive and prescriptive stages. This means that even if Spark had evolved into an engine for traditional BI/dashboarding, it would not have been enough. In the Spark ecosystem, notebooks fill this gap.
Notebooks provide an interactive playground where you can run queries in multiple languages (SQL, Python, and Scala), run machine learning jobs, and schedule jobs to run at a certain time. There are two types of notebooks on the market:
- Open source offerings: Zeppelin and Jupyter
- Commercial XaaS offerings: Databricks Cloud
In this chapter, we will start with the Spark shell, the lightweight interactive shell that this book focuses on. We will also cover combinations of Maven/SBT for builds and Eclipse/IntelliJ IDEA for IDE purposes. In the end, we will explore the notebook offerings.
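As a small taste of what is to come, here is a minimal sketch of a Spark shell session. It assumes Spark 2.x or later is installed and `spark-shell` has been launched, which pre-creates a `SparkSession` named `spark`; the specific computation is illustrative only.

```scala
// Inside spark-shell, a SparkSession is already available as `spark`.
val nums = spark.range(1, 101)      // Dataset[Long] with ids 1 through 100
nums.selectExpr("sum(id)").show()   // displays the sum, 5050, as a one-row table
```

The same few lines could be pasted into a Zeppelin or Jupyter notebook cell, which is precisely the interactive workflow this chapter builds toward.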