Chapter 3. Juggling Data with Spark

As per the batch and streaming architecture laid out in the previous chapter, we need data to fuel our applications. We will harvest data about Apache Spark from Twitter. The objective of this chapter is to prepare the data for further use by the machine learning and streaming applications. This chapter focuses on how to exchange code and data across the distributed network, and we will get practical insights into serialization, persistence, marshaling, and caching. We will get to grips with Spark SQL, the key Spark module for interactively exploring structured and semi-structured data. The fundamental data structure powering Spark SQL is the Spark dataframe, which is inspired by the Python Pandas dataframe and the R dataframe. It is a powerful data structure, well understood and appreciated by data scientists with a background in R or Python.

In this chapter, we will cover the following points:

  • Connect to Twitter, collect the relevant data, and persist it in various formats, such as JSON and CSV, and in data stores such as MongoDB (a minimal connection sketch follows this list)
  • Analyze the data using Blaze and Odo, a spin-off library of Blaze, in order to connect to and transfer data between various sources and destinations
  • Introduce Spark dataframes as the foundation for data interchange between the various Spark modules, and explore data interactively using Spark SQL
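
To make the first point concrete, here is a minimal sketch of connecting to Twitter and persisting harvested tweets as JSON. It assumes the tweepy library with placeholder credentials, and the exact client call (here, api.search) depends on the tweepy version; the chapter builds up the full harvesting pipeline step by step.

    import json
    import tweepy

    # Placeholder credentials: substitute the keys of your own Twitter app.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth)

    # Collect a handful of tweets mentioning Apache Spark and keep the raw
    # JSON, one tweet per line, for later refinement and persistence.
    with open("spark_tweets.json", "w") as f:
        for tweet in api.search(q="Apache Spark", count=10):
            f.write(json.dumps(tweet._json) + "\n")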

Revisiting the data-intensive app architecture

Let's first put the focus of this chapter in context with respect to the data-intensive app architecture. We will concentrate our attention on the integration layer and essentially run through iterative cycles of acquisition, refinement, and persistence of the data. This cycle is termed the five Cs, which stand for connect, collect, correct, compose, and consume; they are the essential processes we run through in the integration layer in order to get data of the right quality and quantity from Twitter. We will also delve deeper into the persistence layer and set up a data store, such as MongoDB, to collect our data for later processing.
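
As a taste of the persistence layer, the following is a minimal sketch of storing a collected tweet in MongoDB with the pymongo driver. The database and collection names (twtr_db, tweets) are illustrative choices, and a MongoDB server is assumed to be running locally on the default port.

    from pymongo import MongoClient

    # Connect to a local MongoDB instance (assumed to be listening on 27017).
    client = MongoClient("localhost", 27017)
    collection = client["twtr_db"]["tweets"]

    # Persist one tweet (a plain Python dict) and read it back
    # to verify the round trip.
    tweet = {"id": 1, "user": "spark_fan", "text": "Learning Apache Spark"}
    collection.insert_one(tweet)
    print(collection.find_one({"user": "spark_fan"}))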

We will explore the data with Blaze, a Python library for data manipulation, and Spark SQL, the interactive module of Spark for data discovery powered by the Spark dataframe. The dataframe paradigm is shared by Python Pandas, Python Blaze, and Spark SQL. We will get a feel for the nuances of the three dataframe flavors.
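
The following sketch hints at how the flavors relate: a small dataset is built as a Pandas dataframe, converted to a Spark dataframe, and then queried through Spark SQL. It uses the Spark 2.x SparkSession entry point (earlier releases expose the same idea through SQLContext), and the data and names are illustrative.

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dataframe_flavours").getOrCreate()

    # The same tabular data, first as a Pandas dataframe ...
    pdf = pd.DataFrame({"user": ["alice", "bob"], "tweets": [42, 7]})

    # ... then as a Spark dataframe, registered as a table and queried in SQL.
    sdf = spark.createDataFrame(pdf)
    sdf.createOrReplaceTempView("tweet_counts")
    spark.sql("SELECT user, tweets FROM tweet_counts WHERE tweets > 10").show()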

The following diagram sets the context of the chapter's focus, highlighting the integration layer and the persistence layer:

[Figure: Revisiting the data-intensive app architecture, with the integration and persistence layers highlighted]