Chapter 2. Building Batch and Streaming Apps with Spark

The objective of the book is to teach you about PySpark and the PyData libraries by building an app that analyzes the Spark community's interactions on social networks. We will gather information on Apache Spark from GitHub, check the relevant tweets on Twitter, and get a feel for the buzz around Spark in the broader open source software communities using Meetup.

In this chapter, we will outline the various sources of data and information, get an understanding of their structure, and describe the data processing pipeline, from collection through batch and streaming processing.

In this section, we will cover the following points:

  • Outline data processing pipelines from collection to batch and stream processing, effectively depicting the architecture of the app we are planning to build.
  • Check out the various data sources (GitHub, Twitter, and Meetup), their data structures (JSON, structured information, unstructured text, geo-location, time series data, and so on), and their complexities. We also discuss the tools to connect to the three different APIs, so you can build your own data mashups; a minimal connection sketch follows this list. The book will focus on Twitter in the following chapters.
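
As a first taste of what connecting to one of these APIs looks like, here is a minimal sketch that queries GitHub's public REST API with the requests library. The repository and fields shown are illustrative choices rather than the exact calls made later in the book, and sustained usage would require an access token.

    import requests

    # Query the public GitHub REST API (v3) for metadata on the apache/spark repository.
    # Unauthenticated requests are rate limited; pass a personal access token for heavier use.
    response = requests.get(
        'https://api.github.com/repos/apache/spark',
        headers={'Accept': 'application/vnd.github.v3+json'},
    )
    repo = response.json()

    print(repo['full_name'])           # 'apache/spark'
    print(repo['stargazers_count'])    # number of stars at query time
    print(repo['open_issues_count'])   # open issues and pull requests

The same pattern of an authenticated HTTP request returning JSON applies to the Twitter and Meetup APIs, each with its own endpoints and rate limits.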

Architecting data-intensive apps

We defined the data-intensive app framework architecture blueprint in the previous chapter. Let's put the various software components we are going to use throughout the book back into the context of our original framework. Here's an illustration of the software components mapped onto the data-intensive architecture framework:


Spark is an extremely efficient, distributed computing framework. In order to exploit its full power, we need to architect our solution accordingly. For performance reasons, the overall solution also needs to be aware of its CPU, storage, and network usage.

These imperatives drive the architecture of our solution:

  • Latency: This architecture combines slow and fast processing. Slow processing is performed on historical data in batch mode; this is also called data at rest, and it is essentially batch processing with a longer latency. This phase builds precomputed models and data patterns that will be used by the fast processing arm once live, continuous data is fed into the system. Fast processing, or real-time analysis of streaming data, handles data in motion: the streaming computation of data ingested in real time.
  • Scalability: Spark is natively linearly scalable through its distributed in-memory computing framework. Databases and data stores interacting with Spark also need to be able to scale linearly as data volume grows.
  • Fault tolerance: When a failure occurs due to hardware, software, or network problems, the architecture should be resilient enough to provide availability at all times.
  • Flexibility: The data pipelines put in place in this architecture can be adapted and retrofitted very quickly depending on the use case.

Spark is unique as it allows batch processing and streaming analytics on the same unified platform.
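
To make this concrete, here is a minimal sketch, assuming the Spark 1.x APIs used throughout this book, in which a single SparkContext backs both a batch job over a historical dump and a StreamingContext over a live socket; the file path, host, and port are placeholders.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName='UnifiedBatchAndStreaming')

    # Batch arm: data at rest, processed from a historical dump (placeholder path).
    historical = sc.textFile('hdfs:///data/tweets_archive.jsonl')
    print('archived records: %d' % historical.count())

    # Streaming arm: data in motion, processed in 10-second micro-batches from a
    # live socket (placeholder host and port); the same SparkContext drives both arms.
    ssc = StreamingContext(sc, batchDuration=10)
    live = ssc.socketTextStream('localhost', 9999)
    live.count().pprint()

    ssc.start()
    ssc.awaitTermination()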

We will consider two data processing pipelines:

  • The first one handles data at rest and is focused on putting together the pipeline for batch analysis of the data
  • The second one, data in motion, targets real-time data ingestion and delivering insights based on precomputed models and data patterns

Processing data at rest

Let's get an understanding of the data at rest or batch processing pipeline. The objective in this pipeline is to ingest the various datasets from Twitter, GitHub, and Meetup; prepare the data for Spark MLlib, the machine learning engine; and derive the base models that will be applied for insight generation in batch mode or in real time.
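
A minimal PySpark sketch of such a batch pipeline might look like the following; the input and output paths, and the choice of HashingTF features feeding a K-means model, are illustrative assumptions rather than the exact pipeline we will build.

    import json
    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName='BatchAtRest')

    # Ingest: harvested tweets stored as one JSON record per line (placeholder path).
    tweets = sc.textFile('hdfs:///data/spark_tweets.jsonl') \
               .map(json.loads) \
               .map(lambda record: record.get('text', ''))

    # Prepare: turn raw tweet text into fixed-size term-frequency vectors for MLlib.
    hashing_tf = HashingTF(numFeatures=1000)
    features = hashing_tf.transform(tweets.map(lambda text: text.split()))
    features.cache()

    # Model: derive a base model that batch reports and the streaming arm can reuse.
    model = KMeans.train(features, k=5, maxIterations=10)
    model.save(sc, 'hdfs:///models/tweet_clusters')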

The following diagram illustrates the data pipeline in order to enable processing data at rest:


Processing data in motion

Processing data in motion introduces a new level of complexity, as it adds new possibilities of failure. If we want to scale, we need to consider bringing in a distributed message queue system such as Kafka. We will dedicate a subsequent chapter to understanding streaming analytics.
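
As an illustration, here is a minimal sketch of the streaming arm consuming from Kafka, assuming the Spark 1.x spark-streaming-kafka integration available when this book was written; the broker address and topic name are placeholders.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName='StreamingInMotion')
    ssc = StreamingContext(sc, batchDuration=10)

    # Consume the live feed from a Kafka topic (placeholder broker and topic names);
    # Kafka decouples the collectors from Spark and buffers data if a consumer falls behind.
    stream = KafkaUtils.createDirectStream(
        ssc, ['spark_tweets'], {'metadata.broker.list': 'localhost:9092'})

    # Each Kafka record arrives as a (key, value) pair; keep the value (the tweet JSON).
    tweets = stream.map(lambda kv: kv[1])
    tweets.count().pprint()

    ssc.start()
    ssc.awaitTermination()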

The following diagram depicts a data pipeline for processing data in motion:


Exploring data interactively

Building a data-intensive app is not as straightforward as exposing a database to a web interface. During the setup of both the data at rest and the data in motion processing, we will capitalize on Spark's ability to analyze data interactively and to refine the richness and quality of the data required for the machine learning and streaming activities. Here, we will go through an iterative cycle of data collection, refinement, and investigation in order to get to the dataset of interest for our apps.
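
This iterative exploration typically takes place in the PySpark shell, where sc is predefined, along the lines of the following minimal sketch; the file path and record fields are placeholders for whichever harvested dataset we are investigating.

    import json

    # Load the harvested records (placeholder path) and parse each line as JSON.
    tweets = sc.textFile('data/spark_tweets.jsonl').map(json.loads)

    # First pass: how much data do we have, and what does a record look like?
    print(tweets.count())
    print(tweets.first())

    # Refine: keep only English tweets that mention Spark, then eyeball a sample.
    relevant = tweets.filter(lambda t: t.get('lang') == 'en') \
                     .filter(lambda t: 'spark' in t.get('text', '').lower())
    for t in relevant.take(5):
        print(t['text'])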
