2. Big Data, Big Solutions

Raul Estrada and Isaac Ruiz

In Chapter 1, we answered the Why. In this chapter, we will answer the How. When you understand the Why, answering the How is only a matter of time.

This chapter covers the following topics:

  • Traditional vs. modern (big) data

  • SMACK in a nutshell

  • Spark, the engine

  • Mesos, the container

  • Akka, the model

  • Cassandra, the storage

  • Kafka, the broker

Traditional vs. Modern (Big) Data

Is time quantized? Is there a minimum, indivisible unit of time? To date, the only honest answer to these questions is “Nobody knows.” The only certain thing is that on a human scale, life doesn’t happen in batch mode.

Many systems are monitoring a continuous stream of events: weather events, GPS signals, vital signs, logs, device metrics…. The list is endless. The natural way to collect and analyze this information is as a stream of data.

Handling data as streams is the correct way to model this behavior, but until recently, it was very difficult to do well. Message rates used to be in the range of thousands of messages per second; the new technologies discussed in this book can deliver rates of millions of messages per second.

The point is this: streaming data is not a matter for very specialized computer science projects; stream-based data is becoming the rule for data-driven companies.

Table 2-1 compares the three approaches: traditional data, traditional big data, and modern big data.

Table 2-1. Traditional Data, Traditional Big Data, and Modern Big Data Approaches

| Concept | Traditional Data | Traditional Big Data | Modern Big Data |
| --- | --- | --- | --- |
| Person | IT oriented | IT oriented | Business oriented |
| Roles | Developer | Data engineer, data architect | Business user, data scientist |
| Data sources | Relational, files, message queues | Relational, files, message queues, data services | Relational, files, message queues, data services, NoSQL |
| Data processing | Application server, ETL | Application server, ETL, Hadoop | Application server, ETL, Hadoop, Spark |
| Metadata | Limited by IT | Limited by model | Automatically generated, context enriched, business oriented, dictionary based |
| User interface | Self-made; developer skills required | Self-made; developer skills required | Self-made; built by business users; tools guided |
| Use cases | Data migration, data movement, replication | Data lakes, data hubs, data warehouse offloading | Self-service, Internet of Things, Data as a Service |
| Open source technologies | Fully embraced | Minimal | TCO rules |
| Tools maturity | High (enterprise) | Medium (enterprise) | Low (evolving) |
| Business agility | Low | Medium | Extremely high |
| Automation level | Low | Medium | High |
| Governance | IT governed | Business governed | End-user governed |
| Problem resolution | Solved by IT personnel | Solved by IT personnel | Timely or die |
| Collaboration | Medium | Low | Extremely high |
| Productivity/time to market | Slower | Slower | Highly productive; faster time to market |
| Integration analysis | Minimal | Medium | Modeled by analytical transformations |
| Real time | Minimal real time | Minimal real time | In real time or die |
| Data access | Primarily batch | Batch | Micro batch |

Modern technologies and architectures allow you to build systems more easily and efficiently, and to produce a better model of the way business processes take place. We will explain the real value of a streaming architecture. The possibilities are vast.

Apache Spark is not a replacement for Hadoop. Spark is a computing engine, whereas Hadoop is a complete stack for storage, cluster management, and computing tools. Spark runs well on top of Hadoop.

Hadoop is a ten-year-old technology. Today, we see the rise of many deployments that are not on Hadoop, including deployments on NoSQL stores (like Cassandra) and deployments directly against cloud storage (e.g., Amazon S3). In this respect, Spark is reaching a broader audience than Hadoop.

SMACK in a Nutshell

If you poll several IT people, they will agree on a few things, including that we are always searching for a new acronym.

SMACK, as you already know, stands for Spark, Mesos, Akka, Cassandra, and Kafka. They are all open source technologies, and all are Apache software projects except Akka. The SMACK acronym was coined by Mesosphere, a company that, in collaboration with Cisco, bundled these technologies together in a product called Infinity, which was designed to solve big data challenges where streaming is fundamental.1

A big data architecture is required in the daily operations of many companies, but there are a lot of sources that talk about each technology separately.

Let’s discuss the full stack and how to integrate its pieces.

This book is a cookbook on how to integrate each technology in the most successful big data stack. We talk about the five main concepts of big data architecture and how to integrate/replace/reinforce every technology:

  • Spark: The engine

  • Mesos: The container

  • Akka: The model

  • Cassandra: The storage

  • Kafka: The message broker

Figure 2-1 represents the reference diagram for the whole book.

Figure 2-1. SMACK at a glance

Apache Spark vs. MapReduce

MapReduce is a programming model for processing large data sets with a parallel and distributed algorithm on a cluster.

As we will see later, in functional programming there are two basic methods: map(), which is dedicated to filtering and sorting, and reduce(), which is dedicated to performing an operation on the results. As an example, to serve a group of people at a service window, you must first queue them (map) and then attend to them (reduce).
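
As a minimal sketch of the two methods in Scala (the order amounts and the 16% tax rate are illustrative assumptions):

    // map() transforms every element of a collection;
    // reduce() combines all elements into a single result.
    val orders  = List(12.5, 8.0, 30.0)                // hypothetical order amounts
    val withTax = orders.map(amount => amount * 1.16)  // map: transform each element
    val total   = withTax.reduce((a, b) => a + b)      // reduce: aggregate to one value
    println(total)                                     // 58.58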

The term MapReduce was coined in 1995, when the Message Passing Interface was used to solve programming issues, as we will discuss later. Obviously, when Google made its implementation, it had only one use case in mind: web search.

It is important to note that Hadoop was born in 2006 and grew up in an environment where MapReduce reigned. MapReduce was born with two characteristics that marked its life: high latency and batch mode; both make it incapable of meeting modern challenges.

As you can see in Table 2-2, Spark is different.

Table 2-2. Apache Spark vs. MapReduce Comparison

| Concept | Apache Spark | MapReduce |
| --- | --- | --- |
| Written in | Scala/Akka | Java |
| Languages supported | Java, Scala, Python, and R are first-class citizens. | Everything must be written in Java. |
| Storage model | Keeps things in memory. | Keeps things on disk; writing to disk and reading back takes a long time, making it slow and laborious. |
| I/O model | Keeps things in memory without I/O; operates on the same data quickly. | Requires a lot of I/O activity against disk. |
| Recovery | Reruns the same task in seconds or minutes; restart is not a problem. | Records everything on disk, allowing restart after failure. |
| Knowledge | The abstraction is high; coding is intuitive. | You can write MapReduce jobs intelligently to avoid overusing resources, but that requires specialized knowledge of the platform. |
| Focus | Code describes how to process data; implementation details are hidden. | Apache Hive programming goes into code to avoid running too many MapReduce jobs. |
| Efficiency | Abstracts all the implementation to run it as efficiently as possible. | Programmers write complex code to optimize each MapReduce job. |
| Abstraction | Abstracts things like a good high-level programming language; a powerful and expressive environment. | Code is hard to maintain over time. |
| Libraries | Adds libraries for machine learning, streaming, graph manipulation, and SQL. | Programmers need third-party tools and libraries, which makes work complex. |
| Streaming | Real-time stream processing out of the box. | Frameworks like Apache Storm are needed, adding complexity. |
| Source code size | Scala programs have dozens of lines of code (LOC). | Java programs have hundreds of LOC. |
| Machine learning | Spark MLlib | You have to separately integrate Mahout, H2O, or Onyx, learn how it works, and build it in. |
| Graphs | Spark GraphX | You have to select from Giraph, TitanDB, Neo4J, or other technologies; integration is not seamless. |

Apache Spark has these advantages:

  • Spark speeds up application development by 10 to 100 times, making applications portable and extensible.

  • Scala interoperates with Java code. Java code can be rewritten in Scala in a much smaller form that is much easier to read, repurpose, and maintain.

  • When the Apache Spark core is improved, all the machine learning and graphs libraries are improved too.

  • Integration is easier: the applications are easier to maintain and costs go down.

If an enterprise bets on one foundation, Spark is the best choice today.

Databricks (a company founded by the Apache Spark creators) lists the following use cases for Spark:

  • ETL and data integration

  • Business intelligence and interactive analytics

  • Advanced analytics and machine learning

  • Batch computation for high performance

  • Real-time stream processing

Some of the new use cases are just old use cases done faster, while others are totally new. There are some scenarios that simply can’t be done with acceptable performance on MapReduce.

The Engine

It is important to recall that Spark is better at OLAP (online analytical processing), that is, batch jobs and data mining. Spark is not suitable for OLTP (online transaction processing), such as numerous atomic transactions; for that type of processing, we strongly recommend Erlang (a beautiful language inspired by the actor model).

Apache Spark has five main components:

  • Spark Core

  • Spark SQL

  • Spark Streaming

  • Spark MLlib

  • Spark GraphX

Each Spark library typically has an entire book dedicated to it. In this book, we simply try to tackle the Apache Spark essentials needed for the SMACK stack.

The role of Apache Spark on the SMACK stack is to act as the processor and provide real-time data analysis. It addresses the aggregation and analysis layers.
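
As a minimal sketch of the engine at work, here is the classic word count written with Spark’s RDD API in Scala; the input file events.log is a hypothetical example, and the code assumes Spark (1.3 or later) is on the classpath:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Run locally, using as many worker threads as logical cores.
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
        val sc = new SparkContext(conf)

        val counts = sc.textFile("events.log")        // hypothetical input file
          .flatMap(line => line.split("\\s+"))        // map: split lines into words
          .map(word => (word, 1))                     // pair each word with a count
          .reduceByKey(_ + _)                         // reduce: sum counts per word

        counts.take(10).foreach(println)
        sc.stop()
      }
    }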

There are few open source alternatives to Spark. As we’ve mentioned, Apache Hadoop is the classic approach. The strongest modern adversary is the Apache Flink project, which is good to keep in mind.

The Model

Akka is a model, a toolkit, and a runtime for building distributed, resilient, and highly concurrent message-driven applications on the Java virtual machine. The Akka toolkit was released as open source in 2009. Language bindings exist for both Java and Scala. We need to analyze Akka first in order to understand the Spark architecture. Akka was designed around the actor concurrency model, with the following characteristics (a short code sketch follows the list):

  • Actors are arranged hierarchically

  • Asynchronous message (data) passing

  • Fault tolerance

  • Customizable failure and detection strategies

  • Hierarchical supervision

  • Adaptive, predictive behavior

  • Parallelism

  • Load balancing
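
As a minimal sketch of asynchronous message passing, using Akka’s classic (untyped) actor API in Scala; the Greeter actor and its message are illustrative assumptions, and the code assumes Akka 2.4 on the classpath:

    import akka.actor.{Actor, ActorSystem, Props}

    // A minimal actor: it reacts to any String message it receives.
    class Greeter extends Actor {
      def receive: Receive = {
        case name: String => println(s"Hello, $name")
      }
    }

    object HelloAkka extends App {
      val system  = ActorSystem("demo")
      val greeter = system.actorOf(Props[Greeter], "greeter")
      greeter ! "SMACK"  // '!' sends the message asynchronously (fire-and-forget)
      Thread.sleep(500)  // crude wait so the actor can process before shutdown
      system.terminate()
    }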

There are many Akka competitors; we make special mention of Reactor. The actor model is the foundation of many frameworks and languages. The main languages based on the actor model (all of them functional languages) are Lisp, Scheme, Erlang, Haskell, and more recently, Scala, Clojure, F#, and Elixir (a modern implementation of Erlang).

The Broker

Apache Kafka is a publish/subscribe message broker redesigned as a distributed commit log. In SMACK, Kafka is the data ingestion point, mainly on the application layer. Kafka takes data from applications and streams and feeds it into the rest of the stack. Kafka is a distributed messaging system with high throughput; it handles massive data loads and floods. It is the valve that regulates the pressure.

Apache Kafka inspects incoming data volume, which is fundamental for partitioning and distribution among the cluster nodes. Apache Kafka’s features include the following (a short producer sketch follows the list):

  • Automatic broker failover

  • Very high-performance distributed messaging

  • Partitioning and distribution across the cluster nodes

  • Data pipeline decoupling

  • Support for a massive number of consumers

  • Massive data load handling
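
As a minimal sketch of data ingestion through Kafka, here is a producer written in Scala against the standard Kafka client API; the broker address localhost:9092 and the topic events are illustrative assumptions:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object EventProducer extends App {
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")  // assumed local broker
      props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer")

      val producer = new KafkaProducer[String, String](props)
      // Topic "events" is hypothetical; the key routes the record to a partition.
      producer.send(new ProducerRecord("events", "device-42", "temperature=21.3"))
      producer.close()
    }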

Kafka is the champion among a lot of competitors in MOM (message-oriented middleware). The MQ family includes ActiveMQ, ZeroMQ, IronMQ, and RabbitMQ; the best of them is RabbitMQ, which is written in Erlang.

The strongest alternative to Kafka is Apache Storm, which has a lot of integration with Apache Hadoop; keep it in mind. Apache Kafka is here to stay.

The Storage

Apache Cassandra is a distributed database. It is the perfect choice when you need to scale and need very high availability with no sacrifice in performance. Cassandra was originally developed at Facebook in 2008 to handle large amounts of data, and it became a top-level Apache project in 2010. Cassandra handles the stack’s operational data; it can also be used to expose data to the application layer. A short code sketch follows the feature list below.

The following are the main features of Apache Cassandra:

  • Extremely fast and scalable

  • Multi data center, no single point of failure

  • Survives the failure of multiple nodes

  • Easy to operate

  • Flexible data modeling

  • Automatic and configurable replication

  • Ideal for real-time ingestion

  • Has a great Apache-based community
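
As a minimal sketch of the storage layer, here is a round trip through Cassandra from Scala using the DataStax Java driver; the smack keyspace, the readings table, and the local contact point are illustrative assumptions:

    import com.datastax.driver.core.Cluster

    object CassandraQuickstart extends App {
      // Connect to a single local node; production clusters list several contact points.
      val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
      val session = cluster.connect()

      session.execute(
        "CREATE KEYSPACE IF NOT EXISTS smack WITH replication = " +
        "{'class': 'SimpleStrategy', 'replication_factor': 1}")
      session.execute(
        "CREATE TABLE IF NOT EXISTS smack.readings (device text PRIMARY KEY, temp double)")
      session.execute(
        "INSERT INTO smack.readings (device, temp) VALUES ('device-42', 21.3)")

      val row = session.execute(
        "SELECT temp FROM smack.readings WHERE device = 'device-42'").one()
      println("temp = " + row.getDouble("temp"))

      cluster.close()
    }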

There are a lot of Cassandra competitors, including DynamoDB (powered by Amazon; it’s contending in the NoSQL battlefield), Apache HBase (the best-known database implementation of Hadoop), Riak (made by the Basho samurais; it’s a powerful Erlang database), CouchBase, Apache CouchDB, MongoDB, Cloudant, and Redis.

The Container

Apache Mesos is a distributed systems kernel that is easy to build and effective to run. Mesos is an abstraction layer over all computer resources (CPU, memory, storage) on the machines (physical or virtual) in a cluster, enabling elastic distributed systems and fault tolerance. Mesos follows the Linux kernel principles, only at a higher level of abstraction. It was first presented as Nexus in 2009, and in 2011 it was relaunched by Matei Zaharia under its current name. Mesos is the base of three frameworks:

  • Apache Aurora

  • Chronos

  • Marathon

In SMACK, Mesos orchestrates the components and manages resources. It is the secret to horizontal cluster scaling. Apache Mesos is usually compared with Kubernetes (a competitor, used by the Google Cloud Platform) or with Docker (as you will see, it is more a complement to Mesos than a competitor). The Hadoop equivalent of Mesos is Apache YARN.

Summary

This chapter, like the previous one, was full of theory. We reviewed the fundamental SMACK diagram, as well as Spark’s advantages over traditional big data technologies such as Hadoop and MapReduce. We also visited each technology in the SMACK stack, briefly presented each tool’s potential, and, most importantly, discussed the actual alternatives to each technology. The upcoming chapters go into greater depth on each of these technologies. We will explore the connectors, integration practices, and linking techniques, as well as describe alternatives for every situation.
