In Chapter 1, we answered the Why. In this chapter, we answer the How. Once you understand the Why, arriving at the How is only a matter of time.
This chapter covers the following topics:
Traditional vs. modern (big) data
SMACK in a nutshell
Spark, the engine
Mesos, the container
Akka, the model
Cassandra, the storage
Kafka, the broker
Traditional vs. Modern (Big) Data
Is time quantized? Is there an indivisible unit of time that cannot be divided further? Until now, the honest answer to these questions has been “Nobody knows.” The only certainty is that on a human scale, life doesn’t happen in batch mode.
Many systems are monitoring a continuous stream of events: weather events, GPS signals, vital signs, logs, device metrics…. The list is endless. The natural way to collect and analyze this information is as a stream of data.
Handling data as streams is the correct way to model this behavior, but until recently, this methodology was very difficult to do well. Message rates used to be in the range of thousands of messages per second; the new technologies discussed in this book deliver rates of millions of messages per second.
The point is this: streaming data is no longer the preserve of very specialized computer science projects; stream-based data is becoming the rule for data-driven companies.
Table 2-1 compares the three approaches: traditional data, traditional big data, and modern big data.
Table 2-1. Traditional Data, Traditional Big Data, and Modern Big Data Approaches
| Concept | Traditional Data | Traditional Big Data | Modern Big Data |
| --- | --- | --- | --- |
| Person | IT oriented | IT oriented | Business oriented |
| Roles | Developer; data architect | Data engineer; data scientist | Business user |
| Data sources | Relational; files; message queues | Relational; files; message queues; data service | Relational; files; message queues; data service; NoSQL |
| Data processing | Application server; ETL | Application server; ETL; Hadoop | Application server; ETL; Hadoop; Spark |
| Metadata | Limited by IT | Limited by model | Automatically generated; context enriched; business oriented; dictionary based |
| User interface | Self-made; developer skills required | Self-made; developer skills required | Self-made; built by business users; tools guided |
| Use cases | Data migration; data movement; replication | Data lakes; data hubs; data warehouse offloading | Self-service; Internet of Things; Data as a Service |
| Open source technologies | Fully embraced | Minimal | TCO rules |
| Tools maturity | High; enterprise | Medium; enterprise | Low; evolving |
| Business agility | Low | Medium | Extremely high |
| Automation level | Low | Medium | High |
| Governance | IT governed | Business governed | End-user governed |
| Problem resolution | Solved by IT personnel | Solved by IT personnel | Timely or die |
| Collaboration | Medium | Low | Extremely high |
| Productivity / time to market | Slower | Slower | Highly productive; faster time to market |
| Integration analysis | Minimal | Medium | Modeled by analytical transformations |
| Real time | Minimal real time | Minimal real time | In real time or die |
| Data access | Primarily batch | Batch | Micro batch |
Modern technologies and architectures allow you to build systems more easily and efficiently, and to produce a better model of the way business processes take place. We will explain the real value of a streaming architecture. The possibilities are vast.
Apache Spark is not a replacement for Hadoop. Spark is a computing engine, whereas Hadoop is a complete stack for storage, cluster management, and computing tools. Spark runs well over Hadoop.
Hadoop is a ten-year-old technology. Today, we see many new deployments that are not on Hadoop, including deployments on NoSQL stores (like Cassandra) and deployments directly against cloud storage (e.g., Amazon S3). In this respect, Spark is reaching a broader audience than Hadoop.
SMACK in a Nutshell
If you poll several IT people, most will agree on a few things, including that we are always searching for a new acronym.
SMACK, as you already know, stands for Spark, Mesos, Akka, Cassandra, and Kafka. They are all open source technologies, and all are Apache software projects except Akka. The SMACK acronym was coined by Mesosphere, a company that, in collaboration with Cisco, bundled these technologies in a product called Infinity, designed to solve big data challenges where streaming is fundamental.1
Big data architecture is required in the daily operation of many companies, but most sources discuss each technology separately. Let’s discuss the full stack and how to integrate its pieces.
This book is a cookbook on how to integrate each technology into the most successful big data stack. We talk about the five main concepts of big data architecture and how to integrate/replace/reinforce every technology:
Spark: The engine
Mesos: The container
Akka: The model
Cassandra: The storage
Kafka: The message broker
Figure 2-1 represents the reference diagram for the whole book.
Figure 2-1. SMACK at a glance
Apache Spark vs. MapReduce
MapReduce is a programming model for processing large data sets with a parallel and distributed algorithm on a cluster.
As we will see later, functional programming has two basic methods: map(), which is dedicated to filtering and sorting, and reduce(), which is dedicated to performing an operation on the results. As an analogy, to serve a group of people at a service window, you must first queue them (map) and then attend to them (reduce).
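The queue-and-serve analogy can be sketched in a few lines of Python; the customer names and the one-minute-per-letter service time are invented purely for illustration:

```python
from functools import reduce

# Hypothetical queue of customers (names invented for the example).
customers = ["Ana", "Bob", "Carla", "Dan"]

# map(): "queue" each person by turning them into a unit of work -- here,
# a service time in minutes (arbitrarily, one minute per letter of the name).
service_times = list(map(len, customers))          # [3, 3, 5, 3]

# reduce(): "attend" the queue by folding the units into a single result,
# the total time the service window stays busy.
total_minutes = reduce(lambda a, b: a + b, service_times)

print(total_minutes)  # 14
```

The same two-phase shape (transform every record independently, then aggregate) is exactly what MapReduce distributes across a cluster.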
The term MapReduce was coined in 1995, when the Message Passing Interface was used to solve programming issues, as we will discuss later. Obviously, when Google made its implementation, it had one main use case in mind: web search.
It is important to note that Hadoop was born in 2006 and grew up in an environment where MapReduce reigned. MapReduce was born with two characteristics that marked its life: high latency and batch mode; together, they make it incapable of meeting modern challenges.
As you can see in Table 2-2, Spark is different.
Table 2-2. Apache Spark /MapReduce Comparison
| Concept | Apache Spark | MapReduce |
| --- | --- | --- |
| Written in | Scala/Akka | Java |
| Languages supported | Java, Scala, Python, and R are first-class citizens. | Everything must be written in Java. |
| Storage model | Keeps things in memory. | Keeps things on disk. Takes a long time to write things to disk and read them back, making it slow and laborious. |
| I/O model | Keeps things in memory without I/O. Operates on the same data quickly. | Requires a lot of I/O activity against disk. |
| Recovery | Reruns the same task in seconds or minutes; restarting is not a problem. | Records everything on disk, allowing restart after failure. |
| Knowledge | The abstraction is high; coding is intuitive. | You can write MapReduce jobs intelligently, avoiding overuse of resources, but it requires specialized knowledge of the platform. |
| Focus | Code describes how to process data; implementation details are hidden. | Apache Hive programming goes into code to avoid running too many MapReduce jobs. |
| Efficiency | Abstracts the implementation to run it as efficiently as possible. | Programmers write complex code to optimize each MapReduce job. |
| Abstraction | Abstracts things like a good high-level programming language; it is a powerful and expressive environment. | Code is hard to maintain over time. |
| Libraries | Adds libraries for machine learning, streaming, graph manipulation, and SQL. | Programmers need third-party tools and libraries, which makes work complex. |
| Streaming | Real-time stream processing out of the box. | Frameworks like Apache Storm are needed, at increased complexity. |
| Source code size | Scala programs have dozens of lines of code (LOC). | Java programs have hundreds of LOC. |
| Machine learning | Spark MLlib | To do machine learning, you have to separately integrate Mahout, H2O, or Oryx, and learn how each works and how to build on it. |
| Graphs | Spark GraphX | To work with graph databases, you have to select from Giraph, TitanDB, Neo4j, or other technologies; integration is not seamless. |
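The in-memory, lazily evaluated pipeline that Table 2-2 attributes to Spark can be made concrete with a toy Python class of our own. This is not Spark’s API — just a sketch of the idea that transformations are recorded cheaply and executed only when an action asks for a result:

```python
class ToyRDD:
    """A toy stand-in for a Spark RDD: map/filter are recorded lazily,
    and nothing runs until an action (collect) is called."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []                      # pending transformations

    def map(self, f):                              # lazy: remember the function
        return ToyRDD(self._data, self._ops + [("map", f)])

    def filter(self, p):                           # lazy: remember the predicate
        return ToyRDD(self._data, self._ops + [("filter", p)])

    def collect(self):                             # action: run the pipeline in memory
        out = self._data
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

lengths = ToyRDD(["spark", "mesos", "akka"]).map(len).filter(lambda n: n > 4)
print(lengths.collect())  # [5, 5]
```

Because the pipeline stays in memory between steps, there is no disk round-trip between `map` and `filter` — the contrast with MapReduce’s write-to-disk-between-jobs model in the table above.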
Apache Spark has these advantages:
Spark speeds up application development by 10 to 100 times and makes applications portable and extensible.
Scala can read Java code. Java code can be rewritten in Scala in a much smaller form factor that is much easier to read, repurpose, and maintain.
When the Apache Spark core is improved, all the machine learning and graphs libraries are improved too.
Integration is easier: the applications are easier to maintain and costs go down.
If an enterprise bets on one foundation, Spark is the best choice today.
Databricks (a company founded by the Apache Spark creators) lists the following use cases for Spark:
ETL and data integration
Business intelligence and interactive analytics
Advanced analytics and machine learning
Batch computation for high performance
Real-time stream processing
Some of the new use cases are simply old use cases done faster, while others are totally new. There are scenarios that just can’t be done with acceptable performance on MapReduce.
The Engine
It is important to recall that Spark is better at OLAP (online analytical processing), that is, batch jobs and data mining. Spark is not suitable for OLTP (online transaction processing), such as numerous atomic transactions; for that type of processing, we strongly recommend Erlang (a beautiful language inspired by the actor model).
Apache Spark has five main components:
Spark Core
Spark SQL
Spark Streaming
Spark MLlib
Spark GraphX
Each Spark library typically has an entire book dedicated to it. In this book, we tackle only the Apache Spark essentials needed to meet the SMACK stack.
The role of Apache Spark on the SMACK stack is to act as the processor and provide real-time data analysis. It addresses the aggregation and analysis layers.
There are few open source alternatives to Spark. As we’ve mentioned, Apache Hadoop is the classic approach. The strongest modern rival is Apache Flink, a project worth keeping in mind.
The Model
Akka is a model, a toolkit, and a runtime for building distributed, resilient, highly concurrent, message-driven applications on the Java Virtual Machine. The Akka toolkit was released as open source in 2009. Language bindings exist for both Java and Scala. To understand the Spark architecture, we first need to analyze Akka. Akka was designed according to the actor concurrency model:
Actors are arranged hierarchically
Asynchronous message (data) passing
Fault tolerant
Customizable failure and detection strategies
Hierarchical supervision
Adaptive, predictive
Parallelized
Load balance
There are many Akka competitors; we make special mention of Reactor. The actor model is the foundation of many frameworks and languages. The main languages associated with it (most of them functional languages) are Lisp, Scheme, Erlang, and Haskell, and more recently Scala, Clojure, F#, and Elixir (a modern language on the Erlang virtual machine).
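The core of the actor model — private state plus a mailbox processed one message at a time — can be sketched in plain Python. Akka itself is a JVM toolkit; the class below is our own illustrative stand-in, not Akka’s API:

```python
import queue
import threading

class CounterActor:
    """A minimal actor: private state and a mailbox drained by a single
    thread, so messages are processed sequentially -- no locks needed."""

    def __init__(self):
        self.count = 0
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, msg):                  # asynchronous, fire-and-forget send
        self._mailbox.put(msg)

    def _run(self):
        while True:
            msg = self._mailbox.get()
            if msg == "stop":
                return
            self.count += msg             # state is touched by this thread only

actor = CounterActor()
for n in (1, 2, 3):
    actor.tell(n)                         # senders never block on the actor
actor.tell("stop")
actor._thread.join()
print(actor.count)  # 6
```

A real Akka actor adds the hierarchy, supervision, and failure-handling strategies listed above; the point here is only the mailbox-plus-sequential-processing core.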
The Broker
Apache Kafka is a publish/subscribe message broker rethought as a distributed commit log. In SMACK, Kafka is the data ingestion point, mainly on the application layer. Kafka takes data from applications and streams and moves it into the stack. Kafka is a distributed messaging system with high throughput; it handles massive data loads and floods. It is the valve that regulates the pressure.
Apache Kafka ingests the incoming data volume, which is fundamental for partitioning and distribution among the cluster nodes. Apache Kafka’s features include the following:
Automatic broker failover
Very high performance distributed messaging
Partitioning and distribution across the cluster nodes
Data pipeline decoupling
Support for a massive number of consumers
Massive data load handling
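The key-based partitioning behind these features can be sketched with an in-memory toy producer. The hash choice and the topic structure below are simplifications of ours, not Kafka’s actual implementation:

```python
import hashlib
from collections import defaultdict

def partition_for(key, num_partitions):
    """Pick a partition deterministically from the message key, the way a
    keyed producer does (the hash function here is only illustrative)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

topic = defaultdict(list)                 # partition id -> append-only log

def produce(key, value, num_partitions=3):
    p = partition_for(key, num_partitions)
    topic[p].append((key, value))         # append to that partition's log
    return p

# All messages with the same key land in the same partition, so per-key
# ordering is preserved even though partitions are consumed independently.
p1 = produce("sensor-a", "21.5")
p2 = produce("sensor-a", "21.7")
assert p1 == p2
```

Partitions are the unit of parallelism: each can live on a different broker and be read by a different consumer, which is how Kafka spreads a flood of messages across the cluster.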
Kafka is the champion among many competitors in message-oriented middleware (MOM). In the MQ family, these include ActiveMQ, ZeroMQ, IronMQ, and RabbitMQ. The most notable is RabbitMQ, which is written in Erlang.
The best-known alternative to Kafka is Apache Storm, which has a lot of integration with Apache Hadoop and is worth keeping in mind. Apache Kafka, however, is here to stay.
The Storage
Apache Cassandra is a distributed database. It is the perfect choice when you need to scale and need hyper-high availability with no sacrifice in performance. Cassandra was originally developed at Facebook in 2008 to handle large amounts of data, and it became a top-level Apache project in 2010. Cassandra handles the stack’s operational data. Cassandra can also be used to expose data to the application layer.
The following are the main features of Apache Cassandra:
Extremely fast and scalable
Multi data center, no single point of failure
Survives when multiple nodes fail
Easy to operate
Flexible data modeling
Automatic and configurable replication
Ideal for real-time ingestion
Has a great Apache based community
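Replica placement of the kind Cassandra performs can be sketched as a walk around a ring of nodes. This is heavily simplified — real Cassandra uses virtual nodes, configurable partitioners, and rack-aware strategies — and the node names are invented:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical cluster ring

def replicas(partition_key, replication_factor=3):
    """Map a key onto `replication_factor` distinct nodes by hashing it to a
    ring position and walking clockwise -- a simplified replica placement."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    start = h % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replication_factor)]

owners = replicas("user:42")
print(owners)   # three distinct nodes hold the row
```

With a replication factor of 3, the row survives two node failures, and any node can compute the owners locally from the key — no central lookup, hence no single point of failure.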
There are a lot of Cassandra competitors, including DynamoDB (powered by Amazon; it’s contending in the NoSQL battlefield), Apache HBase (the best-known database implementation of Hadoop), Riak (made by the Basho samurais; it’s a powerful Erlang database), CouchBase, Apache CouchDB, MongoDB, Cloudant, and Redis.
The Container
Apache Mesos is a distributed systems kernel that is easy to build with and efficient to run. Mesos is an abstraction layer over all computer resources (CPU, memory, storage) on a cluster’s machines (physical or virtual), enabling elastic distributed systems and fault tolerance. Mesos applies the Linux kernel’s design principles at a higher level of abstraction. It was first presented as Nexus in 2009; in 2011, it was relaunched by Matei Zaharia under its current name. Mesos is the foundation of three frameworks:
Apache Aurora
Chronos
Marathon
In SMACK, Mesos orchestrates components and manages resources. It is the secret to horizontal cluster scaling. Apache Mesos is usually compared with Kubernetes (the competitor used by the Google Cloud Platform) and with Docker (as you will see, more a complement to Mesos than a competitor). The Hadoop equivalent is Apache YARN.
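Mesos’ two-level scheduling — the master offers spare resources, and each framework decides what to launch on them — can be sketched as follows. The agent names and resource numbers are invented for illustration:

```python
# Toy version of two-level scheduling: level 1 offers, level 2 accepts.
agents = {"agent-1": {"cpus": 4, "mem": 8192},
          "agent-2": {"cpus": 2, "mem": 4096}}

def make_offers():
    """Level 1: the master advertises each agent's unused resources."""
    return [(name, dict(res)) for name, res in agents.items()]

def framework_accept(offers, need_cpus, need_mem):
    """Level 2: the framework picks an offer that fits its task and the
    accepted resources are deducted from the agent's free pool."""
    for agent, res in offers:
        if res["cpus"] >= need_cpus and res["mem"] >= need_mem:
            agents[agent]["cpus"] -= need_cpus
            agents[agent]["mem"] -= need_mem
            return agent
    return None

chosen = framework_accept(make_offers(), need_cpus=3, need_mem=2048)
print(chosen)  # 'agent-1' -- the only agent with 3 free CPUs
```

The design choice this illustrates: the kernel (Mesos) stays policy-free and only brokers resources, while placement decisions live in the frameworks (Marathon, Chronos, Spark itself), which is what lets many frameworks share one cluster.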
Summary
This chapter, like the previous one, was full of theory. We reviewed the fundamental SMACK diagram as well as Spark’s advantages over traditional big data technologies such as Hadoop and MapReduce. We also visited every technology in the SMACK stack, briefly presented each tool’s potential, and, most importantly, discussed the actual alternatives to each technology. The upcoming chapters go into greater depth on each of these technologies, exploring the connectors, integration practices, and linking techniques, as well as describing alternatives for every situation.