In Chapter 1, we answered the Why. In this chapter, we answer the How. Once you understand the Why, arriving at the How is only a matter of time.
This chapter covers the following topics:
Traditional vs. modern (big) data
SMACK in a nutshell
Spark, the engine
Mesos, the container
Akka, the model
Cassandra, the storage
Kafka, the broker
Traditional vs. Modern (Big) Data
Is time quantized? Is there an indivisible unit of time that cannot be divided further? Until now, the honest answer to these questions has been “Nobody knows.” The only certainty is that on a human scale, life doesn’t happen in batch mode.
Many systems are monitoring a continuous stream of events: weather events, GPS signals, vital signs, logs, device metrics…. The list is endless. The natural way to collect and analyze this information is as a stream of data.
Handling data as streams is the correct way to model this behavior, but until recently, this methodology was very difficult to do well. Message rates used to be in the range of thousands of messages per second; the new technologies discussed in this book deliver rates of millions of messages per second.
The point is this: streaming data is no longer the preserve of very specialized computer science projects; stream-based data is becoming the rule for data-driven companies.
Table 2-1 compares the three approaches: traditional data, traditional big data, and modern big data.
Table 2-1. Traditional Data, Traditional Big Data, and Modern Big Data Approaches
| Concept | Traditional Data | Traditional Big Data | Modern Big Data |
| --- | --- | --- | --- |
| Person | IT oriented | IT oriented | Business oriented |
| Roles | Developer; data architect | Data engineer; data scientist | Business user |
| Data sources | Relational; files; message queues | Relational; files; message queues; data service | Relational; files; message queues; data service; NoSQL |
| Data processing | Application server; ETL | Application server; ETL; Hadoop | Application server; ETL; Hadoop; Spark |
| Metadata | Limited by IT | Limited by model | Automatically generated; context enriched; business oriented; dictionary based |
| User interface | Self-made; developer skills required | Self-made; developer skills required | Self-made; built by business users; tools guided |
| Use cases | Data migration; data movement; replication | Data lakes; data hubs; data warehouse offloading | Self-service; Internet of Things; Data as a Service |
| Open source technologies | Fully embraced | Minimal | TCO rules |
| Tools maturity | High; enterprise | Medium; enterprise | Low; evolving |
| Business agility | Low | Medium | Extremely high |
| Automation level | Low | Medium | High |
| Governance | IT governed | Business governed | End-user governed |
| Problem resolution | Solved by IT personnel | Solved by IT personnel | Timely or die |
| Collaboration | Medium | Low | Extremely high |
| Productivity / time to market | Slower | Slower | Highly productive; faster time to market |
| Integration analysis | Minimal | Medium | Modeled by analytical transformations |
| Real time | Minimal real time | Minimal real time | In real time or die |
| Data access | Primarily batch | Batch | Micro batch |
Modern technologies and architectures allow you to build systems more easily and efficiently, and to produce a better model of the way business processes take place. We will explain the real value of a streaming architecture. The possibilities are vast.
Apache Spark is not a replacement for Hadoop. Spark is a computing engine, whereas Hadoop is a complete stack for storage, cluster management, and computing tools. Spark runs well over Hadoop.
Hadoop is a ten-year-old technology. Today, we see many new deployments that are not on Hadoop, including deployments on NoSQL stores (like Cassandra) and deployments directly against cloud storage (e.g., Amazon S3). In this respect, Spark is reaching a broader audience than Hadoop.
SMACK in a Nutshell
If you poll several IT people, most will agree on a few things, including that we are always searching for a new acronym.
SMACK, as you already know, stands for Spark, Mesos, Akka, Cassandra, and Kafka. They are all open source technologies, and all are Apache software projects except Akka. The SMACK acronym was coined by Mesosphere, a company that, in collaboration with Cisco, bundled these technologies in a product called Infinity, designed to solve big data challenges where streaming is fundamental.1
Big data architecture is required in the daily operation of many companies, but most sources discuss each technology separately. Let’s discuss the full stack and how to integrate its pieces.
This book is a cookbook on how to integrate each technology into the most successful big data stack. We talk about the five main concepts of big data architecture and how to integrate/replace/reinforce every technology:
Spark: The engine
Mesos: The container
Akka: The model
Cassandra: The storage
Kafka: The message broker
Figure 2-1 represents the reference diagram for the whole book.
Figure 2-1. SMACK at a glance
Apache Spark vs. MapReduce
MapReduce is a programming model for processing large data sets with a parallel and distributed algorithm on a cluster.
As we will see later, functional programming has two basic methods: map(), which is dedicated to filtering and sorting, and reduce(), which is dedicated to performing an operation on the results. As an analogy, to serve a group of people at a service window, you must first queue them (map) and then attend to them (reduce).
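The queue-and-serve analogy can be sketched in a few lines of Python; the customer names and the one-minute-per-letter service time are invented purely for illustration:

```python
from functools import reduce

# Hypothetical queue of customers (names invented for the example).
customers = ["Ana", "Bob", "Carla", "Dan"]

# map(): "queue" each person by turning them into a unit of work -- here,
# a service time in minutes (arbitrarily, one minute per letter of the name).
service_times = list(map(len, customers))          # [3, 3, 5, 3]

# reduce(): "attend" the queue by folding the units into a single result,
# the total time the service window stays busy.
total_minutes = reduce(lambda a, b: a + b, service_times)

print(total_minutes)  # 14
```

The same two-phase shape (transform every record independently, then aggregate) is exactly what MapReduce distributes across a cluster.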
The term MapReduce was coined in 1995, when the Message Passing Interface was used to solve programming issues, as we will discuss later. Obviously, when Google made its implementation, it had one main use case in mind: web search.
It is important to note that Hadoop was born in 2006 and grew up in an environment where MapReduce reigned. MapReduce was born with two characteristics that marked its life: high latency and batch mode; together, they make it incapable of meeting modern challenges.
As you can see in Table 2-2, Spark is different.
Table 2-2. Apache Spark /MapReduce Comparison
| Concept | Apache Spark | MapReduce |
| --- | --- | --- |
| Written in | Scala/Akka | Java |
| Languages supported | Java, Scala, Python, and R are first-class citizens. | Everything must be written in Java. |
| Storage model | Keeps things in memory. | Keeps things on disk. Takes a long time to write things to disk and read them back, making it slow and laborious. |
| I/O model | Keeps things in memory without I/O. Operates on the same data quickly. | Requires a lot of I/O activity against disk. |
| Recovery | Reruns the same task in seconds or minutes; restarting is not a problem. | Records everything on disk, allowing restart after failure. |
| Knowledge | The abstraction is high; coding is intuitive. | You can write MapReduce jobs intelligently, avoiding overuse of resources, but it requires specialized knowledge of the platform. |
| Focus | Code describes how to process data; implementation details are hidden. | Apache Hive programming goes into code to avoid running too many MapReduce jobs. |
| Efficiency | Abstracts the implementation to run it as efficiently as possible. | Programmers write complex code to optimize each MapReduce job. |
| Abstraction | Abstracts things like a good high-level programming language; it is a powerful and expressive environment. | Code is hard to maintain over time. |
| Libraries | Adds libraries for machine learning, streaming, graph manipulation, and SQL. | Programmers need third-party tools and libraries, which makes work complex. |
| Streaming | Real-time stream processing out of the box. | Frameworks like Apache Storm are needed, at increased complexity. |
| Source code size | Scala programs have dozens of lines of code (LOC). | Java programs have hundreds of LOC. |
| Machine learning | Spark MLlib | To do machine learning, you have to separately integrate Mahout, H2O, or Oryx, and learn how each works and how to build on it. |
| Graphs | Spark GraphX | To work with graph databases, you have to select from Giraph, TitanDB, Neo4j, or other technologies; integration is not seamless. |
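The in-memory, lazily evaluated pipeline that Table 2-2 attributes to Spark can be made concrete with a toy Python class of our own. This is not Spark’s API — just a sketch of the idea that transformations are recorded cheaply and executed only when an action asks for a result:

```python
class ToyRDD:
    """A toy stand-in for a Spark RDD: map/filter are recorded lazily,
    and nothing runs until an action (collect) is called."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []                      # pending transformations

    def map(self, f):                              # lazy: remember the function
        return ToyRDD(self._data, self._ops + [("map", f)])

    def filter(self, p):                           # lazy: remember the predicate
        return ToyRDD(self._data, self._ops + [("filter", p)])

    def collect(self):                             # action: run the pipeline in memory
        out = self._data
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

lengths = ToyRDD(["spark", "mesos", "akka"]).map(len).filter(lambda n: n > 4)
print(lengths.collect())  # [5, 5]
```

Because the pipeline stays in memory between steps, there is no disk round-trip between `map` and `filter` — the contrast with MapReduce’s write-to-disk-between-jobs model in the table above.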
Apache Spark has these advantages:
Spark speeds up application development by 10 to 100 times and makes applications portable and extensible.
Scala can read Java code. Java code can be rewritten in Scala in a much smaller form factor that is much easier to read, repurpose, and maintain.
When the Apache Spark core is improved, all the machine learning and graphs libraries are improved too.
Integration is easier: the applications are easier to maintain and costs go down.
If an enterprise bets on one foundation, Spark is the best choice today.
Databricks (a company founded by the Apache Spark creators) lists the following use cases for Spark:
ETL and data integration
Business intelligence and interactive analytics
Advanced analytics and machine learning
Batch computation for high performance
Real-time stream processing
Some of the new use cases are simply old use cases done faster, while others are totally new. There are scenarios that just can’t be done with acceptable performance on MapReduce.
The Engine
It is important to recall that Spark is better at OLAP (online analytical processing), that is, batch jobs and data mining. Spark is not suitable for OLTP (online transaction processing), such as numerous atomic transactions; for that type of processing, we strongly recommend Erlang (a beautiful language inspired by the actor model).
Apache Spark has five main components:
Spark Core
Spark SQL
Spark Streaming
Spark MLlib
Spark GraphX
Each Spark library typically has an entire book dedicated to it. In this book, we tackle only the Apache Spark essentials needed to meet the SMACK stack.
The role of Apache Spark on the SMACK stack is to act as the processor and provide real-time data analysis. It addresses the aggregation and analysis layers.
There are few open source alternatives to Spark. As we’ve mentioned, Apache Hadoop is the classic approach. The strongest modern rival is Apache Flink, a project worth keeping in mind.
The Model
Akka is a model, a toolkit, and a runtime for building distributed, resilient, highly concurrent, message-driven applications on the Java Virtual Machine. The Akka toolkit was released as open source in 2009. Language bindings exist for both Java and Scala. To understand the Spark architecture, we first need to analyze Akka. Akka was designed according to the actor concurrency model:
Actors are arranged hierarchically
Asynchronous message (data) passing
Fault tolerant
Customizable failure and detection strategies
Hierarchical supervision
Adaptive, predictive
Parallelized
Load balance
There are many Akka competitors; we make special mention of Reactor. The actor model is the foundation of many frameworks and languages. The main languages associated with it (most of them functional languages) are Lisp, Scheme, Erlang, and Haskell, and more recently Scala, Clojure, F#, and Elixir (a modern language on the Erlang virtual machine).
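The core of the actor model — private state plus a mailbox processed one message at a time — can be sketched in plain Python. Akka itself is a JVM toolkit; the class below is our own illustrative stand-in, not Akka’s API:

```python
import queue
import threading

class CounterActor:
    """A minimal actor: private state and a mailbox drained by a single
    thread, so messages are processed sequentially -- no locks needed."""

    def __init__(self):
        self.count = 0
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, msg):                  # asynchronous, fire-and-forget send
        self._mailbox.put(msg)

    def _run(self):
        while True:
            msg = self._mailbox.get()
            if msg == "stop":
                return
            self.count += msg             # state is touched by this thread only

actor = CounterActor()
for n in (1, 2, 3):
    actor.tell(n)                         # senders never block on the actor
actor.tell("stop")
actor._thread.join()
print(actor.count)  # 6
```

A real Akka actor adds the hierarchy, supervision, and failure-handling strategies listed above; the point here is only the mailbox-plus-sequential-processing core.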
The Broker
Apache Kafka is a publish/subscribe message broker rethought as a distributed commit log. In SMACK, Kafka is the data ingestion point, mainly on the application layer. Kafka takes data from applications and streams and moves it into the stack. Kafka is a distributed messaging system with high throughput; it handles massive data loads and floods. It is the valve that regulates the pressure.
Apache Kafka ingests the incoming data volume, which is fundamental for partitioning and distribution among the cluster nodes. Apache Kafka’s features include the following:
Automatic broker failover
Very high performance distributed messaging
Partitioning and distribution across the cluster nodes
Data pipeline decoupling
Support for a massive number of consumers
Massive data load handling
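The key-based partitioning behind these features can be sketched with an in-memory toy producer. The hash choice and the topic structure below are simplifications of ours, not Kafka’s actual implementation:

```python
import hashlib
from collections import defaultdict

def partition_for(key, num_partitions):
    """Pick a partition deterministically from the message key, the way a
    keyed producer does (the hash function here is only illustrative)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

topic = defaultdict(list)                 # partition id -> append-only log

def produce(key, value, num_partitions=3):
    p = partition_for(key, num_partitions)
    topic[p].append((key, value))         # append to that partition's log
    return p

# All messages with the same key land in the same partition, so per-key
# ordering is preserved even though partitions are consumed independently.
p1 = produce("sensor-a", "21.5")
p2 = produce("sensor-a", "21.7")
assert p1 == p2
```

Partitions are the unit of parallelism: each can live on a different broker and be read by a different consumer, which is how Kafka spreads a flood of messages across the cluster.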
Kafka is the champion among many competitors in message-oriented middleware (MOM). In the MQ family, these include ActiveMQ, ZeroMQ, IronMQ, and RabbitMQ. The most notable is RabbitMQ, which is written in Erlang.
The best-known alternative to Kafka is Apache Storm, which has a lot of integration with Apache Hadoop and is worth keeping in mind. Apache Kafka, however, is here to stay.
The Storage
Apache Cassandra is a distributed database. It is the perfect choice when you need to scale and need hyper-high availability with no sacrifice in performance. Cassandra was originally developed at Facebook in 2008 to handle large amounts of data, and it became a top-level Apache project in 2010. Cassandra handles the stack’s operational data. Cassandra can also be used to expose data to the application layer.
The following are the main features of Apache Cassandra:
Extremely fast and scalable
Multi data center, no single point of failure
Survives when multiple nodes fail
Easy to operate
Flexible data modeling
Automatic and configurable replication
Ideal for real-time ingestion
Has a great Apache based community
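Replica placement of the kind Cassandra performs can be sketched as a walk around a ring of nodes. This is heavily simplified — real Cassandra uses virtual nodes, configurable partitioners, and rack-aware strategies — and the node names are invented:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical cluster ring

def replicas(partition_key, replication_factor=3):
    """Map a key onto `replication_factor` distinct nodes by hashing it to a
    ring position and walking clockwise -- a simplified replica placement."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    start = h % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replication_factor)]

owners = replicas("user:42")
print(owners)   # three distinct nodes hold the row
```

With a replication factor of 3, the row survives two node failures, and any node can compute the owners locally from the key — no central lookup, hence no single point of failure.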
There are a lot of Cassandra competitors, including DynamoDB (powered by Amazon; it’s contending in the NoSQL battlefield), Apache HBase (the best-known database implementation of Hadoop), Riak (made by the Basho samurais; it’s a powerful Erlang database), CouchBase, Apache CouchDB, MongoDB, Cloudant, and Redis.
The Container
Apache Mesos is a distributed systems kernel that is easy to build with and efficient to run. Mesos is an abstraction layer over all computer resources (CPU, memory, storage) on a cluster’s machines (physical or virtual), enabling elastic distributed systems and fault tolerance. Mesos applies the Linux kernel’s design principles at a higher level of abstraction. It was first presented as Nexus in 2009; in 2011, it was relaunched by Matei Zaharia under its current name. Mesos is the foundation of three frameworks:
Apache Aurora
Chronos
Marathon
In SMACK, Mesos orchestrates components and manages resources. It is the secret to horizontal cluster scaling. Apache Mesos is usually compared with Kubernetes (the competitor used by the Google Cloud Platform) and with Docker (as you will see, more a complement to Mesos than a competitor). The Hadoop equivalent is Apache YARN.
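Mesos’ two-level scheduling — the master offers spare resources, and each framework decides what to launch on them — can be sketched as follows. The agent names and resource numbers are invented for illustration:

```python
# Toy version of two-level scheduling: level 1 offers, level 2 accepts.
agents = {"agent-1": {"cpus": 4, "mem": 8192},
          "agent-2": {"cpus": 2, "mem": 4096}}

def make_offers():
    """Level 1: the master advertises each agent's unused resources."""
    return [(name, dict(res)) for name, res in agents.items()]

def framework_accept(offers, need_cpus, need_mem):
    """Level 2: the framework picks an offer that fits its task and the
    accepted resources are deducted from the agent's free pool."""
    for agent, res in offers:
        if res["cpus"] >= need_cpus and res["mem"] >= need_mem:
            agents[agent]["cpus"] -= need_cpus
            agents[agent]["mem"] -= need_mem
            return agent
    return None

chosen = framework_accept(make_offers(), need_cpus=3, need_mem=2048)
print(chosen)  # 'agent-1' -- the only agent with 3 free CPUs
```

The design choice this illustrates: the kernel (Mesos) stays policy-free and only brokers resources, while placement decisions live in the frameworks (Marathon, Chronos, Spark itself), which is what lets many frameworks share one cluster.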
Summary
This chapter, like the previous one, was full of theory. We reviewed the fundamental SMACK diagram as well as Spark’s advantages over traditional big data technologies such as Hadoop and MapReduce. We also visited every technology in the SMACK stack, briefly presented each tool’s potential, and, most importantly, discussed the actual alternatives to each technology. The upcoming chapters go into greater depth on each of these technologies, exploring the connectors, integration practices, and linking techniques, as well as describing alternatives for every situation.