Raul Estrada and

Isaac Ruiz

Big Data SMACK

A Guide to Apache Spark, Mesos, Akka, Cassandra, and Kafka

Raul Estrada

Mexico City, Mexico

Isaac Ruiz

Mexico City, Mexico

Any source code or other supplementary materials referenced by the author in this text are available to readers at www.apress.com. For detailed information about how to locate your book’s source code, go to www.apress.com/source-code/.

ISBN 978-1-4842-2174-7

e-ISBN 978-1-4842-2175-4

DOI 10.1007/978-1-4842-2175-4

Library of Congress Control Number: 2016954634

© Raul Estrada and Isaac Ruiz 2016

Big Data SMACK: A Guide to Apache Spark, Mesos, Akka, Cassandra, and Kafka

Managing Director: Welmoed Spahr

Acquisitions Editor: Susan McDermott

Developmental Editor: Laura Berendson

Technical Reviewer: Rogelio Vizcaino

Editorial Board: Steve Anglin, Pramila Balen, Laura Berendson, Aaron Black, Louise Corrigan, Jonathan Gennick, Robert Hutchinson, Celestin Suresh John, Nikhil Karkal, James Markham, Susan McDermott, Matthew Moodie, Natalie Pao, Gwenan Spearing

Coordinating Editor: Rita Fernando

Copy Editor: Kim Burton-Weisman

Compositor: SPi Global

Indexer: SPi Global

Cover Image: Designed by Harryarts - Freepik.com

For information on translations, please e-mail [email protected], or visit www.apress.com.

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales–eBook Licensing web page at www.apress.com/bulk-sales .

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springer.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

I dedicate this book to my mom and all the masters out there.

—Raúl Estrada

For all Binnizá people.

—Isaac Ruiz

Introduction

Surveys conducted during 2014, 2015, and 2016 show that among all software developers, the best paid are the data engineers, the data scientists, and the data architects.

This is because there is a huge demand for technical professionals in data; unfortunately for large organizations, and fortunately for developers, the supply is very low.

Traditionally, large volumes of information have been handled by specialized scientists and people with PhDs from the most prestigious universities, which feeds the popular belief that not all of us have access to large volumes of corporate data or to large enterprise production environments.

Apache Spark is disrupting the data industry for two reasons. The first is that it is an open source project. In the last century, companies like IBM, Microsoft, SAP, and Oracle were the only ones capable of handling large volumes of data, and competition between them was so fierce that disseminating platform designs or algorithms was strictly forbidden. This is where open source shows its strength: the contributions of so many people make free tools more powerful than the proprietary ones.

The second reason is that you do not need a production environment with large volumes of data, or a large laboratory, to develop with Apache Spark. Apache Spark can be installed easily on a laptop, and the work done there can be exported just as easily to enterprise environments with large volumes of data. Apache Spark also makes data development free and accessible to startups and small companies.

If you are reading this book, it is for one of two reasons: either you want to be among the best-paid IT professionals, or you already are and you want to learn how today’s trends will become requirements in the not-too-distant future.

In this book, we explain how to master the SMACK stack, which is also called Spark++, because it seems to be the open stack most likely to succeed in the near future.

Acknowledgments

We want to say thanks to our acquisitions editor, Susan McDermott, who believed in this project from the beginning; without her help, it would not have started.

We also thank Rita Fernando and Laura Berendson; without their effort and patience, it would not have been possible to write this book.

We want to thank our technical reviewer, Rogelio Vizcaino; without him, the project would not have been a success.

We also want to thank all the heroes who contribute to open source projects, specifically Spark, Mesos, Akka, Cassandra, and Kafka, and we give special recognition to those who develop the open source connectors between these technologies.

We also thank all the people who have educated us and shown us the way throughout our lives.

Contents

  1. Part I: Introduction
    1. Chapter 1: Big Data, Big Challenges
      1. Big Data Problems
      2. Infrastructure Needs
      3. ETL
      4. Lambda Architecture
      5. Hadoop
      6. Data Center Operation
        1. The Open Source Reign
        2. The Data Store Diversification
      7. Is SMACK the Solution?
    2. Chapter 2: Big Data, Big Solutions
      1. Traditional vs. Modern (Big) Data
      2. SMACK in a Nutshell
      3. Apache Spark vs. MapReduce
      4. The Engine
      5. The Model
      6. The Broker
      7. The Storage
      8. The Container
      9. Summary
  2. Part II: Playing SMACK
    1. Chapter 3: The Language: Scala
      1. Functional Programming
        1. Predicate
        2. Literal Functions
        3. Implicit Loops
      2. Collections Hierarchy
        1. Sequences
        2. Maps
        3. Sets
      3. Choosing Collections
        1. Sequences
        2. Maps
        3. Sets
      4. Traversing
        1. foreach
        2. for
        3. Iterators
      5. Mapping
      6. Flattening
      7. Filtering
      8. Extracting
      9. Splitting
      10. Unicity
      11. Merging
      12. Lazy Views
      13. Sorting
      14. Streams
      15. Arrays
      16. ArrayBuffers
      17. Queues
      18. Stacks
      19. Ranges
      20. Summary
    2. Chapter 4: The Model: Akka
      1. The Actor Model
        1. Threads and Labyrinths
        2. Actors 101
      2. Installing Akka
      3. Akka Actors
        1. Actors
        2. Actor System
        3. Actor Reference
        4. Actor Communication
        5. Actor Lifecycle
        6. Starting Actors
        7. Stopping Actors
        8. Killing Actors
        9. Shutting down the Actor System
        10. Actor Monitoring
        11. Looking up Actors
        12. Actor Code of Conduct
      4. Summary
    3. Chapter 5: Storage: Apache Cassandra
      1. Once Upon a Time
        1. Modern Cassandra
      2. NoSQL Everywhere
      3. The Memory Value
        1. Key-Value and Column
      4. Why Cassandra?
        1. The Data Model
      5. Cassandra 101
        1. Installation
      6. Beyond the Basics
        1. Client-Server
        2. Other Clients
        3. Apache Spark-Cassandra Connector
        4. Installing the Connector
        5. Establishing the Connection
      7. More Than One Is Better
        1. cassandra.yaml
        2. Setting the Cluster
      8. Putting It All Together
    4. Chapter 6: The Engine: Apache Spark
      1. Introducing Spark
        1. Apache Spark Download
        2. Let’s Kick the Tires
        3. Loading a Data File
        4. Loading Data from S3
      2. Spark Architecture
        1. SparkContext
        2. Creating a SparkContext
        3. SparkContext Metadata
        4. SparkContext Methods
      3. Working with RDDs
        1. Standalone Apps
        2. RDD Operations
      4. Spark in Cluster Mode
        1. Runtime Architecture
        2. Driver
        3. Executor
        4. Cluster Manager
        5. Program Execution
        6. Application Deployment
        7. Running in Cluster Mode
        8. Spark Standalone Mode
        9. Running Spark on EC2
        10. Running Spark on Mesos
        11. Submitting Our Application
        12. Configuring Resources
        13. High Availability
      5. Spark Streaming
        1. Spark Streaming Architecture
        2. Transformations
        3. 24/7 Spark Streaming
        4. Checkpointing
        5. Spark Streaming Performance
      6. Summary
    5. Chapter 7: The Manager: Apache Mesos
      1. Divide et Impera (Divide and Rule)
      2. Distributed Systems
        1. Why Are They Important?
      3. It Is Difficult to Have a Distributed System
      4. Ta-dah!! Apache Mesos
      5. Mesos Framework
        1. Architecture
      6. Mesos 101
        1. Installation
        2. Teaming
      7. Let’s Talk About Clusters
        1. Apache Mesos and Apache Kafka
        2. Mesos and Apache Spark
        3. The Best Is Yet to Come
        4. Summary
    6. Chapter 8: The Broker: Apache Kafka
      1. Kafka Introduction
        1. Born in the Fast Data Era
        2. Use Cases
      2. Kafka Installation
        1. Installing Java
        2. Installing Kafka
        3. Importing Kafka
      3. Kafka in Cluster
        1. Single Node–Single Broker Cluster
        2. Single Node–Multiple Broker Cluster
        3. Multiple Node–Multiple Broker Cluster
        4. Broker Properties
      4. Kafka Architecture
        1. Log Compaction
        2. Kafka Design
        3. Message Compression
        4. Replication
      5. Kafka Producers
        1. Producer API
        2. Scala Producers
        3. Producers with Custom Partitioning
        4. Producer Properties
      6. Kafka Consumers
        1. Consumer API
        2. Simple Scala Consumers
        3. Multithread Scala Consumers
        4. Consumer Properties
      7. Kafka Integration
        1. Integration with Apache Spark
      8. Kafka Administration
        1. Cluster Tools
        2. Adding Servers
      9. Summary
  3. Part III: Improving SMACK
    1. Chapter 9: Fast Data Patterns
      1. Fast Data
        1. Fast Data at a Glance
        2. Beyond Big Data
        3. Fast Data Characteristics
        4. Fast Data and Hadoop
        5. Data Enrichment
        6. Queries
      2. ACID vs. CAP
        1. ACID Properties
        2. CAP Theorem
        3. Consistency
        4. CRDT
      3. Integrating Streaming and Transactions
        1. Pattern 1: Reject Requests Beyond a Threshold
        2. Pattern 2: Alerting on Predicted Trends Variation
        3. When Not to Integrate Streaming and Transactions
        4. Aggregation Techniques
      4. Streaming Transformations
        1. Pattern 3: Use Streaming Transformations to Avoid ETL
        2. Pattern 4: Connect Big Data Analytics to Real-Time Stream Processing
        3. Pattern 5: Use Loose Coupling to Improve Reliability
        4. Points to Consider
      5. Fault Recovery Strategies
        1. Pattern 6: At-Most-Once Delivery
        2. Pattern 7: At-Least-Once Delivery
        3. Pattern 8: Exactly-Once Delivery
      6. Tag Data Identifiers
        1. Pattern 9: Use Upserts over Inserts
        2. Pattern 10: Tag Data with Unique Identifiers
        3. Pattern 11: Use Kafka Offsets as Unique Identifiers
        4. When to Avoid Idempotency
        5. Example: Switch Processing
      7. Summary
    2. Chapter 10: Data Pipelines
      1. Data Pipeline Strategies and Principles
        1. Asynchronous Message Passing
        2. Consensus and Gossip
        3. Data Locality
        4. Failure Detection
        5. Fault Tolerance/No Single Point of Failure
        6. Isolation
        7. Location Transparency
        8. Parallelism
        9. Partition for Scale
        10. Replay for Any Point of Failure
        11. Replicate for Resiliency
        12. Scalable Infrastructure
        13. Share Nothing/Masterless
        14. Dynamo Systems Principles
      2. Spark and Cassandra
        1. Spark Streaming with Cassandra
        2. Saving Data
        3. Saving Datasets to Cassandra
        4. Saving a Collection of Tuples
        5. Saving a Collection of Objects
        6. Modifying CQL Collections
        7. Saving Objects of Cassandra User-Defined Types
        8. Converting Scala Options to Cassandra Options
        9. Saving RDDs as New Tables
      3. Akka and Kafka
      4. Akka and Cassandra
        1. Writing to Cassandra
        2. Reading from Cassandra
        3. Connecting to Cassandra
        4. Scanning Tweets
        5. Testing TweetScannerActor
      5. Akka and Spark
      6. Kafka and Cassandra
        1. CQL Types Supported
        2. Cassandra Sink
      7. Summary
    3. Chapter 11: Glossary
      1. ACID
      2. agent
      3. API
      4. BI
      5. big data
      6. CAP
      7. CEP
      8. client-server
      9. cloud
      10. cluster
      11. column family
      12. coordinator
      13. CQL
      14. CQLS
      15. concurrency
      16. commutative operations
      17. CRDTs
      18. dashboard
      19. data feed
      20. DBMS
      21. determinism
      22. dimension data
      23. distributed computing
      24. driver
      25. ETL
      26. exabyte
      27. exponential backoff
      28. failover
      29. fast data
      30. gossip
      31. graph database
      32. HDFS
      33. HTAP
      34. IaaS
      35. idempotence
      36. IMDG
      37. IoT
      38. key-value
      39. keyspace
      40. latency
      41. master-slave
      42. metadata
      43. NoSQL
      44. operational analytics
      45. RDBMS
      46. real-time analytics
      47. replication
      48. PaaS
      49. probabilistic data structures
      50. SaaS
      51. scalability
      52. shared nothing
      53. Spark-Cassandra Connector
      54. streaming analytics
      55. synchronization
      56. unstructured data
  4. Index

About the Authors and About the Technical Reviewer

About the Authors

Raul Estrada has been a programmer since 1996 and a Java developer since 2000. He loves functional languages like Elixir, Scala, Clojure, and Haskell. With more than 12 years of experience in high availability and enterprise software, he has designed and implemented architectures since 2003. He has been an enterprise architect for BEA Systems and Oracle Inc., but he also enjoys mobile programming and game development. He is now focused on open source projects related to data pipelining, such as Apache Spark, Apache Kafka, Apache Flink, and Apache Beam.

Isaac Ruiz has been a Java programmer since 2001, and a consultant and an architect since 2003. He has participated in projects in different areas and varied scopes (education, communications, retail, and others). He specializes in systems integration, particularly in the financial sector. Ruiz is a supporter of free software and he likes to experiment with new technologies (frameworks, languages, and methods).

About the Technical Reviewer

Rogelio Vizcaino has been programming professionally for ten years, and hacking a little longer than that. Currently, he is a JEE and solutions architect on a consultancy basis for one of the major banking institutions in his country. Educated as an electronic systems engineer, he considers performance and footprint more than “desirable traits” in software. Ironically, the once-disliked tasks of database maintenance became his mainstay skills, through much effort in the design and development of both relational and non-relational databases since the start of his professional practice, and through the good luck of finding great masters to work with during the journey. Although most of his experience is in the enterprise financial sector, Vizcaino’s heart is with the Web. He keeps track of web-related technologies and standards, where he discovered the delights of programming back in the late 1990s. Vizcaino considers himself a programmer before an architect, engineer, or developer; “programming” is an all-encompassing term and should be used with pride. Above all, he likes to learn, to create new things, and to fix broken ones.
