Raul Estrada and Isaac Ruiz
Raul Estrada
Mexico City, Mexico
Isaac Ruiz
Mexico City, Mexico
Any source code or other supplementary materials referenced by the author in this text are available to readers at www.apress.com. For detailed information about how to locate your book’s source code, go to www.apress.com/source-code/.
ISBN 978-1-4842-2174-7
e-ISBN 978-1-4842-2175-4
DOI 10.1007/978-1-4842-2175-4
Library of Congress Control Number: 2016954634
© Raul Estrada and Isaac Ruiz 2016
Big Data SMACK: A Guide to Apache Spark, Mesos, Akka, Cassandra, and Kafka
Managing Director: Welmoed Spahr
Acquisitions Editor: Susan McDermott
Developmental Editor: Laura Berendson
Technical Reviewer: Rogelio Vizcaino
Editorial Board: Steve Anglin, Pramila Balen, Laura Berendson, Aaron Black, Louise Corrigan, Jonathan Gennick, Robert Hutchinson, Celestin Suresh John, Nikhil Karkal, James Markham, Susan McDermott, Matthew Moodie, Natalie Pao, Gwenan Spearing
Coordinating Editor: Rita Fernando
Copy Editor: Kim Burton-Weisman
Compositor: SPi Global
Indexer: SPi Global
Cover Image: Designed by Harryarts - Freepik.com
For information on translations, please e-mail [email protected], or visit www.apress.com.
Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales–eBook Licensing web page at www.apress.com/bulk-sales.
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
I dedicate this book to my mom and all the masters out there.
—Raúl Estrada
For all Binnizá people.
—Isaac Ruiz
Surveys from 2014, 2015, and 2016 show that, among all software developers, the highest paid are data engineers, data scientists, and data architects.
This is because there is a huge demand for technical professionals in data; unfortunately for large organizations and fortunately for developers, the supply is very low.
Traditionally, large volumes of information have been handled by specialized scientists and people with PhDs from the most prestigious universities. This is due to the popular belief that not all of us have access to large volumes of corporate data or to large enterprise production environments.
Apache Spark is disrupting the data industry for two reasons. The first is that it is an open source project. In the last century, companies like IBM, Microsoft, SAP, and Oracle were the only ones capable of handling large volumes of data, and today there is so much competition among them that disseminating designs or platform algorithms is strictly forbidden. Thus, the benefits of open source become stronger, because the contributions of so many people make free tools more powerful than proprietary ones.
The second reason is that you do not need a production environment with large volumes of data or large laboratories to develop in Apache Spark. Apache Spark can be installed easily on a laptop, and the work developed there can be moved easily to enterprise environments with large volumes of data. Apache Spark also makes data development free and accessible to startups and small companies.
If you are reading this book, it is for one of two reasons: either you want to be among the best-paid IT professionals, or you already are and you want to learn how today’s trends will become requirements in the not-too-distant future.
In this book, we explain how to master the SMACK stack, which is also called Spark++, because it seems to be the open stack that will most likely succeed in the near future.
We want to say thanks to our acquisitions editor, Susan McDermott, who believed in this project from the beginning; without her help, it would not have started.
We also thank Rita Fernando and Laura Berendson; without their effort and patience, it would not have been possible to write this book.
We want to thank our technical reviewer, Rogelio Vizcaino; without him, the project would not have been a success.
We also want to thank all the heroes who contribute to open source projects, specifically Spark, Mesos, Akka, Cassandra, and Kafka, with special recognition to those who develop the open source connectors between these technologies.
We also thank all the people who have educated us and shown us the way throughout our lives.
Raul Estrada has been a programmer since 1996 and a Java developer since 2000. He loves functional languages like Elixir, Scala, Clojure, and Haskell. With more than 12 years of experience in high availability and enterprise software, he has designed and implemented architectures since 2003. He has been an enterprise architect for BEA Systems and Oracle Inc., but he also enjoys mobile programming and game development. Now he is focused on open source projects related to data pipelining like Apache Spark, Apache Kafka, Apache Flink, and Apache Beam.
Isaac Ruiz has been a Java programmer since 2001, and a consultant and an architect since 2003. He has participated in projects in different areas and varied scopes (education, communications, retail, and others). He specializes in systems integration, particularly in the financial sector. Ruiz is a supporter of free software and he likes to experiment with new technologies (frameworks, languages, and methods).
Rogelio Vizcaino has been programming professionally for ten years, and hacking a little longer than that. Currently he is a JEE and solutions architect on a consultancy basis for one of the major banking institutions in his country. Educated as an electronic systems engineer, he considers performance and footprint to be more than “desirable treats” in software. Ironically, the once disliked tasks in database maintenance became his mainstay skills through much effort in the design and development of both relational and non-relational databases since the start of his professional practice, along with the good luck of finding great masters to work with during the journey. With most of his experience in the enterprise financial sector, Vizcaino’s heart is with the Web. He keeps track of web-related technologies and standards, where he discovered the delights of programming back in the late 1990s. Vizcaino considers himself a programmer before an architect, engineer, or developer; “programming” is an all-encompassing term and should be used with pride. Above all, he likes to learn, to create new things, and to fix broken ones.