Preface

At this point in time, machine learning (ML) requires little introduction: it is both pervasive and transformative to businesses, non-profits, and scientific organizations. ML is built on data. We are all aware of the exponential growth of data collected each year, and the growing diversity of sources that generate this data. This book is about leveraging these massive data volumes to do ML. We call this machine learning at scale and define it on three pillars: building high-quality models on large to massive datasets, deploying them for scoring in diverse enterprise environments, and navigating multiple stakeholder concerns along the way. Here, scale refers not only to data volume but also to the enterprise context in which models are built and deployed. In this book, we will show you, in practical terms, how H2O overcomes the many challenges of performing ML at scale.

The book starts with a general overview of the challenges of performing ML at scale, and how the H2O framework overcomes these challenges while producing high-quality models and enterprise-grade deployments. From there, it transitions to advanced treatment of model-building techniques and model deployment patterns using H2O at Scale. We then look at its technological underpinnings from the perspective of multiple enterprise stakeholders who need to understand, deploy, and maintain this system, and show how this relates to data scientist activities and needs. We finish by showing how H2O at Scale can be implemented on its own or as part of the larger and richly featured H2O AI Cloud platform, where it takes on exciting new levels of ML possibilities and business value.

By the end of this book, you'll have the knowledge needed to build high-quality explainable ML models from massive datasets, deploy these models to a great diversity of enterprise systems, and assemble state-of-the-art ML solutions that achieve unique forms of business value.

Who this book is for

This book is written for data scientists, ML engineers, system administrators, enterprise architects, and curious technologists who want to build and deploy ML models at scale using H2O. Those already familiar with H2O will learn advanced model-building techniques and deployment patterns, as well as the details of how H2O works under the hood. Students with knowledge of ML but little or no work experience will gain an understanding of how it is performed in the world of large enterprises. Basic knowledge of ML is recommended, and an understanding of Python is needed to follow code examples.

What this book covers

Chapter 1, Opportunities and Challenges, sets the context of the book. We will first define ML at scale around three areas: building high-quality models on large to massive datasets, deploying them for scoring in diverse enterprise environments, and navigating multiple stakeholder concerns along the way. We will then recognize the vast business opportunities and execution challenges of ML in this context. In this light, you will be introduced to how H2O overcomes these challenges with its H2O-3, Sparkling Water, Enterprise Steam, and MOJO technologies that form the H2O at Scale framework.

Chapter 2, Platform Components and Key Concepts, overviews each H2O component by describing where it fits in the ML life cycle, what its key features are, and how it overcomes the challenges of ML at scale. We then distill several key concepts from this overview. The goal of this chapter is to provide you with a foundational knowledge of how H2O at Scale works before you learn how to implement it.

Chapter 3, Fundamental Workflow – Data to Deployable Model, shows the minimal steps needed to build and deploy models with the H2O at Scale framework. Think of this as a Hello World example, with each step explained. There are alternative ways to implement these steps, and they will be explored. At this point in the book, we will end our general overview and move on to advanced topics.

Chapter 4, H2O Model Building at Scale – Capability Articulation, starts our model-building focus and is of interest primarily to data scientists. In this chapter, we familiarize ourselves with H2O's extensive range of modeling capabilities, from data ingestion and manipulation to algorithms, model training, evaluation, and explainability techniques. Think of this chapter as the what of H2O model building, and the next chapters as an advanced treatment of the how and why.

Chapter 5, Advanced Model Building – Part I, introduces you to the advanced model-building topics that a data scientist considers when building enterprise-grade models. We discuss data-splitting options, compare modeling algorithms, present a two-stage grid-search strategy for hyperparameter optimization, introduce H2O AutoML for automatically fitting multiple algorithms to data, and investigate feature engineering options for improving model performance. By the end of this chapter, you should be able to build an enterprise-scale, optimized, and predictive model using one or more supervised learning algorithms available within H2O.

Chapter 6, Advanced Model Building – Part II, continues our advanced model-building topics by showing how to build H2O supervised learning models within an Apache Spark pipeline, reviewing H2O's unsupervised learning methods, discussing best practices for updating H2O models, and introducing requirements to ensure H2O model reproducibility.

Chapter 7, Understanding ML Models, outlines a set of capabilities within H2O for explaining ML models. Building a model that predicts well is not enough. A critical step before putting any model into production is understanding how it makes decisions. We discuss selecting appropriate model metrics, using multiple diagnostics to build trust in a model, and using global and local explanations with model performance metrics to choose the best among a set of candidate models. This includes an evaluation of tradeoffs between model performance, speed of scoring, and assumptions met in a candidate model.

Chapter 8, Putting It All Together, starts the way most data science projects do: with raw data and a general business objective. We refine both the data and the problem statement until the problem is relevant to the business and can be answered by the available data. We engineer a variety of features, creating and evaluating multiple candidate models until we arrive at a final model. We evaluate the final model and illustrate the preparation steps required for model deployment. The treatment in this chapter accurately reflects the job of a data scientist in the enterprise.

Chapter 9, Production Scoring and the H2O MOJO, starts our focus on model deployment. ML engineers, enterprise architects, software developers, and general technologists will be particularly interested in this chapter. You will become familiar with the strengths of H2O's MOJO as a scoring artifact, and how easily it can be deployed to a great diversity of enterprise systems. You will finish by writing a batch file scoring program that embeds a MOJO to demonstrate this flexibility.

Chapter 10, H2O Model Deployment Patterns, explores the many ways a MOJO can be deployed. You will first overview a diverse sampling of possible deployment patterns, and then drill down to implementation details of each. The patterns cover real-time streaming and batch scoring on a variety of specialized H2O scoring software, third-party integrations, and your own custom-built systems.

Chapter 11, The Administrator and Operations Views, starts our focus on enterprise stakeholder perspectives of ML at scale with H2O. Although the chapter focuses on enterprise stakeholder activities and concerns, data scientists are shown how these relate to their own activities. In this chapter, system administrators and operators will learn in detail how Enterprise Steam is configured, and how users are secured and managed so data scientists can self-provision environments in a governed way. We will also identify operations activities around maintaining and troubleshooting H2O workloads and components.

Chapter 12, The Enterprise Architect and Security Views, covers the enterprise architect and security perspectives of H2O at Scale components. You will understand in detail the implementation alternatives of H2O and how the components integrate, communicate, and deploy. You will see that the H2O at Scale framework can be deployed on its own or as a member of the much larger H2O AI Cloud, which we cover in the next chapter.

Chapter 13, Introducing H2O AI Cloud, overviews H2O.ai's full end-to-end ML life cycle platform. The H2O at Scale framework and everything covered in the book to this point is a smaller subset of the H2O AI Cloud. In this chapter, we will overview H2O AI components and their key features, including four specialized model-building engines, a full-featured MLOps and Feature Store, and an open source low-code SDK to build and integrate AI Apps and host them on an Appstore.

Chapter 14, H2O at Scale in a Larger Platform Context, finishes the book by taking everything we have learned and showing how the H2O at Scale framework acquires categorically new and exciting possibilities when used as a part of the H2O AI Cloud. We provide examples of these possibilities and then present a reference enterprise integration framework using H2O for you to imagine your own possibilities.

Appendix, Alternative Methods to Launch H2O Clusters, shows different ways you can create H2O environments to run the code samples in this book.

To get the most out of this book

To run code samples, you will need to set up an H2O environment. The Appendix shows you the three ways to do this. Software versions are described there.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

From a conceptual standpoint, we introduce broad ML concepts before relating how H2O implements them. Therefore, you should be able to understand discussions in the book from your larger framework of ML understanding.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Machine-Learning-at-Scale-with-H2O. If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/sDmtM.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The h2o.explain and h2o.explain_row methods bundle a set of explainability functions and visualizations for global and local explanations, respectively."

A block of code is set as follows:

from pysparkling import *
import h2o
# Create the H2OContext, or reuse the one that already exists
hc = H2OContext.getOrCreate()
hc

Any command-line input or output is written as follows:

PYSPARK_DRIVER_PYTHON="ipython" \
PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
bin/pysparkling

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Under the Admin menu in Flow, the top three options are Jobs, Cluster Status, and Water Meter."

Tips or Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you've read Machine Learning at Scale with H2O, we'd love to hear your thoughts! Please visit the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
